Bug 55177 - URI<->IRI conversion uses page encoding instead of UTF-8
Status: RESOLVED FIXED
Alias: None
Product: kdelibs
Classification: Frameworks and Libraries
Component: general
Version: unspecified
Platform: Debian testing Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Thiago Macieira
URL:
Keywords:
Duplicates: 89536 106195 106216 108384 110330 115245 115247 117518 123149
Depends on:
Blocks:
 
Reported: 2003-02-25 21:42 UTC by Martin J. Dürst
Modified: 2011-07-26 13:19 UTC

See Also:
Latest Commit:
Version Fixed In:


Attachments
OpenPGP digital signature (252 bytes, application/pgp-signature)
2007-05-16 19:47 UTC, Micah Cowan
solve bug problem (745 bytes, patch)
2007-05-22 10:40 UTC, stanv

Description Martin J. Dürst 2003-02-25 21:42:42 UTC
Version:            (using KDE 3.1)
Installed from:    Debian testing/unstable Packages
OS:          Linux

Konqueror does not behave correctly on the tests
linked from http://www.w3.org/2001/08/iri-test/.

In more detail, on a page that is encoded in
iso-8859-1 with an <a href=""> or <img src="">
that does include non-US-ASCII characters, these
characters are not converted to UTF-8 before being
hex-escaped, as described for HTML at
http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1
and in general in draft-duerst-iri-02.txt,
"Internationalized Resource Identifiers (IRIs)" (at
http://www.ietf.org/internet-drafts/draft-duerst-iri-02.txt).

It also makes Konqueror's behavior different from
the majority of the installed browser base, and
from the general direction of technology.
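
To make the difference concrete, here is a minimal sketch in standard C++. It is not Konqueror's code, and the function names are made up for illustration. It assumes the href has been read as raw bytes from an iso-8859-1 page and contrasts escaping those bytes directly with converting to UTF-8 first, as the HTML 4 note describes.

```cpp
#include <string>

// Append one byte in %HH form, using uppercase hex digits.
static void appendPercentByte(std::string &out, unsigned char byte)
{
    static const char hexdigits[] = "0123456789ABCDEF";
    out += '%';
    out += hexdigits[byte >> 4];
    out += hexdigits[byte & 0x0F];
}

// Behaviour reported in this bug: the page's bytes are hex-escaped as they are.
// (Hypothetical helper name.)
std::string escapeRawPageBytes(const std::string &latin1Href)
{
    std::string out;
    for (unsigned char b : latin1Href) {
        if (b < 0x80)
            out += static_cast<char>(b);
        else
            appendPercentByte(out, b);
    }
    return out;
}

// Behaviour asked for: convert to UTF-8 first, then hex-escape.
// A Latin-1 byte value equals its Unicode code point (U+0080..U+00FF),
// so every non-ASCII byte becomes exactly two UTF-8 bytes.
// (Hypothetical helper name.)
std::string escapeAsUtf8(const std::string &latin1Href)
{
    std::string out;
    for (unsigned char b : latin1Href) {
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            appendPercentByte(out, static_cast<unsigned char>(0xC0 | (b >> 6)));
            appendPercentByte(out, static_cast<unsigned char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

With the Latin-1 bytes of "Dürst" as input, the first function yields D%FCrst and the second D%C3%BCrst. The sketch leaves ASCII untouched and does not attempt the other escaping a real URL class performs.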
Comment 1 Waldo Bastian 2003-06-05 18:28:19 UTC
It would be easier if there were an indication that an IRI is in fact an IRI and not a
misguided URI. The web is full of misguided URIs, so we cannot afford to support
IRIs at their expense. The recommendation to first try a UTF-8 encoding and
then the page encoding is highly impractical for us.
Comment 2 Thiago Macieira 2003-06-05 23:02:58 UTC
I agree. IMO, the best method for handling these situations is to take the character
as it was read from the page, in the encoding in which it was sent: treat URIs as binary
8-bit data, exactly as we read them from the HTML page, i.e., with no conversion
whatsoever.

I can't be sure right now whether KDE behaves correctly; I'd have to check. But that would
introduce the same problem as the one we discussed in #56071: data that doesn't
translate into Unicode.
Comment 3 Thiago Macieira 2003-06-05 23:11:24 UTC
Just getting back: if the link reference contains %HH hex encoding, Konqueror
loads the names correctly. That is, the URIs are treated as binary 8-bit data.
However, the status line shows a Latin-1 decoding, which should not happen.

If, however, the reference contains non-ASCII characters that are invalid in the
page's charset, Konqueror chokes and attempts to load invalid URIs. In particular, it
starts to load URIs containing the UTF-8-encoded Unicode replacement character. But I
don't believe this specific problem to be of great importance, since the pages
containing those invalid code sequences will be invalid at the source as well.
Comment 4 Martin J. Dürst 2004-12-20 08:11:22 UTC
The IRI spec (currently at
http://www.w3.org/International/iri-edit/draft-duerst-iri-11.txt)
has been approved as an IETF Proposed Standard by the IESG (see
http://www1.ietf.org/mail-archive/web/ietf-announce/current/msg00752.html)
and will soon be issued as an RFC.

In short, the IRI spec defines how to convert non-ASCII characters in a
Web address to a fully standard URI, using UTF-8 and %HH-encoding.

Given that this is now approved, KDE code should move towards
implementing this without delay. In particular, this means that any
URIs that contain non-ASCII characters should be converted to
%HH-form for resolution using UTF-8. Using a legacy encoding is
(and always has been) a clearly totally non-standard way of
resolving Web addresses with non-ASCII characters.

Opera and Safari already have a good implementation of IRIs.
IE does so, too, with the exception of IDNs.

This is not about stuff that is invalid in the page's charset
(which makes the whole page invalid indeed), but about characters
that are valid in the page's charset.

Waldo Bastian talks about 'misguided URIs'. Well, that's what
they are: misguided. The best thing for the Web that can happen
with them is that they fail, then they'll disappear. No spec
ever said they'd work.
Comment 5 Thiago Macieira 2005-02-13 03:37:35 UTC
*** Bug 89536 has been marked as a duplicate of this bug. ***
Comment 6 Thiago Macieira 2005-05-25 05:20:15 UTC
*** Bug 106216 has been marked as a duplicate of this bug. ***
Comment 7 Nicolas Goutte 2005-05-25 13:52:29 UTC
See also bug #106195 (I let other people decide if it is the same problem or not).
Comment 8 Thiago Macieira 2005-05-25 13:58:32 UTC
*** Bug 106195 has been marked as a duplicate of this bug. ***
Comment 9 Thiago Macieira 2005-07-01 12:54:47 UTC
*** Bug 108384 has been marked as a duplicate of this bug. ***
Comment 10 Thiago Macieira 2005-08-09 07:17:05 UTC
*** Bug 110330 has been marked as a duplicate of this bug. ***
Comment 11 Thiago Macieira 2005-10-28 12:11:43 UTC
*** Bug 115247 has been marked as a duplicate of this bug. ***
Comment 12 Thiago Macieira 2005-10-28 12:11:51 UTC
*** Bug 115245 has been marked as a duplicate of this bug. ***
Comment 13 Thiago Macieira 2005-12-02 22:35:39 UTC
*** Bug 117518 has been marked as a duplicate of this bug. ***
Comment 14 Thiago Macieira 2006-03-07 20:23:15 UTC
*** Bug 123149 has been marked as a duplicate of this bug. ***
Comment 15 Micah Cowan 2007-05-16 10:58:54 UTC
I'd like to point out that currently the file:// scheme seems to be significantly more broken than the http:// scheme, as path components (not hostnames) that include non-ASCII chars within an http:// URL appear to be interpreted fine (provided that they are specified directly or via entity references, and not percents); whereas the same path component appearing in a file:// link is apparently translated into UTF-8, and then translated /back/ into the page's native encoding, resulting in a broken link.

Perhaps some of the bugs that have to do with this specific behavior, and have been marked as duplicates of this bug, should be considered separately? At least, it doesn't appear to be a general IRI problem so much as a problem with the file:// URI scheme interpretation, specifically.

https://bugs.launchpad.net/ubuntu/+source/kdebase/+bug/50213

Has much more detail on what I'm referring to. It doesn't reference the fact that those same links work if "file:///" is replaced with "http://foo/", but I've tested it. Compare http://micah.cowan.name/50213/test.html.latin1 with http://micah.cowan.name/50213/test.http.html.latin1 (not technically valid HTML; but the results hold true with conforming code as well).
Comment 16 Thiago Macieira 2007-05-16 13:22:34 UTC
I am aware of the issue. But, as far as I know, there's no proper solution for this problem. So, I'm letting the problem remain unsolved.

IRIs (which include local files) indicate that URLs are to be interpreted as a sequence of Unicode characters, encoded in UTF-8. That means that, if you write
  <a href="müller.html"> or even <a href="m&uuml;ller.html">
The URI fragment contains one non-ASCII Unicode codepoint (U+00FC, independent of the page's encoding). That is correctly translated to "m%C3%BCller.html". Let me repeat: *correctly* translated.
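
The point that the result is independent of the page's encoding can be sketched as follows. This is an illustration in standard C++, not the KURL implementation, and iriToUri is a hypothetical name. It assumes the HTML parser has already turned the attribute value into Unicode code points, so a directly typed ü and the entity &uuml; both arrive as U+00FC.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
static std::string utf8FromCodePoint(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: map an IRI reference (as code points) to a URI by
// percent-encoding the UTF-8 bytes of every non-ASCII character.
// ASCII characters are copied unchanged in this simplified sketch.
std::string iriToUri(const std::u32string &iri)
{
    static const char hexdigits[] = "0123456789ABCDEF";
    std::string uri;
    for (char32_t cp : iri) {
        if (cp < 0x80) {
            uri += static_cast<char>(cp);
            continue;
        }
        for (unsigned char byte : utf8FromCodePoint(cp)) {
            uri += '%';
            uri += hexdigits[byte >> 4];
            uri += hexdigits[byte & 0x0F];
        }
    }
    return uri;
}
```

Calling iriToUri(U"m\u00FCller.html") gives "m%C3%BCller.html", whatever charset the page was served in, because the charset was already consumed when the bytes were decoded to code points.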

One possible solution is to mandate that people write HTML pages referring to local files using %-encoding if their files aren't named in UTF-8. That would be suboptimal because the URLs in Konqueror's Location field don't exactly correspond to the directory names displayed down below. It would also break a lot of internal assumptions because filenames are kept internally as Unicode data and so are URLs. But creating URLs out of Unicode filenames requires decoding to the 8-bit format and recoding in UTF-8.

In other words, a code section like:
  QString path = QFile::decodeName("/foo/Vidéo");
  url.setPath(path);
would produce different encodings depending on whether the scheme (protocol) of "url" is "file". Worse, it also affects file-like protocols like "media", "system", "nfs".

So, in conclusion, I will not spend any effort fixing that problem. Switch to UTF-8 already. If that one doesn't work, I will fix it.

(This discussion doesn't affect IRIs)
Comment 17 Micah Cowan 2007-05-16 17:08:32 UTC
I absolutely understand that it should be correctly translated to m%C3%BCller.html. However, that URI, in turn, absolutely must be interpreted as müller.html. It is not being interpreted as such.

My (theoretical) file name /is/ named in UTF-8. However, that doesn't matter, because Konqueror is reinterpreting its own generated URI to be in an encoding other than UTF-8, which seems pretty broken to me. And, why should "file" interpret it as the page's encoding, when "http" interprets it as UTF-8? That is inconsistent, makes no sense, violates standards, and serves no purpose.

Konqueror is doing the mapping from IRI to URI correctly (though I fail to see why that mapping is even necessary: why not store it internally as an IRI, as I believe most implementations do?), but you are not mapping the URI back to an IRI correctly. This is why I'm puzzled that you claim that "there's no proper solution for this problem"; clearly, encodings should be preserved wherever possible.

And, if you are claiming that M%C3%BCller should be failing for some reason (note that Konqueror considers M%C3%BCller to be a link to Müller), then why does even M%FCller become Müller? That situation is clearly broken: it's not an IRI, but Konqueror still translates it into Unicode internally, and then /back/ into ISO-8859-1, completely in violation of standards and common sense.
Comment 18 Thiago Macieira 2007-05-16 17:20:57 UTC
I will take a look at this, but I believed the problem to be solved in KDE 4 (can't fix it in KDE 3).

Actually, I believed the whole IRI issue to be solved, so the fact that this bug is open probably indicates that it isn't.
Comment 19 Micah Cowan 2007-05-16 17:29:15 UTC
To be clear, I am using KDE 3; specifically Ubuntu's packaged KDE 3.5.6(-0ubuntu20.1).
Comment 20 Thiago Macieira 2007-05-16 19:42:10 UTC
And to be clear: KDE 3 cannot be fixed. Don't expect any patches.
Comment 21 Micah Cowan 2007-05-16 19:47:35 UTC
Thiago Macieira wrote:
> And to be clear: KDE 3 cannot be fixed. Don't expect any patches.


Understood, and thank you.


Created an attachment (id=20598)
OpenPGP digital signature
Comment 22 Martin J. Dürst 2007-05-20 10:07:15 UTC
The situation with file: is slightly different from http:. For http:, the main priority is that each URI and IRI works across the world, on paper and in electronic form. file: by its nature doesn't work across the whole world, only locally.

Abstractly, the right thing to do with a URI of M%C3%BCller.html is to find the file Müller.html. How this is actually done very much depends on the OS.
[In a similar way, how exactly M%C3%BCller.html is resolved by an http: server may depend on the OS, settings, etc. of the server (for ways to tweak that, see e.g.
http://www.w3.org/2003/06/mod_fileiri/).]

There are OSes where the file system works in terms of characters. MS Windows is an example (sorry, I know Konqueror is mainly or only Linux, but for file:, the MSW example helps). The NT-based versions use UTF-16(LE) internally, so what you need to do is to convert to UTF-16(LE) and then get that file via the wide-character file API. On a non-NT Win system, you have to convert to the system code page as far as you can and use the traditional API.

There are other OSes where traditionally, file names in the file system are just byte strings, with some exterior setting determining how these are viewed. The typical example here is Unix/Linux. The locale (LANG environment variable) determines how the bytes in the file system are viewed or interpreted. So one solution would be to use the character encoding part of the locale. This would work for all cases where the character encoding in all locales used on a box is the same. Many newer distributions come with a lot of UTF-8 locales, and that's the easiest case in many ways. It seems that whoever implemented "and then translated /back/ into the page's native encoding" made the assumption that on a system with locales with character encoding foo, all the local Web pages would also be encoded in foo. But that may or may not be true.
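
As a rough illustration of "the character encoding part of the locale", the sketch below pulls the codeset out of a locale name such as the value of LANG. The function name is hypothetical, and this is only string parsing: on POSIX systems LC_ALL and LC_CTYPE take precedence over LANG, and a program would normally ask the C library via nl_langinfo(CODESET) after calling setlocale() rather than parse the variable itself.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: extract the codeset from a locale name of the form
// language[_territory][.codeset][@modifier], e.g. "de_DE.UTF-8".
// Returns an empty string when the name carries no codeset (e.g. "C").
std::string charsetFromLocaleName(const std::string &locale)
{
    const std::size_t dot = locale.find('.');
    if (dot == std::string::npos)
        return std::string();

    // An optional "@modifier" may follow the codeset; cut it off.
    const std::size_t at = locale.find('@', dot);
    const std::size_t length =
        (at == std::string::npos) ? std::string::npos : at - dot - 1;
    return locale.substr(dot + 1, length);
}
```

For "de_DE.UTF-8" this returns "UTF-8", and for "de_DE.ISO-8859-1@euro" it returns "ISO-8859-1". Whether that codeset matches the bytes actually stored in the file system is exactly the assumption questioned above.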
Comment 23 Micah Cowan 2007-05-20 11:44:13 UTC
Note that, in my example, neither M%C3%BCller nor M%FCller works. It seems to me that, whichever of the two you choose, at least one of them ought to. If you choose to interpret them as UTF-8 characters, then they should be UTF-8 characters; if literal byte values (as, I understand, the very original URI specs may have intended), then literal byte values. But to treat it as encoded text that must be transcoded into UTF-8, and then reinterpreted again in the original encoding, could never be of any possible use to anyone.
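
The two consistent interpretations can be sketched in standard C++ as follows. The names are hypothetical and this is not how Konqueror is implemented. percentDecode() recovers the literal bytes, and looksLikeUtf8() is a structural check that tells whether those bytes can be read as UTF-8 at all. It does not reject every ill-formed sequence (some overlong forms and surrogates pass), so treat it as an approximation.

```cpp
#include <cstddef>
#include <string>

// Value of one hex digit, or -1 if the character is not a hex digit.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Replace every well-formed %HH triplet with the byte it denotes.
// Malformed escapes are copied through unchanged.
std::string percentDecode(const std::string &uri)
{
    std::string out;
    for (std::size_t i = 0; i < uri.size(); ++i) {
        if (uri[i] == '%' && i + 2 < uri.size()) {
            const int hi = hexValue(uri[i + 1]);
            const int lo = hexValue(uri[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;
                continue;
            }
        }
        out += uri[i];
    }
    return out;
}

// Structural UTF-8 check: valid lead bytes followed by the right number
// of continuation bytes (10xxxxxx).
bool looksLikeUtf8(const std::string &bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;
        if (lead < 0x80) extra = 0;
        else if (lead >= 0xC2 && lead <= 0xDF) extra = 1;
        else if (lead >= 0xE0 && lead <= 0xEF) extra = 2;
        else if (lead >= 0xF0 && lead <= 0xF4) extra = 3;
        else return false;

        if (extra > bytes.size() - i - 1)
            return false; // sequence is cut off at the end of the input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;
        }
        i += extra + 1;
    }
    return true;
}
```

Under the UTF-8 reading, M%C3%BCller decodes to bytes that pass the check and name Müller. M%FCller decodes to a lone 0xFC byte, which fails the check and so can only be meaningful as a literal byte value.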

Still, if it is indeed fixed in KDE 4, that is at least answer enough for me. At any rate, I don't personally use Konqueror, and am just trying to resolve an end-user issue for those that do. If there is no solution but to wait for the next version, then so be it.
Comment 24 Martin J. Dürst 2007-05-20 14:52:04 UTC
Can you please tell us the following: a) what's the character encoding of the HTML (or other) file that contains the file: IRI, and is that HTML file labeled clearly so that the browser actually gets that encoding right? b) What kind of file system are you using, and what's the encoding of the file name in the file system?
Comment 25 Thiago Macieira 2007-05-20 18:57:01 UTC
Update on Konqueror 4:

1) local files:
  Filesystem is in UTF-8. This will be the only configuration I will support.
  a) HTML entities: OK
  b) %-encoding the UTF-8 sequence: OK
  c) direct character (page in UTF-8): OK
  d) direct character (page in Latin 1): OK
  e) %-encoded non-UTF-8 byte sequence: NOK

2) http://www.w3.org/2001/08/iri-test/:
  a) img src: OK
  b) link D%FCrst: NOK
  c) link D%C3%BCrst: OK
     (status bar shows %-encoding → bug)
  d) IDNs: all OK
     (status bar shows punycode → bug;
      direct characters and %-encoded links not shown as visited → bug)

3) http://www.w3.org/International/tests/sec-idn-1.html:
  a) clicking the link: OK
  b) typing the address in the location bar: semi-OK
     (typing the address and clicking the Enter-like button works;
      pressing Enter doesn't)
  b.bis) middle-clicking the webpage with the address in the clipboard (i.e. paste the address into the page): OK

  In all cases, the Location bar shows the punycode address → bug.
Comment 26 Thiago Macieira 2007-05-20 19:20:11 UTC
The 1.e and 2.b NOKs above are caused by Qt bugs.

The defects where the status bar and Location bar show the ACE form are caused by KUrl::prettyUrl doing nothing useful.
Comment 27 Thiago Macieira 2007-05-21 19:02:19 UTC
SVN commit 667039 by thiago:

Prettify KUrl::prettyUrl(). This solves a defect with the Konqueror
UI, where ACE/Punycode URLs were being shown instead of the proper
Unicode strings.

CCBUG:55177


 M  +63 -13    kurl.cpp  


--- trunk/KDE/kdelibs/kdecore/io/kurl.cpp #667038:667039
@@ -891,23 +891,73 @@
   return QString::fromLatin1( toEncoded( trailing == RemoveTrailingSlash ? StripTrailingSlash : None ) ); // ## check encoding
 }
 
+static QString toPrettyPercentEncoding(const QString &input)
+{
+  QString result;
+  for (int i = 0; i < input.length(); ++i) {
+    QChar c = input.at(i);
+    register short u = c.unicode();
+    if (u < 0x20 || u == '?' || u == '#' || u == '%') {
+      static const char hexdigits[] = "0123456789ABCDEF";
+      result += QLatin1Char('%');
+      result += QLatin1Char(hexdigits[(u & 0xf0) >> 4]);
+      result += QLatin1Char(hexdigits[u & 0xf]);
+    } else {
+      result += c;
+    }
+  }
+
+  return result;
+}
+
 QString KUrl::prettyUrl( AdjustPathOption trailing ) const
 {
-  // Can't use toString(), it breaks urls with %23 in them (becomes '#', which is parsed back as a fragment)
-  // So prettyUrl is just url, with the password removed.
-  // TODO: we could consider a "toLocalFile or URL" behavior, now that the KUrl constructor can take local paths.
-  // We could replace some chars, like "%20" -> ' ', though?
-  if ( password().isEmpty() )
-    return url( trailing );
+  // reconstruct the URL in a "pretty" form
+  // a "pretty" URL is NOT suitable for data transfer. It's only for showing data to the user.
+  // however, it must be parseable back to its original state, since
+  // notably Konqueror displays it in the Location address.
 
-  QUrl newUrl( *this );
-  newUrl.setPassword( QString() );
-  if ( trailing == AddTrailingSlash && !path().endsWith( QLatin1Char('/') ) ) {
-      // -1 and 0 are provided by QUrl, but not +1.
-      newUrl.setPath( path() + QLatin1Char('/') );
-      return QString::fromLatin1( newUrl.toEncoded() );
+  // A pretty URL is the same as a normal URL, except that:
+  // - the password is removed
+  // - the hostname is shown in Unicode (as opposed to ACE/Punycode)
+  // - the pathname and fragment parts are shown in Unicode (as opposed to %-encoding)
+  QString result = scheme();
+  if (!result.isEmpty())
+    result += QLatin1String("://");
+
+  QString tmp = userName();
+  if (!tmp.isEmpty()) {
+    result += tmp;
+    result += QLatin1Char('@');
   }
-  return QString::fromLatin1( newUrl.toEncoded(  trailing == RemoveTrailingSlash ? StripTrailingSlash : None ) );
+
+  result += host();
+
+  if (port() != -1) {
+    result += QLatin1Char(':');
+    result += QString::number(port());
+  }
+
+  tmp = path();
+  result += toPrettyPercentEncoding(tmp);
+
+  // adjust the trailing slash, if necessary
+  if (trailing == AddTrailingSlash && !tmp.endsWith(QLatin1Char('/')))
+    result += QLatin1Char('/');
+  else if (trailing == RemoveTrailingSlash && tmp.length() > 1 && tmp.endsWith(QLatin1Char('/')))
+    result.chop(1);
+
+  if (hasQuery()) {
+    result += QLatin1Char('?');
+    result += QLatin1String(encodedQuery());
+  }
+
+  if (hasFragment()) {
+    result += QLatin1Char('#');
+    result += toPrettyPercentEncoding(fragment());
+  }
+
+  return result;
 }
 
 #if 0
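
For readers without a Qt tree at hand, the escaping rule of toPrettyPercentEncoding() in the commit above can be approximated in standard C++. This is a sketch, not the committed code: the function name is made up, and it operates on the bytes of a UTF-8 string instead of on QChar code units. One difference is deliberate. The patch stores each code unit in a signed short, which on typical platforms appears to make code units at or above 0x8000 negative and therefore caught by the u < 0x20 test; the sketch works on unsigned bytes to sidestep that.

```cpp
#include <string>

// Hypothetical standard-C++ approximation of the patch's helper:
// escape only control characters and the delimiters '?', '#' and '%',
// and leave everything else (including non-ASCII bytes) readable.
std::string prettyPercentEncode(const std::string &utf8Input)
{
    static const char hexdigits[] = "0123456789ABCDEF";
    std::string result;
    for (unsigned char u : utf8Input) {
        if (u < 0x20 || u == '?' || u == '#' || u == '%') {
            result += '%';
            result += hexdigits[u >> 4];
            result += hexdigits[u & 0x0F];
        } else {
            result += static_cast<char>(u);
        }
    }
    return result;
}
```

Keeping '?', '#' and '%' escaped is what lets the pretty form be parsed back to the original URL, which the commit's comments state as a requirement for the Location field.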
Comment 28 stanv 2007-05-22 10:40:11 UTC
Created attachment 20663 [details]
solve bug problem

Hello Thiago Macieira.

The patch that prettifies KUrl::prettyUrl() doesn't solve this bug.

Please see attached patch.

The main problem is in 
kdelibs-3.5.6/kdecore/kurl.cpp
in constructor:
KURL::KURL( const KURL& _u, const QString& _rel_url, int encoding_hint )

at: 599 :KURL tmp( url() + rUrl, encoding_hint);
		   /\
		   ||
		   ||
		   not encoded in encoding_hint

Have fun.
Comment 29 stanv 2007-05-22 10:45:47 UTC
sorry, missing bug number.
Sorry again.
Comment 30 Christoph Feck 2011-07-26 13:19:33 UTC
Should be fixed in KDE 4.6, if not, please reopen with an updated test case.