Summary: | URI<->IRI conversion uses page encoding instead of UTF-8 | ||
---|---|---|---|
Product: | [Unmaintained] kdelibs | Reporter: | Martin J. Dürst <duerst> |
Component: | general | Assignee: | Thiago Macieira <thiago> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | cfeck, dev, hephooey_dev, hhielscher, johann-nikolaus, konq-bugs, maarizwan, micahcowan, nadavkav, thiago |
Priority: | NOR | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | Debian testing | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: |
OpenPGP digital signature
solve bug problem |
Description
Martin J. Dürst
2003-02-25 21:42:42 UTC
It would be easier if there were an indication that an IRI is in fact an IRI and not a misguided URI. The web is full of misguided URIs, so we cannot afford to support IRIs at the expense of these URIs. The recommendation to first try a UTF-8 encoding and then the page encoding is highly impractical for us.

I agree. IMO, the best method for handling these situations is to take the characters as they were read from the page, in the encoding in which the page was sent. Treat URIs as binary 8-bit data, just as we read them from the HTML page, i.e., with no conversion whatsoever. I can't be sure right now whether KDE behaves correctly. I'd have to check.

But that would introduce the same problem as what we've discussed in #56071: data that doesn't translate into Unicode.

Just getting back: if the link reference contains %HH hex encoding, Konqueror loads the names correctly. That is, the URIs are treated as binary 8-bit data. However, the status line shows a Latin-1 decoding, which should not happen. If, however, the reference contains non-ASCII characters that are invalid in the page's charset, Konqueror chokes and attempts to load invalid URIs. In particular, it starts to load stuff containing the UTF-8 "replacement character". But I don't believe this specific problem to be of great importance, since the pages containing those invalid code sequences will be invalid at source as well.

The IRI spec (currently at http://www.w3.org/International/iri-edit/draft-duerst-iri-11.txt) has been approved as an IETF Proposed Standard by the IESG (see http://www1.ietf.org/mail-archive/web/ietf-announce/current/msg00752.html) and will soon be issued as an RFC. In short, the IRI spec defines how to convert non-ASCII characters in a Web address to a fully standard URI, using UTF-8 and %HH-encoding. Given that this is now approved, KDE code should move towards implementing this without delay.
In particular, this means that any URIs that contain non-ASCII characters should be converted to %HH-form for resolution using UTF-8. Using a legacy encoding is (and always has been) a clearly non-standard way of resolving Web addresses with non-ASCII characters. Opera and Safari already have a good implementation of IRIs. IE does so, too, with the exception of IDNs.

This is not about stuff that is invalid in the page's charset (which makes the whole page invalid indeed), but about characters that are valid in the page's charset.

Waldo Bastian talks about 'misguided URIs'. Well, that's what they are: misguided. The best thing that can happen to them, for the Web, is that they fail; then they'll disappear. No spec ever said they'd work.

*** Bug 89536 has been marked as a duplicate of this bug. ***
*** Bug 106216 has been marked as a duplicate of this bug. ***
See also bug #106195 (I'll let other people decide whether it is the same problem or not).
*** Bug 106195 has been marked as a duplicate of this bug. ***
*** Bug 108384 has been marked as a duplicate of this bug. ***
*** Bug 110330 has been marked as a duplicate of this bug. ***
*** Bug 115247 has been marked as a duplicate of this bug. ***
*** Bug 115245 has been marked as a duplicate of this bug. ***
*** Bug 117518 has been marked as a duplicate of this bug. ***
*** Bug 123149 has been marked as a duplicate of this bug. ***

I'd like to point out that currently the file:// scheme seems to be significantly more broken than the http:// scheme. Path components (not hostnames) that include non-ASCII chars within an http:// URL appear to be interpreted fine (provided that they are specified directly or via entity references, and not percents), whereas the same path component appearing in a file:// link is apparently translated into UTF-8 and then translated /back/ into the page's native encoding, resulting in a broken link.
Perhaps some of the bugs that have to do with this specific behavior, and have been marked as duplicates of this bug, should be considered separately? At least, it doesn't appear to be a general IRI problem so much as a problem with the file:// URI scheme interpretation, specifically.

https://bugs.launchpad.net/ubuntu/+source/kdebase/+bug/50213 has much more detail on what I'm referring to. It doesn't mention the fact that those same links work if "file:///" is replaced with "http://foo/", but I've tested it. Compare http://micah.cowan.name/50213/test.html.latin1 with http://micah.cowan.name/50213/test.http.html.latin1 (not technically valid HTML; but the results hold true with conforming code as well).

I am aware of the issue. But, as far as I know, there's no proper solution for this problem. So, I'm letting the problem remain unsolved.

IRIs (which include local files) indicate that URLs are to be interpreted as a sequence of Unicode characters, encoded in UTF-8. That means that, if you write
    <a href="müller.html">
or even
    <a href="m&uuml;ller.html">
the URI fragment contains one non-ASCII Unicode codepoint (U+00FC, independent of the page's encoding). That is correctly translated to "m%C3%BCller.html". Let me repeat: *correctly* translated.

One possible solution is to mandate that people write HTML pages referring to local files using %-encoding if their files aren't named in UTF-8. That would be suboptimal because the URLs in Konqueror's Location field wouldn't exactly correspond to the directory names displayed down below. It would also break a lot of internal assumptions, because filenames are kept internally as Unicode data and so are URLs. But creating URLs out of Unicode filenames requires decoding to the 8-bit format and recoding in UTF-8. In other words, a code section like:

    QString path = QFile::decodeName("/foo/Vidéo");
    url.setPath(path);

would produce different encodings depending on whether the scheme (protocol) of "url" is "file".
Worse, it also affects file-like protocols like "media", "system", "nfs". So, in conclusion, I will not spend any effort fixing that problem. Switch to UTF-8 already. If that one doesn't work, I will fix it. (This discussion doesn't affect IRIs.)

I absolutely understand that it should be correctly translated to m%C3%BCller.html. However, that URI, in turn, absolutely must be interpreted as Müller. It is not being interpreted as such. My (theoretical) file name /is/ named in UTF-8. However, that doesn't matter, because Konqueror is reinterpreting its own generated URI to be in an encoding other than UTF-8, which seems pretty broken to me. And why should "file" interpret it as the page's encoding, when "http" interprets it as UTF-8? That is inconsistent, makes no sense, violates standards, and serves no purpose.

Konqueror is doing the mapping from IRI to URI correctly (though I fail to see why that mapping is even necessary: why not store it internally as an IRI, as I believe most implementations do?), but you are not mapping the URI back to an IRI correctly. This is why I'm puzzled that you claim that "there's no proper solution for this problem"; clearly, encodings should be preserved wherever possible. And, if you are claiming that M%C3%BCller should be failing for some reason (note that Konqueror considers M%C3%BCller to be a link to Müller), then why does even M%FCller become Müller? That situation is clearly broken: it's not an IRI, but Konqueror still translates it into Unicode internally, and then /back/ into ISO-8859-1, completely in violation of standards and common sense.

I will take a look at this, but I believed the problem to be solved in KDE 4 (can't fix it in KDE 3). Actually, I believed the whole IRI issue to be solved, so the fact that this bug is open probably indicates that it isn't.

To be clear, I am using KDE 3; specifically Ubuntu's packaged KDE 3.5.6(-0ubuntu20.1).

And to be clear: KDE 3 cannot be fixed. Don't expect any patches.
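The two spellings argued over in the comments above, M%C3%BCller and M%FCller, differ only in which byte sequence was %HH-escaped: the UTF-8 bytes of U+00FC or its single Latin-1 byte. As a rough sketch of that difference (plain standard C++, not KDE or Qt code; the names percentEncode, percentDecode, isValidUtf8 and checkThreadExamples are made up for illustration, and only non-ASCII bytes are escaped, which is an assumption of this sketch):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Escape every non-ASCII byte as %HH. The input is raw bytes: the caller
// decides whether they came from UTF-8 or from a legacy page encoding.
std::string percentEncode(const std::string& bytes) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (b >= 0x80) {
            out += '%';
            out += hex[b >> 4];
            out += hex[b & 0x0F];
        } else {
            out += static_cast<char>(b);
        }
    }
    return out;
}

// Turn %HH escapes back into raw bytes; malformed escapes are copied as-is.
std::string percentDecode(const std::string& uri) {
    auto hexValue = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    };
    std::string out;
    for (std::size_t i = 0; i < uri.size(); ++i) {
        if (uri[i] == '%' && i + 2 < uri.size()) {
            int hi = hexValue(uri[i + 1]);
            int lo = hexValue(uri[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;
                continue;
            }
        }
        out += uri[i];
    }
    return out;
}

// True if the bytes are well-formed UTF-8 (no overlong forms, no surrogates,
// nothing above U+10FFFF).
bool isValidUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        unsigned long cp;
        if (b < 0x80) { extra = 0; cp = b; }
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; }
        else return false;
        if (i + extra >= s.size()) return false;  // sequence is cut short
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (extra == 2 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return false;
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        i += extra + 1;
    }
    return true;
}

// The examples from this thread, written with explicit byte escapes.
void checkThreadExamples() {
    assert(percentEncode("m\xC3\xBCller.html") == "m%C3%BCller.html");  // UTF-8 bytes
    assert(percentEncode("m\xFCller.html") == "m%FCller.html");         // Latin-1 byte
    assert(isValidUtf8(percentDecode("M%C3%BCller")));
    assert(!isValidUtf8(percentDecode("M%FCller")));
}
```

Under these assumptions, only the first spelling decodes to valid UTF-8, so a consumer that wants to show the link as Unicode text can decode it without consulting the page's charset; the second is just bytes whose meaning depends on an encoding the URI itself does not carry.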
Thiago Macieira wrote:
> And to be clear: KDE 3 cannot be fixed. Don't expect any patches.
Understood, and thank you.
Created an attachment (id=20598)
OpenPGP digital signature
The situation with file: is slightly different from http:. For http:, the main priority is that each URI and IRI works across the world, on paper and in electronic form. file:, by its nature, doesn't work across the whole world, only locally.

Abstractly, the right thing to do with a URI of M%C3%BCller.html is to find the file Müller.html. How this is actually done very much depends on the OS. [In a similar way, how exactly M%C3%BCller.html is resolved by an http: server may depend on the OS, settings, etc. of the server (for ways to tweak that, see e.g. http://www.w3.org/2003/06/mod_fileiri/).]

There are OSes where the file system works in terms of characters. MS Windows is an example (sorry, I know Konqueror is mainly or only Linux, but for file:, the MSW example helps). The NT-based versions use UTF-16(LE) internally, so what you need to do is to convert to UTF-16(LE) and then get that file via the wide-character file API. On a non-NT Win system, you have to convert to the system code page as far as you can and use the traditional API.

There are other OSes where, traditionally, file names in the file system are just byte strings, with some exterior setting determining how these are viewed. The typical example here is Unix/Linux. The locale (LANG environment variable) determines how the bytes in the file system are viewed or interpreted. So one solution would be to use the character encoding part of the locale. This would work for all cases where the character encoding in all locales used on a box is the same. Many newer distributions come with a lot of UTF-8 locales, and that's the easiest case in many ways.

It seems that whoever implemented "and then translated /back/ into the page's native encoding" made the assumption that on a system with locales with character encoding foo, all the local Web pages would also be encoded in foo. But that may or may not be true.

Note that, in my example, neither M%C3%BCller nor M%FCller works.
It seems to me that, whichever of the two you choose, at least one of them ought to work. If you choose to interpret them as UTF-8 characters, then they should be UTF-8 characters; if literal byte values (as, I understand, the very original URI specs may have intended), then literal byte values. But to treat it as encoded text that must be transcoded into UTF-8, and then reinterpreted again in the original encoding, could never be of any possible use to anyone.

Still, if it is indeed fixed in KDE 4, that is at least answer enough for me. At any rate, I don't personally use Konqueror, and am just trying to resolve an end-user issue for those that do. If there is no solution but to wait for the next version, then so be it.

Can you please tell us the following:
a) What's the character encoding of the HTML (or other) file that contains the file: IRI, and is that HTML file labeled clearly so that the browser actually gets that encoding right?
b) What kind of file system are you using, and what's the encoding of the file name in the file system?

Update on Konqueror 4:

1) local files: Filesystem is in UTF-8. This will be the only configuration I will support.
   a) HTML entities: OK
   b) %-encoding the UTF-8 sequence: OK
   c) direct character (page in UTF-8): OK
   d) direct character (page in Latin 1): OK
   e) %-encoded non-UTF-8 byte sequence: NOK

2) http://www.w3.org/2001/08/iri-test/:
   a) img src: OK
   b) link D%FCrst: NOK
   c) link D%C3%BCrst: OK (status bar shows %-encoding → bug)
   d) IDNs: all OK (status bar shows punycode → bug; direct characters and %-encoded links not shown as visited → bug)

3) http://www.w3.org/International/tests/sec-idn-1.html:
   a) clicking the link: OK
   b) typing the address in the location bar: semi-OK (typing the address and clicking the Enter-like button works; pressing Enter doesn't)
   b.bis) middle-clicking the webpage with the address in the clipboard (i.e. paste the address into the page): OK

In all cases, the Location bar shows the punycode address → bug.

The 1.e and 2.b NOKs above are caused by Qt bugs. The status bar and Location bar showing the ACE form are defects caused by KUrl::prettyUrl doing nothing useful.

SVN commit 667039 by thiago:

Prettify KUrl::prettyUrl().

This solves a defect with the Konqueror UI, where ACE/Punycode URLs
were being shown instead of the proper Unicode strings.

CCBUG:55177

 M  +63 -13    kurl.cpp

--- trunk/KDE/kdelibs/kdecore/io/kurl.cpp #667038:667039
@@ -891,23 +891,73 @@
   return QString::fromLatin1( toEncoded( trailing == RemoveTrailingSlash ? StripTrailingSlash : None ) ); // ## check encoding
 }
 
+static QString toPrettyPercentEncoding(const QString &input)
+{
+  QString result;
+  for (int i = 0; i < input.length(); ++i) {
+    QChar c = input.at(i);
+    register short u = c.unicode();
+    if (u < 0x20 || u == '?' || u == '#' || u == '%') {
+      static const char hexdigits[] = "0123456789ABCDEF";
+      result += QLatin1Char('%');
+      result += QLatin1Char(hexdigits[(u & 0xf0) >> 4]);
+      result += QLatin1Char(hexdigits[u & 0xf]);
+    } else {
+      result += c;
+    }
+  }
+
+  return result;
+}
+
 QString KUrl::prettyUrl( AdjustPathOption trailing ) const
 {
-  // Can't use toString(), it breaks urls with %23 in them (becomes '#', which is parsed back as a fragment)
-  // So prettyUrl is just url, with the password removed.
-  // TODO: we could consider a "toLocalFile or URL" behavior, now that the KUrl constructor can take local paths.
-  // We could replace some chars, like "%20" -> ' ', though?
-  if ( password().isEmpty() )
-    return url( trailing );
+  // reconstruct the URL in a "pretty" form
+  // a "pretty" URL is NOT suitable for data transfer. It's only for showing data to the user.
+  // however, it must be parseable back to its original state, since
+  // notably Konqueror displays it in the Location address.
 
-  QUrl newUrl( *this );
-  newUrl.setPassword( QString() );
-  if ( trailing == AddTrailingSlash && !path().endsWith( QLatin1Char('/') ) ) {
-    // -1 and 0 are provided by QUrl, but not +1.
-    newUrl.setPath( path() + QLatin1Char('/') );
-    return QString::fromLatin1( newUrl.toEncoded() );
+  // A pretty URL is the same as a normal URL, except that:
+  // - the password is removed
+  // - the hostname is shown in Unicode (as opposed to ACE/Punycode)
+  // - the pathname and fragment parts are shown in Unicode (as opposed to %-encoding)
+  QString result = scheme();
+  if (!result.isEmpty())
+    result += QLatin1String("://");
+
+  QString tmp = userName();
+  if (!tmp.isEmpty()) {
+    result += tmp;
+    result += QLatin1Char('@');
   }
-  return QString::fromLatin1( newUrl.toEncoded( trailing == RemoveTrailingSlash ? StripTrailingSlash : None ) );
+
+  result += host();
+
+  if (port() != -1) {
+    result += QLatin1Char(':');
+    result += QString::number(port());
+  }
+
+  tmp = path();
+  result += toPrettyPercentEncoding(tmp);
+
+  // adjust the trailing slash, if necessary
+  if (trailing == AddTrailingSlash && !tmp.endsWith(QLatin1Char('/')))
+    result += QLatin1Char('/');
+  else if (trailing == RemoveTrailingSlash && tmp.length() > 1 && tmp.endsWith(QLatin1Char('/')))
+    result.chop(1);
+
+  if (hasQuery()) {
+    result += QLatin1Char('?');
+    result += QLatin1String(encodedQuery());
+  }
+
+  if (hasFragment()) {
+    result += QLatin1Char('#');
+    result += toPrettyPercentEncoding(fragment());
+  }
+
+  return result;
 }
 
 #if 0

Created attachment 20663 [details]
solve bug problem
Hello Thiago Macieira.
The "Prettify KUrl::prettyUrl()" patch doesn't solve this bug.
Please see attached patch.
The main problem is in
kdelibs-3.5.6/kdecore/kurl.cpp
in constructor:
KURL::KURL( const KURL& _u, const QString& _rel_url, int encoding_hint )
at: 599 :KURL tmp( url() + rUrl, encoding_hint);
/\
||
||
not encoded in encoding_hint
Have fun.
Sorry, missing bug number. Sorry again.

Should be fixed in KDE 4.6; if not, please reopen with an updated test case.
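For readers without a Qt build at hand, the display-only escaping rule used by the toPrettyPercentEncoding() helper in SVN commit 667039 (quoted earlier in this report) can be approximated in standard C++. This is only a sketch under stated assumptions: it works on std::u16string instead of QString, and the names prettyEscape and checkPrettyEscape are hypothetical, not part of kdelibs. One deliberate difference: the loop variable here is char16_t, which is unsigned, whereas the commit stores the code unit in a signed short; on typical platforms a signed 16-bit value at or above 0x8000 would compare as negative and fall into the `u < 0x20` branch.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the commit's helper: escape only control
// characters and the delimiters '?', '#' and '%', so that the string stays
// readable but can still be parsed back. Non-ASCII text is left untouched.
std::u16string prettyEscape(const std::u16string& input) {
    static const char16_t hexdigits[] = u"0123456789ABCDEF";
    std::u16string result;
    for (char16_t u : input) {  // unsigned code unit (assumption of this sketch)
        if (u < 0x20 || u == u'?' || u == u'#' || u == u'%') {
            result += u'%';
            result += hexdigits[(u & 0xF0) >> 4];
            result += hexdigits[u & 0x0F];
        } else {
            result += u;
        }
    }
    return result;
}

// A few spot checks, including a code unit above 0x7FFF.
void checkPrettyEscape() {
    assert(prettyEscape(u"/a b?c#d%e") == u"/a b%3Fc%23d%25e");
    assert(prettyEscape(u"/D\u00FCrst") == u"/D\u00FCrst");
    assert(prettyEscape(u"\t") == u"%09");
    assert(prettyEscape(u"/\u9AD8") == u"/\u9AD8");
}
```

As in the commit's own comments, the result is meant for showing to the user (status bar, Location field), not for data transfer.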