Bug 62425

Summary: Lowercasing/normalising URLs breaks some non-compliant protocols
Product: [Unmaintained] kdelibs Reporter: Thiago Macieira <thiago>
Component: generalAssignee: Thiago Macieira <thiago>
Status: RESOLVED FIXED    
Severity: normal CC: bastian, mail, ummo
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:
Attachments: Support for URI processing modes and per-protocol mode selection.
Processing modes, second attempt.
Processing modes, updated.
Processing modes patch without ugly parser
Processing modes with working mailto mode.
KProtocolInfo::uriParseMode default value fix

Description Thiago Macieira 2003-08-09 19:24:02 UTC
Version:            (using KDE Devel)
Installed from:    Compiled sources
OS:          Linux

Since the IDNA changes were incorporated into KURL and friends, the hostnames are normalised internally according to the following RFCs:

RFC 3454 - Preparation of Internationalized Strings
RFC 3490 - Internationalizing Domain Names in Applications (IDNA)
RFC 3491 - Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)
RFC 3492 - Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)

That means that, among other modifications, that hostnames in URLs get lowercased. Actually, if libidn is not found, current KDE CVS code simply lowercases the URL. This behaviour is (as per my reading) acceptable according to:

STD 3 [mentioned several times in the rules in the RFCs above]
RFC 1738 - Uniform Resource Locators (URL)

However, there are some "URLs" in use out there that do not respect the rules given by the IDNA RFCs and there could be even those that disrespect STD 3. One such scheme is ed2k, which is known to encode a filename inside the hostname part of the URL.

My understanding is that calling ed2k:// a URL is wrong, but RFC 2396 (Uniform Resource Identifiers (URI): Generic Syntax) (URIs are a superset of URLs) allows that. Therefore, ed2k:// and others might be valid URIs even though they are not valid URLs.

This has been a bit of summary of the thread in kde-devel:
http://lists.kde.org/?t=105978281800002&r=1&w=2&n=16

Anyways, as suggested in that thread, KURL should support some kind of different URIs that are not URLs and be aware of their peculiarities. I'm here proposing three different URI types:

internet or url - normal URLs, as they are treated today
mailto - URIs containing mail addresses; IDNA needs to be applied to the hostname part (after the @)
raw - don't touch, don't encode

I've opened this bug report to keep track of the support and also for the information as to why some broken "URLs" broke in KDE CVS/3.2.
Comment 1 Thiago Macieira 2003-08-10 23:47:18 UTC
Moving the discussion here. 
 
Last message from the list was (from Petter Stokke): 
 
My current working copy uses a simple hardcoded list ("ed2k" is raw,  
"mailto" is mailto, and everything else is a normal URL), I guess it could  
fall back on that if no KInstance is available. 
  
  
> But you should also provide a means of the user setting the  
> URI type and handling method. 
 
  
I added that as an option to the constructor, with the present processing  
mode being the default one. 
  
  
> KURL should be lightweight. So, I don't recommend writing code to find 
> files in the disk. 
 
  
I'm trying hard to keep it lightweight, but I don't know any other way to  
do it except to hardcode the protocol list into KURL, which doesn't really  
appeal to me. I was planning on writing a very lightweight parser to fetch  
the protocol mode from the .protocol file, and storing it in a hard cache  
for future use (so it doesn't care if the .protocol file changes) in order  
to maintain some semblance of efficiency. 
Comment 2 Thiago Macieira 2003-08-10 23:50:17 UTC
Agreed on the part of the no-rereading. Protocols shouldn't reasonably change within the 
lifetime of a program. 
 
Another solution would be to code this list inside kdeglobals. It would then be a static list, 
instead of one that each protocol could determine the handling method. 
 
One way or another, I recommend using a QDict hashing table for faster lookups. 
Comment 3 Petter Stokke 2003-08-11 01:23:12 UTC
I've used a QMap, it seems a bit excessive to have to allocate memory for
storing ints in a QDict. But that brings up another issue: should I add locks
around the table access to ensure thread safety, or is KURL in general not
thread safe?

For now, I've written a simple parser using QRegExp to pull the relevant data
out of the .protocol file if KGlobal::activeInstance() returns non-NULL and we
can thus determine the location of the files, defaulting to a hardcoded list
inside KURL if it does not, or if kdecore is being compiled with KDE_QT_ONLY
defined. The whole database could be moved elsewhere if it seems practical, but
for the moment I've tried to keep KURL as self-sufficient as possible.
Comment 4 Petter Stokke 2003-08-12 19:06:46 UTC
Created attachment 2222 [details]
Support for URI processing modes and per-protocol mode selection.

Here's the proposed patch. I'm still trying to find things which break after
applying it, but it seems to have become stable enough for review. At least
KHTML seems to be quite OK with it.

It adds a static method to KURL (uriModeForProtocol) for accessing the
protocol/mode database, and some constructors and methods for manually
specifying processing modes. The processing modes for standardised URLs
(previous and default behaviour) and non-standard but processable URLs (called
RawURI in the code, which on second thought might be incorrect) are
implemented; I've added database support but no implementation for mailto style
URIs as well.
Comment 5 Petter Stokke 2003-08-12 19:34:08 UTC
Oh, I should add that the parser looks for "urimode=<mode>" in the relevant
.protocol file, where <mode> is either "rawuri", "mailto" or anything else
(which signifies a standard URL).
Comment 6 Thiago Macieira 2003-08-13 13:49:36 UTC
Ok, I've reviewed your patch and I have two comments about it. I'll apply it to my local 
tree in my next KDE recompile (two weeks from now, I've got some other time constraints) 
and help you testing it. 
 
Anyways, my comments are: 
1) I don't like the URIMODE_ prefix (all capitals with an underscore) in the enum. In my 
view, it's ugly, but that's me. We'll see if the later if the coding style guide says anything 
about that. 
 
2) I think you should be bold and move all the processing into different parse functions, 
and make KURL::parse simply select the parser function from the uriMode. That way, we'd 
have KURL::parseNormalURI being a copy of the current KURL::parse, then we'd also have 
KURL::parseRawURI (essentially a no-op) and a future KURL::parseMailtoURI. That would 
make the code cleaner. 
 
Oh, I should also say that the URI RFC (2396) specifies that all URIs have the "schema" part 
(what we call protocol). So that part of the code can be inside KURL::parse, which would 
also be required in order to automatically find the URI Mode. 
Comment 7 Petter Stokke 2003-08-13 16:15:05 UTC
> 1) I don't like the URIMODE_ prefix (all capitals with an underscore) in the
> enum. In my view, it's ugly, but that's me. We'll see if the later if the coding
> style guide says anything about that.

I originally didn't use a prefix there, but I got worried about namespace
collisions. The prefix style was inspired by the enums in KIO, but I'll be the
first to agree with you about their ugliness. Possibly we should just do away
with the prefix altogether? As long as they're capitalised, at least they won't
collide with any method names.

> 2) I think you should be bold and move all the processing into different parse
> functions, and make KURL::parse simply select the parser function from the
> uriMode. That way, we'd have KURL::parseNormalURI being a copy of the current
> KURL::parse, then we'd also have KURL::parseRawURI (essentially a no-op) and a
> future KURL::parseMailtoURI. That would make the code cleaner. 

Good idea - the way raw URIs are parsed right now is identical to the standard
parser, and with ed2k URLs what actually happens is that the whole URL aside
from the protocol ends up in the host field. This is syntactically correct if
you assume ed2k is standards compliant, but extremely counter-intuitive - it
would probably be a lot better to just put the whole URI (not including the
protocol) into the path field instead of trying to parse it as a standard URL
without the IDNA code applied on the host name like it's done right now.


Another thing that bothers me: I've tried to maintain binary compatibility as
much as possible, but I think I broke it after all, judging by Konqueror's
misbehaving before I did a complete kdebase recompile. The three extra
constructors I added are a result of this, but I find them less than pleasing -
I'd rather put the URIMode argument into the original three as a default
argument. Is that a big issue?
Comment 8 Thiago Macieira 2003-08-13 16:29:46 UTC
1) They are already namespace-safe because the enum is declared inside the KURL class. 
As long as nobody goes about adding methods with the same names as the enum 
constants, we'll be fine. 
 
And that won't happen because method names should always start with a lowercase letter 
(name it myFunctionLikeThis), while the enum constants start with a capital letter 
(MyEnumConstant). And we don't do underscore hell. :-) 
 
2) No, you can't change the existing constructors without breaking binary compatibility. You 
can, however, make your new constructors look like the old ones and add a comment like 
this: 
 ## BCI KDE 4.0: merge with above 
(BCI stands for Binary Compatibility Issue) 
 
The way I see it, KURL needs a cleanup for KDE 4.0. I'd actually vote for having it be 
subclassed into the different URI modes. (KURL would be a misnomer for KURI, but that's 
what is used throughout) 
 
My first glance at your code did not spot any Binary Incompatible Changes. It should be 
safe to run without rebuilding kdebase. Your problems might be caused by one of your 
earlier builds. I'll test that when I build it myself. 
 
Actually, scratch that. I've just spotted the BIC: you've added a new member variable to the 
class, increasing its size. You can't do that. Add it to the d pointer and change your uriMode
() inline function into a full one. 
Comment 9 Petter Stokke 2003-08-14 18:24:33 UTC
Created attachment 2241 [details]
Processing modes, second attempt.

Here's a new patch. I've not been able to tell in practice if it breaks binary
compatibility, as I had to recompile my whole KDE again because it certainly
breaks compatibility with the previous patch, but as far as I can tell it looks
safe.

I've also split the parse function into one for each mode, as suggested, and
while I was at it I implemented the mailto parsing with IDNA conversion of the
host part of the email address.
Comment 10 Petter Stokke 2003-08-25 20:06:10 UTC
Created attachment 2311 [details]
Processing modes, updated.

There have been some conflicting commits to the KURL code lately, so I thought
I'd make a new patch. After two weeks of testing, the code seems stable; can I
commit it soon? :)
Comment 11 Thiago Macieira 2003-08-25 20:14:04 UTC
Well, I'm still unable to compile or run KDE, but I hope this situation will 
change later this week. I recommend, though, you post the patch to kde-core-
devel and ask for comments and more testers.
Comment 12 Matthias Wieser 2003-08-30 01:46:50 UTC
Urls are case-insensitive. So there is no deed to change valid URLs to lower 
case. 
If, for example, an user writes http://ReallyLongUrl.com because it's better 
readable then respect this. 
Comment 13 Thiago Macieira 2003-08-30 01:50:55 UTC
The URL gets lowercased because it's part of the normalisation process. That 
process is also responsible for changing the 
Comment 14 Leo Savernik 2003-09-15 00:12:21 UTC
I may add my own wisdom here regarding data urls (rfc2397).  
  
data urls don't comply with [rfc1738, 3.1] at all, they have no host part and  
no path. Therefore, *everything* has to be treated literally.  
  
However, rfc2397 states they are subject to decoding. That means the data url  
  
data:,%31%30%30  
  
should be decoded to  
  
data:,100  
  
Interpreting the definition of KURL::RawURI, this won't happen. If this is the  
case (and I haven't misinterpreted it), data urls need another URIMode like  
RawDecodeable.  
  
I also think it's a bad idea to put the raw url into the path variable. The  
rationale is that for data urls, there is no such thing that is even remotely  
comparable to a path. DOM's HTMLAnchorElement::pathname, however, gets  
initialized with whatever KURL::path() returns which would be total crap in the  
case of data urls. But Javascript applications expect a valid pathname here  
(Mozilla, e. g. fills in nothing for pathname on data urls).  
  
Therefore I propose that with the exception of protocol() and url(), all  
component-retrieval methods return an empty string on raw urls. As url()  
returns the url with encodings left in place, maybe there should be an  
additional function urlDecoded().  
  
Petter: Please include "data" in the list for autodetection 
 
	if ( protocol == "ed2k" ) mode = RawURI; 
+       else if ( protocol == "data" ) mode = RawURI; 
	else if ( protocol == "mailto" ) mode = Mailto; 
 
Comment 15 Petter Stokke 2003-09-15 01:24:44 UTC
Subject: Re:  Lowercasing/normalising URLs breaks some
	non-compliant protocols

> However, rfc2397 states they are subject to decoding. That means the data url  
>   
> data:,%31%30%30  
>   
> should be decoded to  
>   
> data:,100  
>   
> Interpreting the definition of KURL::RawURI, this won't happen. If this is the  
> case (and I haven't misinterpreted it), data urls need another URIMode like  
> RawDecodeable.  

I'm not sure about the definition, but the RawURI implementation, at
least, does decode its input, because ed2k, sig2dat and magnet URIs want
the same behaviour as data. In fact, I don't think there's ever going to
be a situation where we don't want to decode %xx constructs.
  
> [...] I propose that with the exception of protocol() and url(), all  
> component-retrieval methods return an empty string on raw urls.

This might cause problems with mailto URIs, though - the old behaviour
would return the email address as path(). In my implementation, they're
treated as RawURI, but with extra processing for the host part, which is
the reasoning behind putting it in path() in both cases. Then again,
it's perfectly possible to make this behaviour exclusive to mailto.

> As url() returns the url with encodings left in place, maybe there should be an
> additional function urlDecoded().

Isn't that pretty much what prettyURL() does?

> Petter: Please include "data" in the list for autodetection 

Done.


Comment 16 Leo Savernik 2003-09-15 12:28:00 UTC
Well, in this case data urls are all RawURI. 
 
As you tell it, mailto should be treated exceptionally. 
 
Doesn't prettyURL mean: Do any amount of incomprehensible, irreversible 
distortion to make the url pretty to read, but totally unsuitable for any 
further processing? 
 
Data urls have to be treated literally (the only processing allowed is %xx 
decoding). If prettyURL does exactly do this for RawURI, and this behaviour is 
guaranteed (i. e. doesn't start doing distortions in KDE 3.3/4.0 because of 
some "usability" study), everything's well. 
 
But deducing from the explanation in the docs "Returns the URL as string in 
human-friendly format.", I'm afraid this won't hold true for long because one 
day a programmer will come, and *do* distortions to make is more 
"human-friendly". 
 
Comment 17 Thiago Macieira 2003-09-16 16:25:06 UTC
Hello, this is to let you know that I'm back to my own computer. KDE recompilation is 
scheduled for tonight :-) 
 
Anyways, about the KURL::prettyURL() call: it's intended for things shown to the user, but 
not for internal data. That is, if we had got data:%31%41%E1, Konqueror would show in 
the status bar "data:1A%E1", as well as in the Location window if clicked. But the URL being 
copied into the clipboard or sent to the ioslaves would be the one from KURL::url() -- that 
is, untransformed. 
 
The same is also true for IDN domains: url() returns the ACE-encoded domain, which is 
NEVER to be shown to the user outside debugging purposes, whereas prettyURL() shows 
them in the normalised Unicode form. 
Comment 18 David Faure 2003-10-20 00:31:19 UTC
> Doesn't prettyURL mean: Do any amount of incomprehensible, irreversible 
> distortion to make the url pretty to read, but totally unsuitable for any 
> further processing? 

Not at all. The golden rule of prettyURL is that
KURL( u.prettyURL() ) == u (or at least they're equivalent).

Otherwise pressing 'Enter' in the Konqueror location bar would lead to "invalid URL". So prettyURL gives a URL that _is_ suitable for further processing, and its transformations are usually reversible. Hmm did someone tell Waldo about this BR?
Comment 19 Petter Stokke 2003-10-30 00:25:19 UTC
Anything happening here? It's getting dangerously close to the 3.2 release...
Comment 20 Petter Stokke 2003-11-05 21:49:26 UTC
Created attachment 3046 [details]
Processing modes patch without ugly parser

Here's a new patch. What's different: KProtocolInfo now understands the
"urimode" field in the protocol description file. KURL uses KProtocolInfo if
available to get the data it needs, instead of relying on that ugly standalone
parser. Added some tests for the new API stuff, and fixed some bugs exposed by
the test suite.

How does this look?
Comment 21 Thiago Macieira 2003-11-05 22:23:58 UTC
Thank you, it looks now much nicer. Too bad KURL has to have a d-pointer now.

Anyways, there's just one thing missing I'd like to add: the automatic detection could check if the first two characters after the protocol and the colon are //. If they are, it should default to URL mode; if it doesn't, RawURI.

That is, blah://txt is URL and therefore "txt" is supposed to be a hostname, whereas blah:txt is not URL and "txt" can be anything. This is the very reason why ed2k:// is broken. If it didn't have those //, it would be ok.

PS: can you also add the @since 3.2 to the other members you added?
Comment 22 Petter Stokke 2003-11-05 22:49:22 UTC
> Anyways, there's just one thing missing I'd like to add: the automatic
> detection could check if the first two characters after the protocol and
> the colon are //. If they are, it should default to URL mode; if it
> doesn't, RawURI.
> 
> That is, blah://txt is URL and therefore "txt" is supposed to be a
> hostname, whereas blah:txt is not URL and "txt" can be anything. This is
> the very reason why ed2k:// is broken. If it didn't have those //, it
> would be ok.

Maybe I misunderstood what you meant, but that sounds to me like it would break a lot of things. If we do that, correctly formed URIs which don't start with // (like mailto, assuming that wasn't handled as a special case) would be parsed as raw URIs, meaning eg. "mailto:foo@bar.com?subject=Foobar" would get everything after the colon put into the path part, which is clearly incorrect. Moreover, all broken URLs I've come across so far (ed2k, sig2dat and slsk) all have the //, so 
making assumptions based on their presence doesn't make much sense either.

I think it's safer to assume everything should be parsed as a valid URL (as it is today), unless explicitly told otherwise by a protocol file or the hardcoded fallback list.

> PS: can you also add the @since 3.2 to the other members you added?

Oops. Done.
Comment 23 Thiago Macieira 2003-11-05 23:03:50 UTC
That's not what I meant. What I wanted to do is add an extra check for the automatic mode, for when KURL doesn't find a value for a given protocol.

The decision would be thus like this:
- if the mode is on cache, use it
- if the .protocol file has the value, use it (and cache)
- if the mode hasn't been detected, look at the URI schema (the protocol)
.- if it's mailto:, set the mode to MailTo mode
.- if it's one of the known broken schemas (ed2k, sig2dat, slsk), use RawURI
and here's the change:
.- if it's still unknown, but the non-schema part starts with //, it's URL
.- otherwise, it's RawURI

What the RFCs say is that anything://something-else is a URL and that's what those broken protocols violate (because they have // not followed by a hostname). Other schemas may be specified (like data or mailto) with different parsings.
Comment 24 Petter Stokke 2003-11-06 00:03:01 UTC
Look at these URLs (all of which are correctly parsed by KURL today):

aim:goim?screenname=Foobar&message=Hello+world
magnet:?xt=urn:sha1:PIXQRS5AZHRIXCBEJJS3W2JATUQ6XZZA&dn=foobar.zip
file:/foo/bar/baz?foo=bar

Assuming none of those protocols have a URIMode tag in their respective protocol files (file doesn't, and KDE doesn't even handle the other two), they would all be parsed as RawURI according to the method you suggest, which means KURL::queryItem() et al wouldn't work on them as expected, and, in the case of the file URL especially, KURL::path() would be unusable. RawURI doesn't parse anything besides putting whatever comes after the colon into m_strPath after expanding %xx constructs, which is clearly not what we want in these cases. It should be reserved for protocols which are explicitly known to be broken.
Comment 25 Petter Stokke 2003-11-11 17:37:57 UTC
Created attachment 3154 [details]
Processing modes with working mailto mode.

I broke the mailto processing in the last patch, so here's a working version,
along with the requested apidoc update. Is this starting to look committable?
Comment 26 Thiago Macieira 2003-11-11 22:12:18 UTC
It has been committable for some time. The problem is, it's never got applied. I just want someone with better knowledge of KURL than me to verify because it's a pretty intrusive patch (unlike anything I had done in KURL before). So far no one has stepped up to do that.
Comment 27 Waldo Bastian 2003-11-11 22:24:11 UTC
Will have a look at it later this week.
Comment 28 Thiago Macieira 2003-11-23 05:27:24 UTC
We have a slight problem with the current code. I've just fixed it halfway but the patch is not correct, so I'm not attaching anything. 

You may have noticed that Konqueror stopped caching correctly the webpages. I've spent a couple of hours tracing down the bug and I got it narrowed to KURL::url.

The reason for that is that the caching is handled in kio_http, but the URL is read from a data stream, which doesn't and cannot carry the d->uriMode variable (the wire format is crystallised and cannot change before KDE4, no room for updates).

That means that when HTTPProtocol::checkCacheEntry calls KURL::url to determine the cache, the value of d->uriMode can be undetermined and, therefore, the path name and query can be mishandled.

The same could also be true for other cases.

My recommendation is that operator>>(QDataStream&, KURL&) make sure that d->uriMode is properly set.
Comment 29 Waldo Bastian 2003-11-30 23:18:59 UTC
I am currently testing a patch, will commit after Beta2
Comment 30 Waldo Bastian 2003-12-10 15:19:01 UTC
Subject: kdelibs

CVS commit by waba: 

Make parsing of URLs protocol-dependant (BR62425)
Based on a patch by Petter E. Stokke
CCMAIL: 62425-done@bugs.kde.org


  M +26 -4     kdecore/kprotocolinfo_kdecore.cpp   1.8
  M +4 -3      kdecore/kprotocolinfofactory.cpp   1.2
  M +2 -2      kdecore/kprotocolinfofactory.h   1.4
  M +169 -9    kdecore/kurl.cpp   1.261
  M +54 -3     kdecore/kurl.h   1.130
  M +11 -0     kdecore/tests/kurltest.cpp   1.82
  M +1 -0      kio/data.protocol   1.6
  M +17 -0     kio/kio/kprotocolinfo.h   1.28



Comment 31 Thiago Macieira 2003-12-10 16:10:07 UTC
Please test if the query part isn't lost in KURL::url().

I.e., go to Google, get a search. Then hit Next. If it shows the same page again instead of the next results, it's this bug.
Comment 32 Petter Stokke 2003-12-12 02:06:17 UTC
Created attachment 3664 [details]
KProtocolInfo::uriParseMode default value fix

KProtocolInfo::uriParseMode needs to return Auto if the URI mode is missing,
otherwise the fallback list is never used, causing mailto URLs to be parsed as
regular URLs in the absence of a mailto.protocol file (there isn't one in KDE).
Comment 33 Sebastian Sauer 2008-03-11 22:18:16 UTC
Hi Thiago,

should we reopen this report cause QUrl and with it KDE4's KUrl are not any longer able to deal with RAW-mode url's/uri's?

probably somewhat related report is bug #159017
Comment 34 Thiago Macieira 2008-03-11 23:17:31 UTC
Hmm... looks like we lost the KURL fixes when we moved to QUrl.

QUrl is to-the-letter strict. If your URI is not a URL, placing it in a QUrl will lose some of the data.