Summary: | Lowercasing/normalising URLs breaks some non-compliant protocols | ||
---|---|---|---|
Product: | [Unmaintained] kdelibs | Reporter: | Thiago Macieira <thiago> |
Component: | general | Assignee: | Thiago Macieira <thiago> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | bastian, mail, ummo |
Priority: | NOR | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | Compiled Sources | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: |
Support for URI processing modes and per-protocol mode selection.
Processing modes, second attempt. Processing modes, updated. Processing modes patch without ugly parser Processing modes with working mailto mode. KProtocolInfo::uriParseMode default value fix |
Description
Thiago Macieira
2003-08-09 19:24:02 UTC
Moving the discussion here. Last message from the list was (from Petter Stokke): My current working copy uses a simple hardcoded list ("ed2k" is raw, "mailto" is mailto, and everything else is a normal URL), I guess it could fall back on that if no KInstance is available. > But you should also provide a means of the user setting the > URI type and handling method. I added that as an option to the constructor, with the present processing mode being the default one. > KURL should be lightweight. So, I don't recommend writing code to find > files in the disk. I'm trying hard to keep it lightweight, but I don't know any other way to do it except to hardcode the protocol list into KURL, which doesn't really appeal to me. I was planning on writing a very lightweight parser to fetch the protocol mode from the .protocol file, and storing it in a hard cache for future use (so it doesn't care if the .protocol file changes) in order to maintain some semblance of efficiency. Agreed on the part of the no-rereading. Protocols shouldn't reasonably change within the lifetime of a program. Another solution would be to code this list inside kdeglobals. It would then be a static list, instead of one that each protocol could determine the handling method. One way or another, I recommend using a QDict hashing table for faster lookups. I've used a QMap, it seems a bit excessive to have to allocate memory for storing ints in a QDict. But that brings up another issue: should I add locks around the table access to ensure thread safety, or is KURL in general not thread safe? For now, I've written a simple parser using QRegExp to pull the relevant data out of the .protocol file if KGlobal::activeInstance() returns non-NULL and we can thus determine the location of the files, defaulting to a hardcoded list inside KURL if it does not, or if kdecore is being compiled with KDE_QT_ONLY defined. The whole database could be moved elsewhere if it seems practical, but for the moment I've tried to keep KURL as self-sufficient as possible. Created attachment 2222 [details]
Support for URI processing modes and per-protocol mode selection.
Here's the proposed patch. I'm still trying to find things which break after
applying it, but it seems to have become stable enough for review. At least
KHTML seems to be quite OK with it.
It adds a static method to KURL (uriModeForProtocol) for accessing the
protocol/mode database, and some constructors and methods for manually
specifying processing modes. The processing modes for standardised URLs
(previous and default behaviour) and non-standard but processable URLs (called
RawURI in the code, which on second thought might be incorrect) are
implemented; I've added database support but no implementation for mailto style
URIs as well.
Oh, I should add that the parser looks for "urimode=<mode>" in the relevant .protocol file, where <mode> is either "rawuri", "mailto" or anything else (which signifies a standard URL). Ok, I've reviewed your patch and I have two comments about it. I'll apply it to my local tree in my next KDE recompile (two weeks from now, I've got some other time constraints) and help you testing it. Anyways, my comments are: 1) I don't like the URIMODE_ prefix (all capitals with an underscore) in the enum. In my view, it's ugly, but that's me. We'll see if the later if the coding style guide says anything about that. 2) I think you should be bold and move all the processing into different parse functions, and make KURL::parse simply select the parser function from the uriMode. That way, we'd have KURL::parseNormalURI being a copy of the current KURL::parse, then we'd also have KURL::parseRawURI (essentially a no-op) and a future KURL::parseMailtoURI. That would make the code cleaner. Oh, I should also say that the URI RFC (2396) specifies that all URIs have the "schema" part (what we call protocol). So that part of the code can be inside KURL::parse, which would also be required in order to automatically find the URI Mode. > 1) I don't like the URIMODE_ prefix (all capitals with an underscore) in the > enum. In my view, it's ugly, but that's me. We'll see if the later if the coding > style guide says anything about that. I originally didn't use a prefix there, but I got worried about namespace collisions. The prefix style was inspired by the enums in KIO, but I'll be the first to agree with you about their ugliness. Possibly we should just do away with the prefix altogether? As long as they're capitalised, at least they won't collide with any method names. > 2) I think you should be bold and move all the processing into different parse > functions, and make KURL::parse simply select the parser function from the > uriMode. That way, we'd have KURL::parseNormalURI being a copy of the current > KURL::parse, then we'd also have KURL::parseRawURI (essentially a no-op) and a > future KURL::parseMailtoURI. That would make the code cleaner. Good idea - the way raw URIs are parsed right now is identical to the standard parser, and with ed2k URLs what actually happens is that the whole URL aside from the protocol ends up in the host field. This is syntactically correct if you assume ed2k is standards compliant, but extremely counter-intuitive - it would probably be a lot better to just put the whole URI (not including the protocol) into the path field instead of trying to parse it as a standard URL without the IDNA code applied on the host name like it's done right now. Another thing that bothers me: I've tried to maintain binary compatibility as much as possible, but I think I broke it after all, judging by Konqueror's misbehaving before I did a complete kdebase recompile. The three extra constructors I added are a result of this, but I find them less than pleasing - I'd rather put the URIMode argument into the original three as a default argument. Is that a big issue? 1) They are already namespace-safe because the enum is declared inside the KURL class. As long as nobody goes about adding methods with the same names as the enum constants, we'll be fine. And that won't happen because method names should always start with a lowercase letter (name it myFunctionLikeThis), while the enum constants start with a capital letter (MyEnumConstant). And we don't do underscore hell. :-) 2) No, you can't change the existing constructors without breaking binary compatibility. You can, however, make your new constructors look like the old ones and add a comment like this: ## BCI KDE 4.0: merge with above (BCI stands for Binary Compatibility Issue) The way I see it, KURL needs a cleanup for KDE 4.0. I'd actually vote for having it be subclassed into the different URI modes. (KURL would be a misnomer for KURI, but that's what is used throughout) My first glance at your code did not spot any Binary Incompatible Changes. It should be safe to run without rebuilding kdebase. Your problems might be caused by one of your earlier builds. I'll test that when I build it myself. Actually, scratch that. I've just spotted the BIC: you've added a new member variable to the class, increasing its size. You can't do that. Add it to the d pointer and change your uriMode () inline function into a full one. Created attachment 2241 [details]
Processing modes, second attempt.
Here's a new patch. I've not been able to tell in practice if it breaks binary
compatibility, as I had to recompile my whole KDE again because it certainly
breaks compatibility with the previous patch, but as far as I can tell it looks
safe.
I've also split the parse function into one for each mode, as suggested, and
while I was at it I implemented the mailto parsing with IDNA conversion of the
host part of the email address.
Created attachment 2311 [details]
Processing modes, updated.
There have been some conflicting commits to the KURL code lately, so I thought
I'd make a new patch. After two weeks of testing, the code seems stable; can I
commit it soon? :)
Well, I'm still unable to compile or run KDE, but I hope this situation will change later this week. I recommend, though, you post the patch to kde-core- devel and ask for comments and more testers. Urls are case-insensitive. So there is no deed to change valid URLs to lower case. If, for example, an user writes http://ReallyLongUrl.com because it's better readable then respect this. The URL gets lowercased because it's part of the normalisation process. That process is also responsible for changing the I may add my own wisdom here regarding data urls (rfc2397). data urls don't comply with [rfc1738, 3.1] at all, they have no host part and no path. Therefore, *everything* has to be treated literally. However, rfc2397 states they are subject to decoding. That means the data url data:,%31%30%30 should be decoded to data:,100 Interpreting the definition of KURL::RawURI, this won't happen. If this is the case (and I haven't misinterpreted it), data urls need another URIMode like RawDecodeable. I also think it's a bad idea to put the raw url into the path variable. The rationale is that for data urls, there is no such thing that is even remotely comparable to a path. DOM's HTMLAnchorElement::pathname, however, gets initialized with whatever KURL::path() returns which would be total crap in the case of data urls. But Javascript applications expect a valid pathname here (Mozilla, e. g. fills in nothing for pathname on data urls). Therefore I propose that with the exception of protocol() and url(), all component-retrieval methods return an empty string on raw urls. As url() returns the url with encodings left in place, maybe there should be an additional function urlDecoded(). Petter: Please include "data" in the list for autodetection if ( protocol == "ed2k" ) mode = RawURI; + else if ( protocol == "data" ) mode = RawURI; else if ( protocol == "mailto" ) mode = Mailto; Subject: Re: Lowercasing/normalising URLs breaks some non-compliant protocols > However, rfc2397 states they are subject to decoding. That means the data url > > data:,%31%30%30 > > should be decoded to > > data:,100 > > Interpreting the definition of KURL::RawURI, this won't happen. If this is the > case (and I haven't misinterpreted it), data urls need another URIMode like > RawDecodeable. I'm not sure about the definition, but the RawURI implementation, at least, does decode its input, because ed2k, sig2dat and magnet URIs want the same behaviour as data. In fact, I don't think there's ever going to be a situation where we don't want to decode %xx constructs. > [...] I propose that with the exception of protocol() and url(), all > component-retrieval methods return an empty string on raw urls. This might cause problems with mailto URIs, though - the old behaviour would return the email address as path(). In my implementation, they're treated as RawURI, but with extra processing for the host part, which is the reasoning behind putting it in path() in both cases. Then again, it's perfectly possible to make this behaviour exclusive to mailto. > As url() returns the url with encodings left in place, maybe there should be an > additional function urlDecoded(). Isn't that pretty much what prettyURL() does? > Petter: Please include "data" in the list for autodetection Done. Well, in this case data urls are all RawURI. As you tell it, mailto should be treated exceptionally. Doesn't prettyURL mean: Do any amount of incomprehensible, irreversible distortion to make the url pretty to read, but totally unsuitable for any further processing? Data urls have to be treated literally (the only processing allowed is %xx decoding). If prettyURL does exactly do this for RawURI, and this behaviour is guaranteed (i. e. doesn't start doing distortions in KDE 3.3/4.0 because of some "usability" study), everything's well. But deducing from the explanation in the docs "Returns the URL as string in human-friendly format.", I'm afraid this won't hold true for long because one day a programmer will come, and *do* distortions to make is more "human-friendly". Hello, this is to let you know that I'm back to my own computer. KDE recompilation is scheduled for tonight :-) Anyways, about the KURL::prettyURL() call: it's intended for things shown to the user, but not for internal data. That is, if we had got data:%31%41%E1, Konqueror would show in the status bar "data:1A%E1", as well as in the Location window if clicked. But the URL being copied into the clipboard or sent to the ioslaves would be the one from KURL::url() -- that is, untransformed. The same is also true for IDN domains: url() returns the ACE-encoded domain, which is NEVER to be shown to the user outside debugging purposes, whereas prettyURL() shows them in the normalised Unicode form. > Doesn't prettyURL mean: Do any amount of incomprehensible, irreversible
> distortion to make the url pretty to read, but totally unsuitable for any
> further processing?
Not at all. The golden rule of prettyURL is that
KURL( u.prettyURL() ) == u (or at least they're equivalent).
Otherwise pressing 'Enter' in the Konqueror location bar would lead to "invalid URL". So prettyURL gives a URL that _is_ suitable for further processing, and its transformations are usually reversible. Hmm did someone tell Waldo about this BR?
Anything happening here? It's getting dangerously close to the 3.2 release... Created attachment 3046 [details]
Processing modes patch without ugly parser
Here's a new patch. What's different: KProtocolInfo now understands the
"urimode" field in the protocol description file. KURL uses KProtocolInfo if
available to get the data it needs, instead of relying on that ugly standalone
parser. Added some tests for the new API stuff, and fixed some bugs exposed by
the test suite.
How does this look?
Thank you, it looks now much nicer. Too bad KURL has to have a d-pointer now. Anyways, there's just one thing missing I'd like to add: the automatic detection could check if the first two characters after the protocol and the colon are //. If they are, it should default to URL mode; if it doesn't, RawURI. That is, blah://txt is URL and therefore "txt" is supposed to be a hostname, whereas blah:txt is not URL and "txt" can be anything. This is the very reason why ed2k:// is broken. If it didn't have those //, it would be ok. PS: can you also add the @since 3.2 to the other members you added? > Anyways, there's just one thing missing I'd like to add: the automatic > detection could check if the first two characters after the protocol and > the colon are //. If they are, it should default to URL mode; if it > doesn't, RawURI. > > That is, blah://txt is URL and therefore "txt" is supposed to be a > hostname, whereas blah:txt is not URL and "txt" can be anything. This is > the very reason why ed2k:// is broken. If it didn't have those //, it > would be ok. Maybe I misunderstood what you meant, but that sounds to me like it would break a lot of things. If we do that, correctly formed URIs which don't start with // (like mailto, assuming that wasn't handled as a special case) would be parsed as raw URIs, meaning eg. "mailto:foo@bar.com?subject=Foobar" would get everything after the colon put into the path part, which is clearly incorrect. Moreover, all broken URLs I've come across so far (ed2k, sig2dat and slsk) all have the //, so making assumptions based on their presence doesn't make much sense either. I think it's safer to assume everything should be parsed as a valid URL (as it is today), unless explicitly told otherwise by a protocol file or the hardcoded fallback list. > PS: can you also add the @since 3.2 to the other members you added? Oops. Done. That's not what I meant. What I wanted to do is add an extra check for the automatic mode, for when KURL doesn't find a value for a given protocol. The decision would be thus like this: - if the mode is on cache, use it - if the .protocol file has the value, use it (and cache) - if the mode hasn't been detected, look at the URI schema (the protocol) .- if it's mailto:, set the mode to MailTo mode .- if it's one of the known broken schemas (ed2k, sig2dat, slsk), use RawURI and here's the change: .- if it's still unknown, but the non-schema part starts with //, it's URL .- otherwise, it's RawURI What the RFCs say is that anything://something-else is a URL and that's what those broken protocols violate (because they have // not followed by a hostname). Other schemas may be specified (like data or mailto) with different parsings. Look at these URLs (all of which are correctly parsed by KURL today): aim:goim?screenname=Foobar&message=Hello+world magnet:?xt=urn:sha1:PIXQRS5AZHRIXCBEJJS3W2JATUQ6XZZA&dn=foobar.zip file:/foo/bar/baz?foo=bar Assuming none of those protocols have a URIMode tag in their respective protocol files (file doesn't, and KDE doesn't even handle the other two), they would all be parsed as RawURI according to the method you suggest, which means KURL::queryItem() et al wouldn't work on them as expected, and, in the case of the file URL especially, KURL::path() would be unusable. RawURI doesn't parse anything besides putting whatever comes after the colon into m_strPath after expanding %xx constructs, which is clearly not what we want in these cases. It should be reserved for protocols which are explicitly known to be broken. Created attachment 3154 [details]
Processing modes with working mailto mode.
I broke the mailto processing in the last patch, so here's a working version,
along with the requested apidoc update. Is this starting to look committable?
It has been committable for some time. The problem is, it's never got applied. I just want someone with better knowledge of KURL than me to verify because it's a pretty intrusive patch (unlike anything I had done in KURL before). So far no one has stepped up to do that. Will have a look at it later this week. We have a slight problem with the current code. I've just fixed it halfway but the patch is not correct, so I'm not attaching anything. You may have noticed that Konqueror stopped caching correctly the webpages. I've spent a couple of hours tracing down the bug and I got it narrowed to KURL::url. The reason for that is that the caching is handled in kio_http, but the URL is read from a data stream, which doesn't and cannot carry the d->uriMode variable (the wire format is crystallised and cannot change before KDE4, no room for updates). That means that when HTTPProtocol::checkCacheEntry calls KURL::url to determine the cache, the value of d->uriMode can be undetermined and, therefore, the path name and query can be mishandled. The same could also be true for other cases. My recommendation is that operator>>(QDataStream&, KURL&) make sure that d->uriMode is properly set. I am currently testing a patch, will commit after Beta2 Subject: kdelibs CVS commit by waba: Make parsing of URLs protocol-dependant (BR62425) Based on a patch by Petter E. Stokke CCMAIL: 62425-done@bugs.kde.org M +26 -4 kdecore/kprotocolinfo_kdecore.cpp 1.8 M +4 -3 kdecore/kprotocolinfofactory.cpp 1.2 M +2 -2 kdecore/kprotocolinfofactory.h 1.4 M +169 -9 kdecore/kurl.cpp 1.261 M +54 -3 kdecore/kurl.h 1.130 M +11 -0 kdecore/tests/kurltest.cpp 1.82 M +1 -0 kio/data.protocol 1.6 M +17 -0 kio/kio/kprotocolinfo.h 1.28 Please test if the query part isn't lost in KURL::url(). I.e., go to Google, get a search. Then hit Next. If it shows the same page again instead of the next results, it's this bug. Created attachment 3664 [details]
KProtocolInfo::uriParseMode default value fix
KProtocolInfo::uriParseMode needs to return Auto if the URI mode is missing,
otherwise the fallback list is never used, causing mailto URLs to be parsed as
regular URLs in the absence of a mailto.protocol file (there isn't one in KDE).
Hi Thiago, should we reopen this report cause QUrl and with it KDE4's KUrl are not any longer able to deal with RAW-mode url's/uri's? probably somewhat related report is bug #159017 Hmm... looks like we lost the KURL fixes when we moved to QUrl. QUrl is to-the-letter strict. If your URI is not a URL, placing it in a QUrl will lose some of the data. |