Version: (using KDE Devel)
Installed from: Compiled sources
Since the IDNA changes were incorporated into KURL and friends, the hostnames are normalised internally according to the following RFCs:
RFC 3454 - Preparation of Internationalized Strings
RFC 3490 - Internationalizing Domain Names in Applications (IDNA)
RFC 3491 - Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)
RFC 3492 - Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)
That means that, among other modifications, that hostnames in URLs get lowercased. Actually, if libidn is not found, current KDE CVS code simply lowercases the URL. This behaviour is (as per my reading) acceptable according to:
STD 3 [mentioned several times in the rules in the RFCs above]
RFC 1738 - Uniform Resource Locators (URL)
However, there are some "URLs" in use out there that do not respect the rules given by the IDNA RFCs and there could be even those that disrespect STD 3. One such scheme is ed2k, which is known to encode a filename inside the hostname part of the URL.
My understanding is that calling ed2k:// a URL is wrong, but RFC 2396 (Uniform Resource Identifiers (URI): Generic Syntax) (URIs are a superset of URLs) allows that. Therefore, ed2k:// and others might be valid URIs even though they are not valid URLs.
This has been a bit of summary of the thread in kde-devel:
Anyways, as suggested in that thread, KURL should support some kind of different URIs that are not URLs and be aware of their peculiarities. I'm here proposing three different URI types:
internet or url - normal URLs, as they are treated today
mailto - URIs containing mail addresses; IDNA needs to be applied to the hostname part (after the @)
raw - don't touch, don't encode
I've opened this bug report to keep track of the support and also for the information as to why some broken "URLs" broke in KDE CVS/3.2.
Moving the discussion here.
Last message from the list was (from Petter Stokke):
My current working copy uses a simple hardcoded list ("ed2k" is raw,
"mailto" is mailto, and everything else is a normal URL), I guess it could
fall back on that if no KInstance is available.
> But you should also provide a means of the user setting the
> URI type and handling method.
I added that as an option to the constructor, with the present processing
mode being the default one.
> KURL should be lightweight. So, I don't recommend writing code to find
> files in the disk.
I'm trying hard to keep it lightweight, but I don't know any other way to
do it except to hardcode the protocol list into KURL, which doesn't really
appeal to me. I was planning on writing a very lightweight parser to fetch
the protocol mode from the .protocol file, and storing it in a hard cache
for future use (so it doesn't care if the .protocol file changes) in order
to maintain some semblance of efficiency.
Agreed on the part of the no-rereading. Protocols shouldn't reasonably change within the
lifetime of a program.
Another solution would be to code this list inside kdeglobals. It would then be a static list,
instead of one that each protocol could determine the handling method.
One way or another, I recommend using a QDict hashing table for faster lookups.
I've used a QMap, it seems a bit excessive to have to allocate memory for
storing ints in a QDict. But that brings up another issue: should I add locks
around the table access to ensure thread safety, or is KURL in general not
For now, I've written a simple parser using QRegExp to pull the relevant data
out of the .protocol file if KGlobal::activeInstance() returns non-NULL and we
can thus determine the location of the files, defaulting to a hardcoded list
inside KURL if it does not, or if kdecore is being compiled with KDE_QT_ONLY
defined. The whole database could be moved elsewhere if it seems practical, but
for the moment I've tried to keep KURL as self-sufficient as possible.
Created attachment 2222 [details]
Support for URI processing modes and per-protocol mode selection.
Here's the proposed patch. I'm still trying to find things which break after
applying it, but it seems to have become stable enough for review. At least
KHTML seems to be quite OK with it.
It adds a static method to KURL (uriModeForProtocol) for accessing the
protocol/mode database, and some constructors and methods for manually
specifying processing modes. The processing modes for standardised URLs
(previous and default behaviour) and non-standard but processable URLs (called
RawURI in the code, which on second thought might be incorrect) are
implemented; I've added database support but no implementation for mailto style
URIs as well.
Oh, I should add that the parser looks for "urimode=<mode>" in the relevant
.protocol file, where <mode> is either "rawuri", "mailto" or anything else
(which signifies a standard URL).
Ok, I've reviewed your patch and I have two comments about it. I'll apply it to my local
tree in my next KDE recompile (two weeks from now, I've got some other time constraints)
and help you testing it.
Anyways, my comments are:
1) I don't like the URIMODE_ prefix (all capitals with an underscore) in the enum. In my
view, it's ugly, but that's me. We'll see if the later if the coding style guide says anything
2) I think you should be bold and move all the processing into different parse functions,
and make KURL::parse simply select the parser function from the uriMode. That way, we'd
have KURL::parseNormalURI being a copy of the current KURL::parse, then we'd also have
KURL::parseRawURI (essentially a no-op) and a future KURL::parseMailtoURI. That would
make the code cleaner.
Oh, I should also say that the URI RFC (2396) specifies that all URIs have the "schema" part
(what we call protocol). So that part of the code can be inside KURL::parse, which would
also be required in order to automatically find the URI Mode.
> 1) I don't like the URIMODE_ prefix (all capitals with an underscore) in the
> enum. In my view, it's ugly, but that's me. We'll see if the later if the coding
> style guide says anything about that.
I originally didn't use a prefix there, but I got worried about namespace
collisions. The prefix style was inspired by the enums in KIO, but I'll be the
first to agree with you about their ugliness. Possibly we should just do away
with the prefix altogether? As long as they're capitalised, at least they won't
collide with any method names.
> 2) I think you should be bold and move all the processing into different parse
> functions, and make KURL::parse simply select the parser function from the
> uriMode. That way, we'd have KURL::parseNormalURI being a copy of the current
> KURL::parse, then we'd also have KURL::parseRawURI (essentially a no-op) and a
> future KURL::parseMailtoURI. That would make the code cleaner.
Good idea - the way raw URIs are parsed right now is identical to the standard
parser, and with ed2k URLs what actually happens is that the whole URL aside
from the protocol ends up in the host field. This is syntactically correct if
you assume ed2k is standards compliant, but extremely counter-intuitive - it
would probably be a lot better to just put the whole URI (not including the
protocol) into the path field instead of trying to parse it as a standard URL
without the IDNA code applied on the host name like it's done right now.
Another thing that bothers me: I've tried to maintain binary compatibility as
much as possible, but I think I broke it after all, judging by Konqueror's
misbehaving before I did a complete kdebase recompile. The three extra
constructors I added are a result of this, but I find them less than pleasing -
I'd rather put the URIMode argument into the original three as a default
argument. Is that a big issue?
1) They are already namespace-safe because the enum is declared inside the KURL class.
As long as nobody goes about adding methods with the same names as the enum
constants, we'll be fine.
And that won't happen because method names should always start with a lowercase letter
(name it myFunctionLikeThis), while the enum constants start with a capital letter
(MyEnumConstant). And we don't do underscore hell. :-)
2) No, you can't change the existing constructors without breaking binary compatibility. You
can, however, make your new constructors look like the old ones and add a comment like
## BCI KDE 4.0: merge with above
(BCI stands for Binary Compatibility Issue)
The way I see it, KURL needs a cleanup for KDE 4.0. I'd actually vote for having it be
subclassed into the different URI modes. (KURL would be a misnomer for KURI, but that's
what is used throughout)
My first glance at your code did not spot any Binary Incompatible Changes. It should be
safe to run without rebuilding kdebase. Your problems might be caused by one of your
earlier builds. I'll test that when I build it myself.
Actually, scratch that. I've just spotted the BIC: you've added a new member variable to the
class, increasing its size. You can't do that. Add it to the d pointer and change your uriMode
() inline function into a full one.
Created attachment 2241 [details]
Processing modes, second attempt.
Here's a new patch. I've not been able to tell in practice if it breaks binary
compatibility, as I had to recompile my whole KDE again because it certainly
breaks compatibility with the previous patch, but as far as I can tell it looks
I've also split the parse function into one for each mode, as suggested, and
while I was at it I implemented the mailto parsing with IDNA conversion of the
host part of the email address.
Created attachment 2311 [details]
Processing modes, updated.
There have been some conflicting commits to the KURL code lately, so I thought
I'd make a new patch. After two weeks of testing, the code seems stable; can I
commit it soon? :)
Well, I'm still unable to compile or run KDE, but I hope this situation will
change later this week. I recommend, though, you post the patch to kde-core-
devel and ask for comments and more testers.
Urls are case-insensitive. So there is no deed to change valid URLs to lower
If, for example, an user writes http://ReallyLongUrl.com because it's better
readable then respect this.
The URL gets lowercased because it's part of the normalisation process. That
process is also responsible for changing the
I may add my own wisdom here regarding data urls (rfc2397).
data urls don't comply with [rfc1738, 3.1] at all, they have no host part and
no path. Therefore, *everything* has to be treated literally.
However, rfc2397 states they are subject to decoding. That means the data url
should be decoded to
Interpreting the definition of KURL::RawURI, this won't happen. If this is the
case (and I haven't misinterpreted it), data urls need another URIMode like
I also think it's a bad idea to put the raw url into the path variable. The
rationale is that for data urls, there is no such thing that is even remotely
comparable to a path. DOM's HTMLAnchorElement::pathname, however, gets
initialized with whatever KURL::path() returns which would be total crap in the
(Mozilla, e. g. fills in nothing for pathname on data urls).
Therefore I propose that with the exception of protocol() and url(), all
component-retrieval methods return an empty string on raw urls. As url()
returns the url with encodings left in place, maybe there should be an
additional function urlDecoded().
Petter: Please include "data" in the list for autodetection
if ( protocol == "ed2k" ) mode = RawURI;
+ else if ( protocol == "data" ) mode = RawURI;
else if ( protocol == "mailto" ) mode = Mailto;
Subject: Re: Lowercasing/normalising URLs breaks some
> However, rfc2397 states they are subject to decoding. That means the data url
> should be decoded to
> Interpreting the definition of KURL::RawURI, this won't happen. If this is the
> case (and I haven't misinterpreted it), data urls need another URIMode like
I'm not sure about the definition, but the RawURI implementation, at
least, does decode its input, because ed2k, sig2dat and magnet URIs want
the same behaviour as data. In fact, I don't think there's ever going to
be a situation where we don't want to decode %xx constructs.
> [...] I propose that with the exception of protocol() and url(), all
> component-retrieval methods return an empty string on raw urls.
This might cause problems with mailto URIs, though - the old behaviour
would return the email address as path(). In my implementation, they're
treated as RawURI, but with extra processing for the host part, which is
the reasoning behind putting it in path() in both cases. Then again,
it's perfectly possible to make this behaviour exclusive to mailto.
> As url() returns the url with encodings left in place, maybe there should be an
> additional function urlDecoded().
Isn't that pretty much what prettyURL() does?
> Petter: Please include "data" in the list for autodetection
Well, in this case data urls are all RawURI.
As you tell it, mailto should be treated exceptionally.
Doesn't prettyURL mean: Do any amount of incomprehensible, irreversible
distortion to make the url pretty to read, but totally unsuitable for any
Data urls have to be treated literally (the only processing allowed is %xx
decoding). If prettyURL does exactly do this for RawURI, and this behaviour is
guaranteed (i. e. doesn't start doing distortions in KDE 3.3/4.0 because of
some "usability" study), everything's well.
But deducing from the explanation in the docs "Returns the URL as string in
human-friendly format.", I'm afraid this won't hold true for long because one
day a programmer will come, and *do* distortions to make is more
Hello, this is to let you know that I'm back to my own computer. KDE recompilation is
scheduled for tonight :-)
Anyways, about the KURL::prettyURL() call: it's intended for things shown to the user, but
not for internal data. That is, if we had got data:%31%41%E1, Konqueror would show in
the status bar "data:1A%E1", as well as in the Location window if clicked. But the URL being
copied into the clipboard or sent to the ioslaves would be the one from KURL::url() -- that
The same is also true for IDN domains: url() returns the ACE-encoded domain, which is
NEVER to be shown to the user outside debugging purposes, whereas prettyURL() shows
them in the normalised Unicode form.
> Doesn't prettyURL mean: Do any amount of incomprehensible, irreversible
> distortion to make the url pretty to read, but totally unsuitable for any
> further processing?
Not at all. The golden rule of prettyURL is that
KURL( u.prettyURL() ) == u (or at least they're equivalent).
Otherwise pressing 'Enter' in the Konqueror location bar would lead to "invalid URL". So prettyURL gives a URL that _is_ suitable for further processing, and its transformations are usually reversible. Hmm did someone tell Waldo about this BR?
Anything happening here? It's getting dangerously close to the 3.2 release...
Created attachment 3046 [details]
Processing modes patch without ugly parser
Here's a new patch. What's different: KProtocolInfo now understands the
"urimode" field in the protocol description file. KURL uses KProtocolInfo if
available to get the data it needs, instead of relying on that ugly standalone
parser. Added some tests for the new API stuff, and fixed some bugs exposed by
the test suite.
How does this look?
Thank you, it looks now much nicer. Too bad KURL has to have a d-pointer now.
Anyways, there's just one thing missing I'd like to add: the automatic detection could check if the first two characters after the protocol and the colon are //. If they are, it should default to URL mode; if it doesn't, RawURI.
That is, blah://txt is URL and therefore "txt" is supposed to be a hostname, whereas blah:txt is not URL and "txt" can be anything. This is the very reason why ed2k:// is broken. If it didn't have those //, it would be ok.
PS: can you also add the @since 3.2 to the other members you added?
> Anyways, there's just one thing missing I'd like to add: the automatic
> detection could check if the first two characters after the protocol and
> the colon are //. If they are, it should default to URL mode; if it
> doesn't, RawURI.
> That is, blah://txt is URL and therefore "txt" is supposed to be a
> hostname, whereas blah:txt is not URL and "txt" can be anything. This is
> the very reason why ed2k:// is broken. If it didn't have those //, it
> would be ok.
Maybe I misunderstood what you meant, but that sounds to me like it would break a lot of things. If we do that, correctly formed URIs which don't start with // (like mailto, assuming that wasn't handled as a special case) would be parsed as raw URIs, meaning eg. "mailto:email@example.com?subject=Foobar" would get everything after the colon put into the path part, which is clearly incorrect. Moreover, all broken URLs I've come across so far (ed2k, sig2dat and slsk) all have the //, so
making assumptions based on their presence doesn't make much sense either.
I think it's safer to assume everything should be parsed as a valid URL (as it is today), unless explicitly told otherwise by a protocol file or the hardcoded fallback list.
> PS: can you also add the @since 3.2 to the other members you added?
That's not what I meant. What I wanted to do is add an extra check for the automatic mode, for when KURL doesn't find a value for a given protocol.
The decision would be thus like this:
- if the mode is on cache, use it
- if the .protocol file has the value, use it (and cache)
- if the mode hasn't been detected, look at the URI schema (the protocol)
.- if it's mailto:, set the mode to MailTo mode
.- if it's one of the known broken schemas (ed2k, sig2dat, slsk), use RawURI
and here's the change:
.- if it's still unknown, but the non-schema part starts with //, it's URL
.- otherwise, it's RawURI
What the RFCs say is that anything://something-else is a URL and that's what those broken protocols violate (because they have // not followed by a hostname). Other schemas may be specified (like data or mailto) with different parsings.
Look at these URLs (all of which are correctly parsed by KURL today):
Assuming none of those protocols have a URIMode tag in their respective protocol files (file doesn't, and KDE doesn't even handle the other two), they would all be parsed as RawURI according to the method you suggest, which means KURL::queryItem() et al wouldn't work on them as expected, and, in the case of the file URL especially, KURL::path() would be unusable. RawURI doesn't parse anything besides putting whatever comes after the colon into m_strPath after expanding %xx constructs, which is clearly not what we want in these cases. It should be reserved for protocols which are explicitly known to be broken.
Created attachment 3154 [details]
Processing modes with working mailto mode.
I broke the mailto processing in the last patch, so here's a working version,
along with the requested apidoc update. Is this starting to look committable?
It has been committable for some time. The problem is, it's never got applied. I just want someone with better knowledge of KURL than me to verify because it's a pretty intrusive patch (unlike anything I had done in KURL before). So far no one has stepped up to do that.
Will have a look at it later this week.
We have a slight problem with the current code. I've just fixed it halfway but the patch is not correct, so I'm not attaching anything.
You may have noticed that Konqueror stopped caching correctly the webpages. I've spent a couple of hours tracing down the bug and I got it narrowed to KURL::url.
The reason for that is that the caching is handled in kio_http, but the URL is read from a data stream, which doesn't and cannot carry the d->uriMode variable (the wire format is crystallised and cannot change before KDE4, no room for updates).
That means that when HTTPProtocol::checkCacheEntry calls KURL::url to determine the cache, the value of d->uriMode can be undetermined and, therefore, the path name and query can be mishandled.
The same could also be true for other cases.
My recommendation is that operator>>(QDataStream&, KURL&) make sure that d->uriMode is properly set.
I am currently testing a patch, will commit after Beta2
CVS commit by waba:
Make parsing of URLs protocol-dependant (BR62425)
Based on a patch by Petter E. Stokke
M +26 -4 kdecore/kprotocolinfo_kdecore.cpp 1.8
M +4 -3 kdecore/kprotocolinfofactory.cpp 1.2
M +2 -2 kdecore/kprotocolinfofactory.h 1.4
M +169 -9 kdecore/kurl.cpp 1.261
M +54 -3 kdecore/kurl.h 1.130
M +11 -0 kdecore/tests/kurltest.cpp 1.82
M +1 -0 kio/data.protocol 1.6
M +17 -0 kio/kio/kprotocolinfo.h 1.28
Please test if the query part isn't lost in KURL::url().
I.e., go to Google, get a search. Then hit Next. If it shows the same page again instead of the next results, it's this bug.
Created attachment 3664 [details]
KProtocolInfo::uriParseMode default value fix
KProtocolInfo::uriParseMode needs to return Auto if the URI mode is missing,
otherwise the fallback list is never used, causing mailto URLs to be parsed as
regular URLs in the absence of a mailto.protocol file (there isn't one in KDE).
should we reopen this report cause QUrl and with it KDE4's KUrl are not any longer able to deal with RAW-mode url's/uri's?
probably somewhat related report is bug #159017
Hmm... looks like we lost the KURL fixes when we moved to QUrl.
QUrl is to-the-letter strict. If your URI is not a URL, placing it in a QUrl will lose some of the data.