Bug 261223 - Download name of link targets containing special characters get ignored
Summary: Download name of link targets containing special characters get ignored
Status: RESOLVED FIXED
Alias: None
Product: kio
Classification: Frameworks and Libraries
Component: http (show other bugs)
Version: SVN
Platform: openSUSE Linux
: NOR normal
Target Milestone: ---
Assignee: webkit-devel
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-12-25 13:38 UTC by squan
Modified: 2011-03-06 21:32 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In: 4.6.2


Attachments
screenshot of prompt... (35.69 KB, image/png)
2011-01-03 02:45 UTC, Dawit Alemayehu
Details
proposed patch.. (1.51 KB, patch)
2011-01-06 21:27 UTC, Dawit Alemayehu
Details

Note You need to log in before you can comment on or make changes to this bug.
Description squan 2010-12-25 13:38:08 UTC
Version:           unspecified (using KDE 4.5.90) 
OS:                Linux

Download names of javascript link targets which contain e.g. german umlauts are not handled  properly with all KDE browsers (namely konqueror, rekonq, midori).
E.g a download link on save.tv
which javascript:STV.Archive.Download.prepareDownload(true)
which will offer the file name on the server as name for the download file
while the correct name would be
   König_der_Könige_24-12-2010_2015.avi
With prominent browsers like firefox or chromium this just works.

This is not webkit specific as konqueror with khtml has the same problem.


Reproducible: Always
Comment 1 Dawit Alemayehu 2010-12-26 06:16:46 UTC
First, if the issue is common in both khtml and kdewebkit, then it must originate in a common component like kio_http and should not be reported against kwebkitpart specifically.  Secondly, without a sample link it would be impossible for us to reproduce the problem. As you have to obtain and provide to us the http header from the server. You can find the instruction on how to do that at

http://www.konqueror.org/investigatebug/#http
Comment 2 squan 2011-01-02 22:22:16 UTC
The link URLs look like this:
http://csn01.save.tv/13225552_CB3D53E19774F97FDA5C710E044C5072_0/?m=dl
The server is currently not very responsive, but this unrelated to the bug.
For the given URL the target filename is without special characters but due to another (more recent) bug it will not be presented either (with KDE browsers).
To see the correct target filename try e.g. firefox.

> ...and should not be reported against kwebkitpart specifically
ok,  component may be kio_http or kurifilter (just uneducated guesses)
Comment 3 Dawit Alemayehu 2011-01-03 02:44:10 UTC
(In reply to comment #2)
> The link URLs look like this:
> http://csn01.save.tv/13225552_CB3D53E19774F97FDA5C710E044C5072_0/?m=dl
> The server is currently not very responsive, but this unrelated to the bug.

I am presented with the dialog box shown in the attached screenshot. Is that the correct name you were exepecting ? Here is what the request and response look like from  kio_http's prespective:

HTTPProtocol::sendQuery: ============ Sending Header:
HTTPProtocol::sendQuery: "GET /13225552_CB3D53E19774F97FDA5C710E044C5072_0/?m=dl HTTP/1.1"
HTTPProtocol::sendQuery: "Host: csn01.save.tv"
HTTPProtocol::sendQuery: "Connection: Keep-Alive"
HTTPProtocol::sendQuery: "User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.16 (KHTML, like Gecko) konqueror/4.6.40 Safari/534.16"
HTTPProtocol::sendQuery: "Pragma: no-cache"
HTTPProtocol::sendQuery: "Cache-control: no-cache"
HTTPProtocol::sendQuery: "Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"
HTTPProtocol::sendQuery: "Accept-Encoding: x-gzip, x-deflate, gzip, deflate"
HTTPProtocol::sendQuery: "Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5"
HTTPProtocol::sendQuery: "Accept-Language: en-US,en;q=0.9"
HTTPProtocol::readResponseHeader: ============ Received Status Response:
HTTPProtocol::readResponseHeader: "HTTP/1.1 200 OK"
HTTPProtocol::readResponseHeader: wasAuthError= false isAuthError= false sameAuthError= false
HTTPProtocol::readResponseHeader:  -- full response:
"HTTP/1.1 200 OK
Content-Type: application/octet-stream
Connection: close
Server: Save.TV/2.51
Content-Disposition: attachment; filename=Eine_Familie_namens_Beethoven_01-01-2011_1435_312315.avi
Content-Length: 980521140
Accept-Ranges: bytes
Last-Modified: Sun, 2 Jan 2011 22:24:22 +0100
Etag: "13225552_CB3D53E19774F97FDA5C710E044C5072_0""

> For the given URL the target filename is without special characters but due to
> another (more recent) bug it will not be presented either (with KDE browsers).

If you are talking about kdewebkit browsers only in the most recent KDE 4.6 RC1 release, then it is a known issue that has already been fixed in the 4.6 branch.

> To see the correct target filename try e.g. firefox.

I get the same thing in the following: Firefox, Chromium, Konqueror + kwebkitpart, Konqueror + khtml and reKonq.

> > ...and should not be reported against kwebkitpart specifically
> ok,  component may be kio_http or kurifilter (just uneducated guesses)

If there is a bug that affects both kdewekit and khtml based browsers, then it must be in the lower level class KIO related classes. KUriFilter has nothing to do with such issues at all. Anyhow, I still would need a link that returns the afforementioned german umaluts in order to try and duplicate the problem ; so we can find out where it might be...
Comment 4 Dawit Alemayehu 2011-01-03 02:45:12 UTC
Created attachment 55494 [details]
screenshot of prompt...
Comment 5 squan 2011-01-03 08:58:21 UTC
> Is that the correct name you were exepecting?
Yes, that's how it should be.

> If you are talking about kdewebkit browsers only in the most recent
> KDE 4.6 RC1 release, then it is a known issue that has 
> already been fixed in the 4.6 branch.
Yes. Obviously with this fix it works correct in case of no umlauts.

> I still would need a link that returns the afforementioned german umlauts
Pardon, yesterday I had no such link, so I produced a new one over night:
http://csk13.save.tv/12533752_5BC5E6940F545D5455104B3984D2AC91_0/?m=dl
Comment 6 Dawit Alemayehu 2011-01-03 18:56:24 UTC
(In reply to comment #5)
> > Is that the correct name you were exepecting?
> Yes, that's how it should be.
> 
> > If you are talking about kdewebkit browsers only in the most recent
> > KDE 4.6 RC1 release, then it is a known issue that has 
> > already been fixed in the 4.6 branch.
> Yes. Obviously with this fix it works correct in case of no umlauts.
> 
> > I still would need a link that returns the afforementioned german umlauts
> Pardon, yesterday I had no such link, so I produced a new one over night:
> http://csk13.save.tv/12533752_5BC5E6940F545D5455104B3984D2AC91_0/?m=dl

Here is the request/response for the link above:

HTTPProtocol::sendQuery: ============ Sending Header:
HTTPProtocol::sendQuery: "GET /12533752_5BC5E6940F545D5455104B3984D2AC91_0/?m=dl HTTP/1.1"
HTTPProtocol::sendQuery: "Host: csk13.save.tv"
HTTPProtocol::sendQuery: "Connection: Keep-Alive"
HTTPProtocol::sendQuery: "User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/534.16 (KHTML, like Gecko) konqueror/4.6.40 Safari/534.16"
HTTPProtocol::sendQuery: "Pragma: no-cache"
HTTPProtocol::sendQuery: "Cache-control: no-cache"
HTTPProtocol::sendQuery: "Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"
HTTPProtocol::sendQuery: "Accept-Encoding: x-gzip, x-deflate, gzip, deflate"
HTTPProtocol::sendQuery: "Accept-Charset: utf-8, utf-8;q=0.5, *;q=0.5"
HTTPProtocol::sendQuery: "Accept-Language: en-US,en;q=0.9"
HTTPProtocol::readResponseHeader: ============ Received Status Response:
HTTPProtocol::readResponseHeader: "HTTP/1.1 200 OK"
HTTPProtocol::readResponseHeader: wasAuthError= false isAuthError= false sameAuthError= false
HTTPProtocol::readResponseHeader:  -- full response:
"HTTP/1.1 200 OK
Content-Type: application/octet-stream
Connection: close
Server: Save.TV/2.51
Content-Disposition: attachment; filename=SpongeBob_Schwammkopf_Gedächtnisverlust_02-01-2011_2330_119407.avi
Content-Length: 221861368
Accept-Ranges: bytes
Last-Modified: Mon, 3 Jan 2011 08:39:23 +0100
Etag: "12533752_5BC5E6940F545D5455104B3984D2AC91_0""

The problem stems from the fact that you do not quote the filename in the content-disposition header as specified in section 19.5.1 of RFC 2616. However, quoting the filename is not strictly enforced and so kio_http should handle it correctly. Unfortunately a programming error prevents unquoted filenames with non-latin1 characters from being parsed correctly in kio_http and hence this bug. Reassigning to kio_http...
Comment 7 Rolf Eike Beer 2011-01-04 00:14:00 UTC
I'll have a look if quoting is required or not.
Comment 8 Dawit Alemayehu 2011-01-06 21:27:03 UTC
Created attachment 55657 [details]
proposed patch..
Comment 9 Rolf Eike Beer 2011-01-06 23:37:22 UTC
The browsers behave absolutely correct: they ignore wrong values. The server _must_ send everything that does not fall into US-ASCII as quoted string (and even strings that have a special set of US-ASCII characters).

A header is a key=value field, where value may either be a "token" or a "quoted string". Of course there are rules for tokens:

This comes from RfC 2616, section 2.2 (http://tools.ietf.org/html/rfc2616#section-2.2):

CHAR           = <any US-ASCII character (octets 0 - 127)>
token          = 1*<any CHAR except CTLs or separators>

This RfC also says:

Many HTTP/1.1 header field values consist of words separated by LWS or special characters. These special characters MUST be in a quoted string to be used within a parameter value (as defined in section 3.6).

So the server just sends garbage and the browser just ignores it. There would be other ways of solving this problem by guessing, but guessing would fail sometimes and make the code much more complicated. And at the end that's were rules are for, to make it clear how this stuff has to look like. So please go and bug save.tv to fix their JavaScript.
Comment 10 squan 2011-01-09 16:22:39 UTC
> The server _must_ send everything that does not fall into US-ASCII as quoted string
> (and even strings that have a special set of US-ASCII characters).
Ok, so that shows that save.tv server clearly violates web standards.
But please don't expect save.tv to be a rare exception but take it as just an example.
Firefox and chromium (e.g.) work around this RfC violation and properly extract the filename from the server response. Expect that all major browsers do like this.
And so should KDE browsers.
Just proving that the server isn't RfC compliant is not a sufficient justification to close this bug as INVALID.

> So the server just sends garbage and the browser just ignores it.
The attribute value in question is no garbage but just missing the quotes required by RfC 2216.

> There would be other ways of solving this problem by guessing,
> but guessing would fail sometimes and make the code much more complicated.
Just disdarding the filename attribute of the server response severely hampers downloads because the user has to retrieve and enter it manually in each single case.
May be that diffentiation of a token from a string missing quotes will be a bit messy, but I think fixing this use case for all KDE browsers is worth it.

> So please go and bug save.tv to fix their JavaScript.
Inspite major browsers coping with this RfC incompliance without problem I have not much hope to succeed with that.
Comment 11 Rolf Eike Beer 2011-01-09 17:03:38 UTC
>> So the server just sends garbage and the browser just ignores it.
> The attribute value in question is no garbage but just missing the quotes
> required by RfC 2216.

So it is invalid, which qualifies it as garbage.

See http://greenbytes.de/tech/tc2231/#attwithasciifilenamenqws for basically the same example and how basically all browsers misbehave. Note the green line of Konqueror. You could ask Julian if this shouldn't be a expected fallback behaviour.

>> So please go and bug save.tv to fix their JavaScript.
> Inspite major browsers coping with this RfC incompliance without problem I have
> not much hope to succeed with that.

AFAICS they replaced spaces by underscores in the filenames, likely because that caused some trouble. If you tell them that quoting solves more of their problems maybe they will listen. Test it out.
Comment 12 Dawit Alemayehu 2011-01-09 18:28:55 UTC
(In reply to comment #11)
> >> So the server just sends garbage and the browser just ignores it.
> > The attribute value in question is no garbage but just missing the quotes
> > required by RfC 2216.
> 
> So it is invalid, which qualifies it as garbage.

I believe you are incorrect about this whole thing:

First, the "Content-Disposition" header is not an HTTP standard. It is simply a derived one that relies on other standards for its definition as well as security implications. See RFC 2616 section 15.5. As such the sections of RFC 2616 in comment #9 are not technically applicable to this header. You would be better served by looking at RFCs 2183 and RFC 5987 if you have not already done so. Specially RFC 5987, which is not only more recent than RFC 2616, but it also has specific recommendations about the parameters in content-disposition, see
section 3.2.1.

Second, there is actually a prefect reason why file names with LWSP should be quoted. It avoids ambiguity for parsers that might rely on white spaces to create tokens. However, there is no reason or justification for following that for non-latin1 characters that are not
special characters like control and LWSP characters.

Third, and we currently violate the spec when the file name are quoted. There is no check for illegal characters such as control characters when parsing quoted file names.

Anyhow, to me even if that was not the case and we did everything right, but all the other browsers implemented something differently, then the defacto standard is the one that should be chosen. There is absolutely no way you can adhere to RFCs 100% and get anything that is
usable in the real world. A little pragmatism is necessary when implementing these standards. Believe I learned that the hard way as well after implementing several RFCs including 2616 for which I had to write many workarounds for brain-dead implementations...


> See http://greenbytes.de/tech/tc2231/#attwithasciifilenamenqws for basically
> the same example and how basically all browsers misbehave. 

Ahhh... that test is for a white space... How are white spaces and valid non-latin1 characters the same ? And why should they be treated the same ?

> Note the green line of Konqueror. You could ask Julian if this shouldn't be a 
> expected fallback behaviour.

And the patch I propsed does not break that test, but fixes the problem reported here.
Comment 13 Rolf Eike Beer 2011-01-09 19:52:19 UTC
http://www.ietf.org/rfc/rfc5987.txt, first sentence on page 5:

   Thus, a parameter is either a regular parameter (reg-parameter), as
   previously defined in Section 3.6 of [RFC2616], or an extended
   parameter (ext-parameter).
Comment 14 Dawit Alemayehu 2011-01-10 20:22:36 UTC
(In reply to comment #13)
> http://www.ietf.org/rfc/rfc5987.txt, first sentence on page 5:
> 
>    Thus, a parameter is either a regular parameter (reg-parameter), as
>    previously defined in Section 3.6 of [RFC2616], or an extended
>    parameter (ext-parameter).

I had a email exchange about this issue with Julian, the author of the content-disposition test site, and asked him to add a test case for this particular scenario. He was receptive to the idea and hopefully he would do so at some point... He also seems to support your position on the spec ; so I will defer on whether or not we should workaround this problem to you... 

I personally think that we should simply workaround this issue specially since we can do so easily without violating the other requirements of the spec and because this problem is no different than many of the other spec violations for which we already provide workarounds in kio_http.
Comment 15 Dawit Alemayehu 2011-01-10 20:28:13 UTC
Thanks to Julian, the test cases are already there:

http://greenbytes.de/tech/tc2231/#attfnbrokentokeniso
http://greenbytes.de/tech/tc2231/#attfnbrokentokenutf
Comment 16 Dawit Alemayehu 2011-03-01 19:15:41 UTC
I am going to apply the patch I proposed previously as a workaround for this issue. It absolutely makes no sense for us to rigidly adhere to the RFC when none of the other client apps do. BTW, this is no different than any of the other RFC violations we had implemented in kio_http to date, e.g. treating a 302 and 303 server responses the same way.
Comment 17 Rolf Eike Beer 2011-03-01 19:26:57 UTC
Agreed. But I find the isPrint() conversion doubtful. Please also make sure all existing testcases work properly.
Comment 18 Dawit Alemayehu 2011-03-01 22:20:19 UTC
(In reply to comment #17)
> Agreed. But I find the isPrint() conversion doubtful. Please also make sure all
> existing testcases work properly.

Why do you think the isPrint() conversion is doubtful ? It correctly excludes every control character, which includes the vertical and horizontal spacers (\v, \t, \n, \r), while making it possible to allow unicode characters present in the string.

On top of that all the test cases pass as well...
Comment 19 Rolf Eike Beer 2011-03-01 22:44:49 UTC
> > Agreed. But I find the isPrint() conversion doubtful. Please also make
> > sure all existing testcases work properly.
> 
> Why do you think the isPrint() conversion is doubtful ? It correctly
> excludes every control character, which includes the vertical and
> horizontal spacers (\v, \t, \n, \r), while making it possible to allow
> unicode characters present in the string.
> 
> On top of that all the test cases pass as well...

Hm, that change seems to be one of the central points of this change, no? ;) 
Ok, ignore me then. I'll review that one tomorrow again when I'm less 
confused.
Comment 20 Dawit Alemayehu 2011-03-06 20:56:40 UTC
(In reply to comment #19)
> > > Agreed. But I find the isPrint() conversion doubtful. Please also make
> > > sure all existing testcases work properly.
> > 
> > Why do you think the isPrint() conversion is doubtful ? It correctly
> > excludes every control character, which includes the vertical and
> > horizontal spacers (\v, \t, \n, \r), while making it possible to allow
> > unicode characters present in the string.
> > 
> > On top of that all the test cases pass as well...
> 
> Hm, that change seems to be one of the central points of this change, no? ;) 
> Ok, ignore me then. I'll review that one tomorrow again when I'm less 
> confused.

Yes, the use of isPrint() is esentially the fix and the unit test for it shows no problems:

$ ./httpheaderdispositiontest
********* Start testing of HeaderDispositionTest *********
Config: Using QTest library 4.7.2, Qt 4.7.2
PASS   : HeaderDispositionTest::initTestCase()
PASS   : HeaderDispositionTest::runAllTests()
PASS   : HeaderDispositionTest::cleanupTestCase()
Totals: 3 passed, 0 failed, 0 skipped
********* Finished testing of HeaderDispositionTest *********

So in it goes. :)
Comment 21 Dawit Alemayehu 2011-03-06 21:29:32 UTC
Git commit 0672bbd10827a95b81fc3358c29fb10cab12ec02 by Dawit Alemayehu.
Committed on 06/03/2011 at 21:09.
Pushed by adawit into branch 'KDE/4.6'.

Allow unicode characters in unquoted filename parameters of the
content-disposition header.

BUG:261223
FIXED-IN:4.6.2

M  +10   -5    kioslave/http/parsinghelpers.cpp     

http://commits.kde.org/kdelibs/0672bbd10827a95b81fc3358c29fb10cab12ec02
Comment 22 Dawit Alemayehu 2011-03-06 21:32:56 UTC
Git commit dec857b341e11987b55b2f80d15be34a5e9131d0 by Dawit Alemayehu.
Committed on 06/03/2011 at 21:28.
Pushed by adawit into branch 'master'.

Allow unicode characters in unquoted filename parameters of the
content-disposition header.

BUG:261223
FIXED-IN:4.6.2

M  +10   -5    kioslave/http/parsinghelpers.cpp     

http://commits.kde.org/kdelibs/dec857b341e11987b55b2f80d15be34a5e9131d0