Bug 154142

Summary: Recoded data in multipart/form-data
Product: [Applications] konqueror
Reporter: Tim Landscheidt <tim>
Component: khtml
Assignee: Konqueror Developers <konq-bugs>
Status: RESOLVED FIXED
Severity: normal
CC: chrislb, finex, maarten, maksim
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: Fedora RPMs   
OS: Linux   

Description Tim Landscheidt 2007-12-15 21:01:50 UTC
Version:            (using KDE 3.5.8)
Installed from:    Fedora RPMs
OS:                Linux

If you edit the German Wikipedia article on Denmark (<URI:http://de.wikipedia.org/w/index.php?title=D%C3%A4nemark&action=edit>) and, without changing anything, immediately press "Änderungen zeigen" ("Show changes"), the line:

| [[got:&#55296;&#57139;&#55296;&#57136;&#55296;&#57149;&#55296;&#57136;&#55296;&#57148;&#55296;&#57136;&#55296;&#57154;&#55296;&#57146;&#55296;&#57155;]] 

is replaced by:

| [[got:&#55296;&#57139;&#55296;&#57136;&#55296;&#57149;&#55296;&#57136;&#55296;&#57148;&#55296;&#57136;&#55296;&#57154;&#55296;&#57146;&#55296;&#57155;]]

A test shows that:

- the "recoding" is done by Konqueror on the upload and
- the header of the corresponding part consists only of 'Content-Disposition: form-data; name="wpTextbox1"' which - by my understanding of <URI:http://www.w3.org/TR/html401/interact/forms.html#form-content-type> - would mean that the contents were 7-bit ASCII.

At the moment, this is a (very minor :-)) showstopper in this particular application, as MediaWiki cannot parse the interwiki link in the recoded line.
Comment 1 Tim Landscheidt 2007-12-15 21:03:37 UTC
Well, submitting a bug about Konqueror with Konqueror seems to be a little problem :-). Please check the mentioned link for the raw data.
Comment 2 FiNeX 2008-04-21 14:00:24 UTC
Confirmed on Konqueror 4 (trunk r797319).
Comment 3 Tim Landscheidt 2009-02-18 20:34:47 UTC
Still a bug in 4.2.
Comment 4 Maksim Orlovich 2009-02-18 21:45:32 UTC
Sorry, but I don't see the preview button there?
Comment 5 Tim Landscheidt 2009-02-18 22:25:48 UTC
Sorry, that is because the article cannot be edited by anonymous users. Please try <URI:http://de.wikipedia.org/w/index.php?title=Gotische_Sprache&action=edit> instead. The button is labelled "Änderungen zeigen" ("Show changes").
Comment 6 Maksim Orlovich 2009-02-19 02:17:59 UTC
Thanks. What seems to happen is the following:
That text is outside the Basic Multilingual Plane, so in UTF-16 it is represented as a surrogate pair. Then, when we serialize it out into UTF-8, encoding the first half of the pair succeeds and encoding the second half fails, so it gets escaped into stuff like &#55296; ... and then, because there is only half of a pair followed by &, it (probably) gets swallowed up by the encoding pass.
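To make the surrogate-pair arithmetic concrete, here is a minimal standalone sketch in plain C++ (no Qt). The helper names `toSurrogates` and `fromSurrogates` are made up for illustration and are not part of KHTML. It only shows how one code point above U+FFFF maps to two UTF-16 code units and back, which is why the escaped numbers in the report come in pairs.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Split a non-BMP code point (U+10000..U+10FFFF) into its UTF-16
// high and low surrogate halves. Hypothetical helper, for illustration only.
std::pair<std::uint16_t, std::uint16_t> toSurrogates(std::uint32_t cp) {
    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    return { static_cast<std::uint16_t>(0xD800 + (v >> 10)),      // top 10 bits
             static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)) };  // low 10 bits
}

// Recombine a high/low surrogate pair into the code point it represents.
std::uint32_t fromSurrogates(std::uint16_t high, std::uint16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For example, `fromSurrogates(55296, 57139)` yields 0x10333, so the first escaped pair quoted in the description, `&#55296;&#57139;`, is the two halves of the single code point U+10333.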
Comment 7 Maksim Orlovich 2009-02-19 02:30:38 UTC
The following may be one approach, but I want to consult with some people before committing it. Hmm, do we even need to do the escaping pass for a Unicode codec like UTF-8 in the first place?

--- html/html_formimpl.cpp      (revision 925164)
+++ html/html_formimpl.cpp      (working copy)
@@ -201,11 +201,17 @@
 inline static QString escapeUnencodeable(const QTextCodec* codec, const QString& s) {
     QString enc_string;
     const int len = s.length();
+
+    // Workaround below: the utf8 codec reports it can't encode the second half of a surrogate
+    // pair, so we need to force-feed it to it
+    // ### this may not be quite right if it's malformed, though
+    bool utf8 = (codec->mibEnum() == 106);
+
     for(int i=0; i <len; ++i) {
         const QChar c = s[i];
-        if (codec->canEncode(c))
+        if (codec->canEncode(c) || (utf8 && 0xDC00 <= c && c <= 0xDFFF)) {
             enc_string.append(c);
-        else {
+        } else {
             QString ampersandEscape;
             ampersandEscape.sprintf("&#%u;", c.unicode());
             enc_string.append(ampersandEscape);
Comment 8 Maksim Orlovich 2009-02-21 18:54:45 UTC
SVN commit 929608 by orlovich:

Make sure we properly group surrogate pairs when letting the codec 
check whether it can encode them or not, so we don't mess up 
non-BMP characters (which can show up on wikipedia, at least)
BUG: 154142


 M  +36 -12    html_formimpl.cpp  


WebSVN link: http://websvn.kde.org/?view=rev&revision=929608
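
The commit itself is not quoted in this report, so the following is only a sketch of the idea named in the commit message (group surrogate pairs before asking whether they can be encoded), not the committed code. It uses plain C++ instead of Qt. The function name `escapeGrouped` and the `canEncode` callback are assumptions standing in for the codec query, and passing lone surrogates to the callback unchanged is an arbitrary choice made here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

namespace {
bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }
}

// Replace code points the target encoding cannot represent with numeric
// character references, treating a well-formed surrogate pair as ONE unit.
// canEncode is a hypothetical stand-in for a codec's "can encode?" query.
std::u16string escapeGrouped(const std::u16string& s,
                             const std::function<bool(char32_t)>& canEncode) {
    std::u16string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t cp = s[i];
        std::size_t units = 1;
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
            // Decode the pair so the whole code point is tested at once.
            cp = 0x10000 + ((static_cast<char32_t>(s[i]) - 0xD800) << 10)
                         + (s[i + 1] - 0xDC00);
            units = 2;
        }
        assert(units == 1 || cp > 0xFFFF);  // a grouped pair is always non-BMP
        if (canEncode(cp)) {
            out.append(s, i, units);  // keep both halves together
        } else {
            const std::string ref =
                "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
            out.append(ref.begin(), ref.end());  // ASCII, widened per character
        }
        i += units - 1;  // skip the low surrogate if a pair was consumed
    }
    return out;
}
```

With a callback that accepts every non-surrogate code point (roughly what a UTF-8 target should do), a string containing U+10333 passes through unchanged. With an ASCII-only callback, the same character becomes the single reference `&#66355;` rather than two half-pair escapes.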
Comment 9 Maksim Orlovich 2009-02-21 18:59:47 UTC
SVN commit 929611 by orlovich:

Merged revision 929608:
Make sure we properly group surrogate pairs when letting the codec 
check whether it can encode them or not, so we don't mess up 
non-BMP characters (which can show up on wikipedia, at least)
BUG: 154142

 M  +36 -12    html_formimpl.cpp  


WebSVN link: http://websvn.kde.org/?view=rev&revision=929611
Comment 10 Maksim Orlovich 2009-03-13 13:49:17 UTC
*** Bug 180416 has been marked as a duplicate of this bug. ***