Summary: | Recoded data in multipart/form-data | ||
---|---|---|---|
Product: | [Applications] konqueror | Reporter: | Tim Landscheidt <tim> |
Component: | khtml | Assignee: | Konqueror Developers <konq-bugs> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | chrislb, finex, maarten, maksim |
Priority: | NOR | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | Fedora RPMs | ||
OS: | Linux | ||
Description
Tim Landscheidt
2007-12-15 21:01:50 UTC
Well, submitting a bug about Konqueror with Konqueror seems to be a little problem :-). Please check the mentioned link for the raw data.

Confirmed on Konqueror 4 (trunk r797319).

Still a bug in 4.2.

Sorry, but I don't see the preview button there?

Sorry, that is due to the circumstance that the article cannot be edited by anonymous users. Please try <URI:http://de.wikipedia.org/w/index.php?title=Gotische_Sprache&action=edit> instead. The button is labelled "Änderungen zeigen" ("Show changes").

Thanks. What seems to happen is the following: that text is outside the Basic Multilingual Plane, so in UTF-16 it gets represented as a surrogate pair. Then, when we're serializing it out into UTF-8, the encoding of the first half of the pair succeeds and that of the second fails, so it gets escaped into stuff like &#NNNNN; and then, because there is only half of a pair followed by &, it (probably) gets swallowed up by the encoding pass.

The following may be one approach, but I want to consult with some people before committing it. Hmm, do we even need to do the escaping pass for a Unicode codec like UTF-8 in the first place?
--- html/html_formimpl.cpp (revision 925164)
+++ html/html_formimpl.cpp (working copy)
@@ -201,11 +201,17 @@
 inline static QString escapeUnencodeable(const QTextCodec* codec, const QString& s) {
     QString enc_string;
     const int len = s.length();
+
+    // Workaround below: the utf8 codec reports it can't encode the second half of a surrogate
+    // pair, so we need to force-feed it to it
+    // ### this may not quite right if it's malformed, though
+    bool utf8 = (codec->mibEnum() == 106);
+
     for(int i=0; i <len; ++i) {
         const QChar c = s[i];
-        if (codec->canEncode(c))
+        if (codec->canEncode(c) || (utf8 && 0xDC00 <= c && c <= 0xDFFF)) {
             enc_string.append(c);
-        else {
+        } else {
             QString ampersandEscape;
             ampersandEscape.sprintf("&#%u;", c.unicode());
             enc_string.append(ampersandEscape);

SVN commit 929608 by orlovich:

Make sure we properly group surrogate pairs when letting the codec check whether it can encode them or not, so we don't mess up non-BMP characters (which can show up on wikipedia, at least)

BUG: 154142

M +36 -12 html_formimpl.cpp

WebSVN link: http://websvn.kde.org/?view=rev&revision=929608

SVN commit 929611 by orlovich:

Merged revision 929608: Make sure we properly group surrogate pairs when letting the codec check whether it can encode them or not, so we don't mess up non-BMP characters (which can show up on wikipedia, at least)

BUG: 154142

M +36 -12 html_formimpl.cpp

WebSVN link: http://websvn.kde.org/?view=rev&revision=929611

*** Bug 180416 has been marked as a duplicate of this bug. ***
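
For readers who want to see the "group surrogate pairs" idea in isolation, here is a minimal sketch. It is not the code from commit 929608, which is not reproduced on this page. It uses only the C++ standard library instead of Qt, and every name in it (`CanEncode`, `escapeUnencodeableGrouped`, the helper predicates) is hypothetical. The assumption is that the encodability check takes a whole code point, so a high/low surrogate pair is combined before the check instead of being tested one UTF-16 code unit at a time. A lone surrogate is simply passed to the check as-is, which is one possible choice for malformed input, not necessarily what KHTML does.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for a codec's "can this character be encoded?" check.
// It receives a full Unicode code point, not a single UTF-16 code unit.
using CanEncode = std::function<bool(char32_t)>;

inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Copy encodable characters through unchanged and replace the rest with a
// decimal numeric character reference ("&#N;"). A well-formed surrogate pair
// is treated as one character, so it is either kept whole or escaped whole.
std::u16string escapeUnencodeableGrouped(const std::u16string& s,
                                         const CanEncode& canEncode)
{
    std::u16string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t cp = s[i];
        std::size_t units = 1;

        // Combine a high surrogate with the low surrogate that follows it.
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
            cp = 0x10000
               + ((static_cast<char32_t>(s[i]) - 0xD800) << 10)
               + (static_cast<char32_t>(s[i + 1]) - 0xDC00);
            units = 2;
        }

        if (canEncode(cp)) {
            out.append(s, i, units);   // keep one or both code units
        } else {
            const std::string digits = std::to_string(static_cast<unsigned long>(cp));
            out += u"&#";
            for (char d : digits)
                out.push_back(static_cast<char16_t>(d));
            out += u';';
        }
        i += units - 1;                // skip the low surrogate if we consumed it
    }
    return out;
}
```

With a predicate that accepts every code point except surrogates (roughly what a UTF-8 encoder can do), a non-BMP character such as U+10330 passes through as its two original code units. With a Latin-1-like predicate, the same character becomes the single reference `&#66352;` instead of two separate half-pair escapes.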