Version: (using KDE 3.5.8)
Installed from: Fedora RPMs
OS: Linux

If you edit the German Wikipedia article on Denmark (<URI:http://de.wikipedia.org/w/index.php?title=D%C3%A4nemark&action=edit>) and, without changing anything, immediately press "Änderungen zeigen" ("Show changes"), the line:

| [[got:������������������]]

is replaced by:

| [[got:������������������]]

A test shows that:
- the "recoding" is done by Konqueror on upload, and
- the header of the corresponding multipart part consists only of 'Content-Disposition: form-data; name="wpTextbox1"', which, by my understanding of <URI:http://www.w3.org/TR/html401/interact/forms.html#form-content-type>, would mean that the contents are 7-bit ASCII.

At the moment, this is a (very minor :-)) showstopper for this particular application, as MediaWiki cannot parse the interwiki link in the recoded line.
Well, submitting a bug about Konqueror with Konqueror turns out to be a bit of a problem :-): the non-ASCII characters in the two quoted lines were mangled in this report. Please check the linked page for the raw data.
confirmed on konqueror 4 (trunk r797319)
Still a bug in 4.2.
Sorry, but I don't see the preview button there?
Sorry, that is because the article cannot be edited by anonymous users. Please try <URI:http://de.wikipedia.org/w/index.php?title=Gotische_Sprache&action=edit> instead. The button is labelled "Änderungen zeigen" ("Show changes").
Thanks. What seems to happen is the following: that text is outside the Basic Multilingual Plane, so in UTF-16 it is represented as a surrogate pair. Then, when we serialize it out as UTF-8, encoding the first half of the pair succeeds and encoding the second half fails, so the second half gets escaped into a numeric character reference of the form &#NNNNN; ... and then, because there is only half of a pair followed by an &, it (probably) gets swallowed up by the encoding pass.
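To make the mechanism described above concrete, here is a minimal standalone sketch in plain C++ (no Qt). It is not the KHTML code: `toUtf16`, `canEncodeUnit` and `escapePerUnit` are hypothetical names, and `canEncodeUnit` simply models the reported behaviour, assuming a codec check that rejects a low surrogate when it is looked at in isolation.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Code points above U+FFFF become a high/low surrogate pair.
std::u16string toUtf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}

// Hypothetical stand-in for a per-unit "can the codec encode this?" check.
// Assumption: it accepts everything except a lone low surrogate.
bool canEncodeUnit(char16_t u) {
    return !(u >= 0xDC00 && u <= 0xDFFF);
}

// Escaping pass that checks every code unit on its own. A surrogate pair
// is torn apart: the high half is kept, the low half becomes "&#N;".
std::u16string escapePerUnit(const std::u16string& s) {
    std::u16string out;
    for (char16_t u : s) {
        if (canEncodeUnit(u)) {
            out.push_back(u);
        } else {
            out += u"&#";
            for (char d : std::to_string(static_cast<unsigned>(u)))
                out.push_back(static_cast<char16_t>(d));
            out.push_back(u';');
        }
    }
    return out;
}
```

For example, `toUtf16(0x10330)` (a Gothic letter) yields the two code units 0xD800 and 0xDF30, and `escapePerUnit` on that pair leaves an unpaired 0xD800 followed by the text `&#57136;`, which can no longer be serialized as well-formed UTF-8.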
The following may be one approach, but I want to consult w/some people before committing it.. Hmm, do we even need to do the escaping pass for a unicode codec like utf-8 in the first place?

--- html/html_formimpl.cpp (revision 925164)
+++ html/html_formimpl.cpp (working copy)
@@ -201,11 +201,17 @@
 inline static QString escapeUnencodeable(const QTextCodec* codec, const QString& s) {
     QString enc_string;
     const int len = s.length();
+
+    // Workaround below: the utf8 codec reports it can't encode the second half of a surrogate
+    // pair, so we need to force-feed it to it
+    // ### this may not quite right if it's malformed, though
+    bool utf8 = (codec->mibEnum() == 106);
+
     for(int i=0; i <len; ++i) {
         const QChar c = s[i];
-        if (codec->canEncode(c))
+        if (codec->canEncode(c) || (utf8 && 0xDC00 <= c && c <= 0xDFFF)) {
             enc_string.append(c);
-        else {
+        } else {
             QString ampersandEscape;
             ampersandEscape.sprintf("&#%u;", c.unicode());
             enc_string.append(ampersandEscape);
SVN commit 929608 by orlovich:

Make sure we properly group surrogate pairs when letting the codec check whether it can encode them or not, so we don't mess up non-BMP characters (which can show up on wikipedia, at least)

BUG: 154142

M +36 -12 html_formimpl.cpp

WebSVN link: http://websvn.kde.org/?view=rev&revision=929608
SVN commit 929611 by orlovich:

Merged revision 929608:
Make sure we properly group surrogate pairs when letting the codec check whether it can encode them or not, so we don't mess up non-BMP characters (which can show up on wikipedia, at least)

BUG: 154142

M +36 -12 html_formimpl.cpp

WebSVN link: http://websvn.kde.org/?view=rev&revision=929611
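The idea named in the commit message, grouping surrogate pairs before asking whether they can be encoded, can be sketched roughly as follows. This is an illustration in plain C++, not the committed KHTML code: `escapeGrouped` and `canEncodeChunk` are hypothetical names, and `canEncodeChunk` is an assumed stand-in for a Unicode codec such as UTF-8 that accepts any well-formed chunk but not an unpaired surrogate.

```cpp
#include <cstddef>
#include <string>

bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical codec check on one chunk: either a single code unit or a
// complete surrogate pair. Assumption: only unpaired surrogates are rejected.
bool canEncodeChunk(const std::u16string& chunk) {
    if (chunk.size() == 2)
        return isHighSurrogate(chunk[0]) && isLowSurrogate(chunk[1]);
    return !isHighSurrogate(chunk[0]) && !isLowSurrogate(chunk[0]);
}

// Escaping pass that keeps a high surrogate and the low surrogate that
// follows it together, so the check sees the whole character.
std::u16string escapeGrouped(const std::u16string& s) {
    std::u16string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::u16string chunk(1, s[i]);
        char32_t value = s[i]; // number used if the chunk must be escaped
        if (isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
            chunk.push_back(s[i + 1]);
            value = 0x10000
                  + ((static_cast<char32_t>(s[i]) - 0xD800) << 10)
                  + (static_cast<char32_t>(s[i + 1]) - 0xDC00);
            ++i; // the low surrogate has been consumed as part of the pair
        }
        if (canEncodeChunk(chunk)) {
            out += chunk;
        } else {
            out += u"&#";
            for (char d : std::to_string(static_cast<unsigned long>(value)))
                out.push_back(static_cast<char16_t>(d));
            out.push_back(u';');
        }
    }
    return out;
}
```

With this grouping, a well-formed pair such as 0xD800 0xDF30 passes through unchanged, while a genuinely malformed lone surrogate is still escaped. Escaping a rejected pair by its full code point rather than by its halves is a design choice of this sketch, not something the commit message states.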
*** Bug 180416 has been marked as a duplicate of this bug. ***