Bug 130234 - UTF-8 encoding not used for XMLHttpRequest
Summary: UTF-8 encoding not used for XMLHttpRequest
Alias: None
Product: konqueror
Classification: Applications
Component: khtml
Version: unspecified
Platform: Ubuntu Linux
Importance: NOR normal with 100 votes
Target Milestone: ---
Assignee: Konqueror Developers
Depends on:
Reported: 2006-07-04 05:32 UTC by Adam Peller
Modified: 2007-09-29 22:47 UTC

See Also:
Latest Commit:
Version Fixed In:

Proposed patch (520 bytes, patch)
2006-10-09 16:15 UTC, Apollon Oikonomopoulos

Description Adam Peller 2006-07-04 05:32:08 UTC
Version:            (using KDE KDE 3.4.3)
Installed from:    Ubuntu Packages
OS:                Linux

KHTML's XHR does not seem to use UTF-8 decoding by default, as the other browsers do (and as specified by the W3C working draft here: http://www.w3.org/TR/2006/WD-XMLHttpRequest-20060405/).

This can be seen in the test case here:

Most of the non-ASCII examples here show the strings decoded using the wrong encoding, e.g. zh-cn and zh-tw at the bottom of the page. (Some other examples, like Korean (ko), were sent as JavaScript \uxxxx escape codes and therefore render just fine.)

In the test case, Dojo uses XHR to retrieve a resource that is encoded in UTF-8, and the server specifies no encoding. The other major browsers assume UTF-8 encoding in this case.

the files loaded via XHR can be found under

Comment 1 Apollon Oikonomopoulos 2006-10-09 16:15:32 UTC
Created attachment 18068
Proposed patch

I can confirm the same behaviour. Unfortunately, for speakers of languages with
non-Latin alphabets this is a real problem, since AJAX replies are currently
decoded as ISO-8859-1 by default and end up completely garbled on screen.
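To make the symptom concrete, here is a minimal sketch of what happens when UTF-8 bytes are read as ISO-8859-1. The function name misdecodeAsLatin1 is made up for this illustration and has nothing to do with khtml's code; it maps each byte to the Latin-1 character with the same value and writes the result back out as UTF-8 so it can be displayed.

```cpp
#include <string>

// Hypothetical helper (not part of khtml): treat every input byte as one
// ISO-8859-1 character and re-encode the result as UTF-8 for display.
std::string misdecodeAsLatin1(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            // ASCII bytes mean the same thing in both encodings.
            out.push_back(static_cast<char>(b));
        } else {
            // Code points U+0080..U+00FF need two bytes in UTF-8.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

A single CJK character that occupies three bytes in UTF-8 comes out as three unrelated Latin-1 characters, which is the kind of garbling described above. Plain ASCII passes through unchanged, which is why English-only replies never show the problem.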

UTF-8 or UTF-16 marked by BOM are in general the standard encodings for XML
documents, when no explicit encoding specification is present. Apart from that,
the W3C specification of XML 1.0 (http://www.w3.org/TR/xml/#charencoding)
mandates that UTF-16 encoded XML documents always be marked with a BOM, whereas
UTF-8 may optionally have a BOM.  IMHO it should default to UTF-8, since this
is the behaviour most web applications expect. Since
khtml::Decoder::decode always looks for a BOM at the beginning of the stream,
setting the default encoding of XMLHttpRequest replies to UTF-8 guarantees that
it will always work with UTF-8, UTF-8 w/ BOM and UTF-16 w/ BOM.
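The reasoning above can be sketched roughly as follows. This is not khtml::Decoder, only a simplified stand-in: encodingFromBom and chooseEncoding are names invented for this example, and UTF-32 BOMs are deliberately not considered. The point is that a BOM, when present, decides the encoding, and the UTF-8 default only applies when there is none.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the encoding named by a leading byte order
// mark, or an empty string when the data does not start with one.
// UTF-32 BOMs are ignored in this sketch.
std::string encodingFromBom(const std::string& data) {
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(data[i]);
    };
    if (data.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return "UTF-8";
    if (data.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return "UTF-16BE";
    if (data.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return "UTF-16LE";
    return "";
}

// Hypothetical helper: a BOM wins; otherwise use the fallback, which is
// UTF-8 here to mirror the default proposed in this comment.
std::string chooseEncoding(const std::string& data,
                           const std::string& fallback = "UTF-8") {
    const std::string fromBom = encodingFromBom(data);
    return fromBom.empty() ? fallback : fromBom;
}
```

With a UTF-8 fallback, all three cases named above (UTF-8 without BOM, UTF-8 with BOM, UTF-16 with BOM) resolve correctly. With an ISO-8859-1 fallback, only the two BOM cases would.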

I'm not familiar with the internals of KDE, but the attached patch fixes the
issue for me. Still, I'm not sure whether the Decoder::DefaultEncoding constant
is the right one to use, or whether something else should be used instead.

Comment 2 George T 2006-10-10 17:16:41 UTC
*** This bug has been confirmed by popular vote. ***
Comment 3 Adam Peller 2006-10-10 19:34:38 UTC
Also, please note that content other than XML may be passed over XHR.  In Dojo's case, we pass JS which we eval, so putting a BOM at the top is not an option.  We did something far uglier for a workaround...

"it seems like you would be able to get away with: /* <?xml version="1.0" encoding="UTF-8" ?> */ in the top of your translation files"

Which appears to work as a side effect of the parser sniffing for encoding headers.
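For illustration only, this is one way such sniffing could behave. It is an assumption about the general technique, not a description of what khtml's parser actually does; sniffDeclaredEncoding and the 256-byte window are invented for this sketch. A scanner that simply looks for an encoding="..." declaration near the start of the data cannot tell that the declaration sits inside a JavaScript comment.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sniffer, not khtml's actual code: look for encoding="NAME"
// or encoding='NAME' within the first `window` bytes of the data and
// return NAME, or an empty string if no such declaration is found.
std::string sniffDeclaredEncoding(const std::string& data,
                                  std::size_t window = 256) {
    const std::string head = data.substr(0, window);
    const std::string key = "encoding=";

    const std::size_t pos = head.find(key);
    if (pos == std::string::npos) return "";

    const std::size_t quotePos = pos + key.size();
    if (quotePos >= head.size()) return "";

    const char quote = head[quotePos];
    if (quote != '"' && quote != '\'') return "";

    const std::size_t end = head.find(quote, quotePos + 1);
    if (end == std::string::npos) return "";

    return head.substr(quotePos + 1, end - quotePos - 1);
}
```

Under that assumption, a translation file beginning with the commented-out XML declaration quoted above would be reported as UTF-8, while the same file without the comment would yield nothing and fall through to the default encoding.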
Comment 4 Daniel Hahler 2007-03-22 22:51:23 UTC
Can the patch get reviewed and approved for 3.5.7?
Comment 5 Igor 2007-09-18 19:31:07 UTC
I've just tried Kubuntu 7.04 and this bug seems to be fixed there, but not on my Arch Linux install.
Comment 6 Dawit Alemayehu 2007-09-29 22:47:52 UTC
r718830 | adawit | 2007-09-29 16:20:38 -0400 (Sat, 29 Sep 2007) | 5 lines

* Default to "UTF-8" per section 2 of the draft W3C "The XMLHttpRequest Object" specification. Fixes BR# 130234


Index: xmlhttprequest.cpp
--- xmlhttprequest.cpp  (revision 657077)
+++ xmlhttprequest.cpp  (revision 718830)
@@ -674,7 +674,8 @@
     if (!encoding.isNull())
       decoder->setEncoding(encoding.latin1(), Decoder::EncodingFromHTTPHeader);
     else {
-      // FIXME: Inherit the default encoding from the parent document?
+      // Per section 2 of W3C working draft spec, fall back to "UTF-8".
+      decoder->setEncoding("UTF-8", Decoder::DefaultEncoding);
     }
   }
   if (len == 0)
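The effect of the change can be summarised with a small standalone sketch. It does not use Qt or khtml: EncodingChoice and pickResponseEncoding are hypothetical names, and an empty std::string stands in for the null encoding string tested in the patch.

```cpp
#include <string>

// Hypothetical result type for this sketch: which encoding to decode
// with, and whether it came from the HTTP response header.
struct EncodingChoice {
    std::string name;
    bool fromHttpHeader;
};

// Hypothetical stand-in for the patched branch. An empty headerCharset
// plays the role of "the server sent no charset".
EncodingChoice pickResponseEncoding(const std::string& headerCharset) {
    if (!headerCharset.empty())
        return {headerCharset, true};   // the server's charset is used
    return {"UTF-8", false};            // fallback added by r718830
}
```

Before the commit, the second branch did nothing, so the decoder kept whatever default it started with. A charset sent by the server is still honoured, which means sites that declare a different encoding are unaffected.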