Bug 130234

Summary: UTF-8 encoding not used for XMLHttpRequest
Product: [Applications] konqueror Reporter: Adam Peller <adam+kdebugs>
Component: khtml Assignee: Konqueror Developers <konq-bugs>
Status: RESOLVED FIXED    
Severity: normal CC: maksim
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: Ubuntu   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:
Attachments: Proposed patch

Description Adam Peller 2006-07-04 05:32:08 UTC
Version:            (using KDE 3.4.3)
Installed from:    Ubuntu Packages
OS:                Linux

KHTML's XHR does not seem to decode responses as UTF-8 by default, as the other browsers do (and as specified by the W3C working draft here: http://www.w3.org/TR/2006/WD-XMLHttpRequest-20060405/).

This can be seen in the test case here:
http://archive.dojotoolkit.org/nightly/tests/i18n/test_strings.html

Most of the non-ASCII examples there show strings decoded using the wrong encoding, e.g. zh-cn and zh-tw at the bottom of the page.  (Some other examples, such as Korean (ko), were sent as JavaScript \uxxxx escape codes and therefore render just fine.)

In this test, Dojo uses XHR to retrieve a resource which is encoded in UTF-8, and the server specifies no encoding.  The other major browsers assume UTF-8 encoding in this case.

The files loaded via XHR can be found under:

http://archive.dojotoolkit.org/nightly/tests/i18n/nls/*/salutations.js
Comment 1 Apollon Oikonomopoulos 2006-10-09 16:15:32 UTC
Created attachment 18068 [details]
Proposed patch

I can confirm the same behaviour. Unfortunately, for speakers of languages with
non-Latin alphabets, this can be a bit of a problem, since AJAX replies are now
by default rendered using ISO-8859-1 and end up completely garbled on screen.
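
To illustrate the effect, here is a small standalone sketch (not KHTML code;
latin1ToUtf8 is a hypothetical helper written for this example). The UTF-8
encoding of U+4F60 is the three bytes E4 BD A0. Read as ISO-8859-1, each byte
becomes its own character (U+00E4, U+00BD, U+00A0), so one CJK character turns
into three unrelated Latin-1 characters.

```cpp
#include <string>

// Hypothetical helper for illustration only.
// Decodes `bytes` as ISO-8859-1 (every byte is one code point in the range
// U+0000..U+00FF) and re-encodes the result as UTF-8 so it can be inspected.
std::string latin1ToUtf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            // ASCII range: identical in ISO-8859-1 and UTF-8.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF need two bytes in UTF-8.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

Feeding the bytes E4 BD A0 through this function yields six bytes of UTF-8
for three characters, which is what ends up on screen instead of the one
intended character.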

UTF-8 or UTF-16 marked by BOM are in general the standard encodings for XML
documents, when no explicit encoding specification is present. Apart from that,
the W3C specification of XML 1.0 ( http://www.w3.org/TR/xml/#charencoding)
mandates that UTF-16 encoded XML documents be always marked with a BOM, whereas
UTF-8 may optionally have a BOM.  IMHO it should default to UTF-8, since this
is the expected behaviour by most web applications. Since
khtml::Decoder::decode always looks for a BOM at the beginning of the stream,
setting the default encoding of XMLHttpRequest replies to UTF-8 guarantees that
it will always work with UTF-8, UTF-8 w/ BOM and UTF-16 w/ BOM.
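
The fallback order I have in mind can be sketched roughly as below. This is a
standalone illustration under my own assumptions, not the actual code of
khtml::Decoder::decode: sniffBom and pickEncoding are hypothetical names, and
only the UTF-8 and UTF-16 byte order marks are checked.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the encoding implied by a byte order mark at
// the start of `data`, or an empty string if no known BOM is present.
std::string sniffBom(const std::string& data) {
    const auto byteAt = [&data](std::size_t i) {
        return static_cast<unsigned char>(data[i]);
    };
    if (data.size() >= 3 && byteAt(0) == 0xEF && byteAt(1) == 0xBB && byteAt(2) == 0xBF)
        return "UTF-8";
    if (data.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF)
        return "UTF-16BE";
    if (data.size() >= 2 && byteAt(0) == 0xFF && byteAt(1) == 0xFE)
        return "UTF-16LE";
    return "";
}

// A BOM wins when present; otherwise the supplied default is used.
std::string pickEncoding(const std::string& data,
                         const std::string& fallback = "UTF-8") {
    assert(!fallback.empty());  // a default encoding must always be given
    const std::string fromBom = sniffBom(data);
    return fromBom.empty() ? fallback : fromBom;
}
```

With UTF-8 as the fallback, plain UTF-8, UTF-8 with a BOM and UTF-16 with a
BOM all end up with the right encoding, which is the point made above.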

I'm not familiar with the internals of KDE, but the following patch fixes the
issue for me. Still, I'm not sure about the use of the Decoder::DefaultEncoding
constant or whether something else should be used instead.

Cheers,
Apollon
Comment 2 George T 2006-10-10 17:16:41 UTC
*** This bug has been confirmed by popular vote. ***
Comment 3 Adam Peller 2006-10-10 19:34:38 UTC
Also, please note that content other than XML may be passed over XHR.  In Dojo's case, we pass JS which we eval, so putting a BOM at the top is not an option.  We did something far uglier for a workaround...

"it seems like would be able to get away with: /* <?xml version="1.0" encoding="UTF-8" ?> */ in the top of your translation files"

This appears to work as a side effect of the parser sniffing for encoding headers.
Comment 4 Daniel Hahler 2007-03-22 22:51:23 UTC
Can the patch get reviewed and approved for 3.5.7?
Comment 5 Igor 2007-09-18 19:31:07 UTC
I've just tried Kubuntu 7.04 and this bug seems to be fixed there, but it is still present on my Arch Linux system.
Comment 6 Dawit Alemayehu 2007-09-29 22:47:52 UTC
r718830 | adawit | 2007-09-29 16:20:38 -0400 (Sat, 29 Sep 2007) | 5 lines

* Default to "UTF-8" per section 2 of the draft W3C "The XMLHttpRequest Object" specification. Fixes BR# 130234

BUG:130234

Index: xmlhttprequest.cpp
===================================================================
--- xmlhttprequest.cpp  (revision 657077)
+++ xmlhttprequest.cpp  (revision 718830)
@@ -674,7 +674,8 @@
     if (!encoding.isNull())
       decoder->setEncoding(encoding.latin1(), Decoder::EncodingFromHTTPHeader);
     else {
-      // FIXME: Inherit the default encoding from the parent document?
+      // Per section 2 of W3C working draft spec, fall back to "UTF-8".
+      decoder->setEncoding("UTF-8", Decoder::DefaultEncoding);
     }
   }
   if (len == 0)
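
For readers without the KHTML sources at hand, the decision made by the
patched branch can be approximated by the standalone sketch below. It is not
the KHTML implementation: charsetFromContentType and responseEncoding are
hypothetical names, and the simple header parsing is an assumption made for
the example (in the real code, `encoding` is already available when this
branch runs).

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: extracts the charset parameter from a Content-Type
// value such as "text/javascript; charset=ISO-8859-1".
// Returns an empty string if no charset is declared.
std::string charsetFromContentType(const std::string& contentType) {
    // Parameter names are case-insensitive, so search a lowercased copy.
    std::string lower(contentType);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    const std::size_t pos = lower.find(key);
    if (pos == std::string::npos)
        return "";

    const std::size_t begin = pos + key.size();
    std::size_t end = contentType.find_first_of("; \t", begin);
    if (end == std::string::npos)
        end = contentType.size();

    std::string value = contentType.substr(begin, end - begin);
    // Strip optional surrounding double quotes.
    if (value.size() >= 2 && value.front() == '"' && value.back() == '"')
        value = value.substr(1, value.size() - 2);
    return value;
}

// Mirrors the patched logic: use the declared charset when there is one,
// otherwise fall back to UTF-8.
std::string responseEncoding(const std::string& contentType) {
    const std::string declared = charsetFromContentType(contentType);
    return declared.empty() ? "UTF-8" : declared;
}
```

A response served as plain "text/javascript" therefore resolves to UTF-8,
while one that declares a charset keeps the declared value.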