Bug 42683 - failure to detect XML document encoding
Status: RESOLVED FIXED
Alias: None
Product: konqueror
Classification: Applications
Component: khtml xml
Version: 3.0.1
Platform: Compiled Sources Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Konqueror Developers
URL:
Keywords:
Duplicates: 19870 43428 70877
Depends on:
Blocks:
 
Reported: 2002-05-16 12:03 UTC by mortehu
Modified: 2005-04-11 08:22 UTC

Description mortehu 2002-05-16 11:58:42 UTC
(*** This bug was imported into bugs.kde.org ***)

Package:           khtml
Version:           3.0.1 (CVS >= 20020327) (using KDE 3.0.0 )
Severity:          normal
Installed from:    Compiled From Sources
Compiler:          Not Specified
OS:                Linux
OS/Compiler notes: Not Specified

KHTML doesn't seem to honor the encoding attribute of the <?xml?> declaration. The following document (also available at http://www.stud.ifi.uio.no/~mortehu/utf-8.html) demonstrates this.

(The following is encoded in ISO-8859-1)

<?xml version="1.0" encoding="utf-8"?>
<html>
  <head>
    <title>UTF-8 test</title>
  </head>
  <body>
    ø  should appear &oslash;
  </body>
</html>


(Submitted via bugs.kde.org)
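
For reference, a minimal sketch of what honoring the declared encoding could look like. This is not KHTML's code; the function name `sniff_xml_encoding` and its `default` parameter are hypothetical, and the sketch only handles ASCII-compatible encodings (it does not attempt UTF-16 or UTF-32 detection).

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding attribute, if one is present.
_XML_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*(?:"[^"]*"|\'[^\']*\')'
    rb'(?:\s+encoding\s*=\s*'
    rb'(?:"([A-Za-z][A-Za-z0-9._-]*)"|\'([A-Za-z][A-Za-z0-9._-]*)\'))?'
)


def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or `default`.

    Hypothetical helper for illustration only.
    """
    # A UTF-8 byte order mark may precede the declaration; skip it.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    match = _XML_DECL.match(data)
    if match is None:
        return default
    # Group 1 is the double-quoted form, group 2 the single-quoted form.
    name = match.group(1) or match.group(2)
    return name.decode("ascii").lower() if name else default


# Usage: decode the test document with the encoding it declares.
doc = b'<?xml version="1.0" encoding="utf-8"?>\n<html><body>\xc3\xb8</body></html>'
text = doc.decode(sniff_xml_encoding(doc))
```

With the declared encoding applied, the two bytes `0xC3 0xB8` in the body decode to the single character ø instead of two Latin-1 characters.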
Comment 1 Anguo 2002-09-27 00:09:35 UTC
*** Bug 43428 has been marked as a duplicate of this bug. ***
Comment 2 Daniel Naber 2002-10-02 00:58:09 UTC
*** Bug 19870 has been marked as a duplicate of this bug. ***
Comment 3 oever 2002-12-06 14:12:11 UTC
The page http://www.stud.ifi.uio.no/~mortehu/utf-8.html is gone. Can somebody
retrieve it and add it as an attachment to this bug?
Comment 4 Morten Hustveit 2002-12-06 18:11:19 UTC
Sorry, I just re-uploaded it. 
Comment 5 paul 2003-07-29 12:44:09 UTC
This bug is still present in 3.0.5, or at least in the Red Hat package
kdebase-3.0.5a, specifically kdebase-3.0.5a-0.73.2:6.i386.rpm. As described, if
the encoding is set to "UTF-8", Konqueror displays characters outside the "ASCII
range" as either ISO-8859-1 or ISO-8859-15 (probably using my locale settings).
Mozilla-based browsers (such as Mozilla 1.3.x and 1.4) don't exhibit this behaviour.

(Seems that this bug has reached its voting limit or I'd vote for it as well.)
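
The symptom described above can be reproduced outside the browser. The following is a small sketch, assuming the page bytes are UTF-8 and the renderer falls back to a Latin charset; the helper name `misdecode` is made up for this illustration.

```python
def misdecode(text: str, wrong_charset: str = "iso-8859-1") -> str:
    """Encode `text` as UTF-8, then decode the bytes with the wrong charset.

    Hypothetical helper: it mimics a renderer that ignores the declared
    UTF-8 encoding and falls back to a locale charset.
    """
    return text.encode("utf-8").decode(wrong_charset)


# "ø" is U+00F8, which UTF-8 encodes as the two bytes 0xC3 0xB8.
as_latin1 = misdecode("ø")                 # bytes read as ISO-8859-1
as_latin9 = misdecode("ø", "iso-8859-15")  # bytes read as ISO-8859-15
```

Each non-ASCII character turns into two characters. The two charsets agree on byte `0xC3` but differ on `0xB8`, so the second character of the garbled pair can hint at which fallback charset was used.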
Comment 6 Thiago Macieira 2003-07-29 13:29:21 UTC
KDE 3.0 isn't developed anymore. But this bug is still present on KDE CVS HEAD. 
Comment 7 Paul Hoepfner-Homme 2004-03-27 01:38:58 UTC
Here's a practical example where this bug is very obvious in Konqueror:

http://www.catb.org/~esr/jargon/

Here are some specific pages where this bug appears:

http://www.catb.org/~esr/jargon/html/speech-style.html
http://www.catb.org/~esr/jargon/html/inarticulations.html
http://www.catb.org/~esr/jargon/html/p-convention.html (very annoying)
Comment 8 Jose Hernandez 2004-04-10 05:38:38 UTC
gcc version 3.2.3 20030422 (Gentoo Linux 1.4 3.2.3-r3, propolice)

Some XML pages can be viewed correctly, I've noticed, by switching manually to UTF-8 in Konq (a quick fix until this bug is fixed). I had been trying to view roll call votes from the US Congress (http://clerk.house.gov/evs/2003/index.asp) and that led me here. It uses an XSL page that is in UTF-8, so switching manually in Konqueror won't work, apparently. Glad to see someone has narrowed it down, at least.

These look like possible duplicates: Bug 42683, Bug 77933.
Comment 9 Seva Gluschenko 2004-08-23 12:42:53 UTC
The clearest and most persistent example of the XML parsing problem is the http://market.yandex.ru/ site. While it displays normally in Mozilla, Konqueror reports "unexpected end of data. Some information may be lost" (or something like that, my translation from Russian).

Konqueror 3.2.x displayed the page regardless of that warning but failed to process the search form on the page. Konqueror 3.3.0 fails to display the page at all unless the encoding is set manually to cp1251.
Comment 10 Allan Sandfeld 2004-12-08 02:32:53 UTC
CVS commit by carewolf: 

Merge encoding detection improvements from WebCore
BUG: 42683


  M +7 -0      ChangeLog   1.351
  M +16 -8     khtml_part.cpp   1.1058
  M +1 -1      ecma/xmlhttprequest.cpp   1.10
  M +179 -34   misc/decoder.cpp   1.74
  M +11 -2     misc/decoder.h   1.21



Comment 11 Mario Weilguni 2005-04-11 08:22:29 UTC
*** Bug 70877 has been marked as a duplicate of this bug. ***