Bug 102307 - Regression: Bad encoding detection
Summary: Regression: Bad encoding detection
Status: RESOLVED NOT A BUG
Alias: None
Product: konqueror
Classification: Applications
Component: khtml parsing
Version: unspecified
Platform: Compiled Sources Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Konqueror Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-03-23 19:58 UTC by Sebastien
Modified: 2008-12-04 23:14 UTC

See Also:
Latest Commit:
Version Fixed In:


Description Sebastien 2005-03-23 19:58:34 UTC
Version:            (using KDE 3.4.0)
Installed from:    Compiled From Sources

Accented characters at http://dukez.patapouf.org:8000/blog/2005/ are displayed incorrectly.

My system encoding is ISO-8859-15.
The page encoding is ISO-8859-15.
The XML declaration specifies the encoding as ISO-8859-15 (line 2).

Yet automatic encoding detection still displays the page as UTF-8 or some other strange encoding. It mangles one character after each accented character, so it seems to assume that special characters are 2 bytes long: UTF-8?

Is it specific to XML encoding?
Is it a regression since the last KDE release?
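
Illustration of the symptom described above, as a minimal Python sketch. The sample string is hypothetical and not taken from the page, and Python's decoder is only a stand-in, not what KHTML uses. It shows that an accented character encoded as ISO-8859-15 is a single byte that is not well-formed UTF-8 on its own, which is why a page decoded with the wrong charset breaks at exactly those characters.

```python
# Hypothetical sample text ("café au lait"); not taken from the page in question.
sample = "caf\u00e9 au lait"

# In ISO-8859-15 the "é" is the single byte 0xE9.
raw = sample.encode("iso-8859-15")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xE9 is a UTF-8 lead byte that must be followed by continuation bytes,
# but here it is followed by a plain space, so strict decoding fails.
print(decodes_as_utf8(raw))

# Decoding with the charset the bytes were written in round-trips cleanly.
print(raw.decode("iso-8859-15") == sample)
```

How a browser recovers from such a byte (dropping it, replacing it, or swallowing the next character) depends on its decoder.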
Comment 1 Allan Sandfeld 2005-03-23 20:13:18 UTC
It is because your page specifies UTF-8 to be the encoding:
<?xml version="1.0" encoding="UTF-8"?>

KDE 3.3 could not read XML headers, but 3.4 can, and it now follows your directive.
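
For illustration only, reading the encoding from an XML header like the one quoted above could look roughly like the Python sketch below. The function name `xml_declared_encoding` and the simplified pattern are assumptions made for this example. This is not KHTML's code, and it only handles ASCII-compatible input.

```python
import re
from typing import Optional

# Simplified pattern (an assumption for this sketch): the declaration must sit
# at the very first byte of the document and the bytes must be ASCII-compatible.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declared_encoding(head: bytes) -> Optional[str]:
    """Return the encoding named in a leading XML declaration, or None."""
    match = _XML_DECL.match(head)
    return match.group(1).decode("ascii") if match else None


print(xml_declared_encoding(b'<?xml version="1.0" encoding="UTF-8"?>'))
# A declaration that is not at the very start of the input is not recognised.
print(xml_declared_encoding(b'\n<?xml version="1.0" encoding="UTF-8"?>'))
```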
Comment 2 Sebastien 2005-03-23 20:49:32 UTC
Sorry,

My friend changed the header AFTER I posted the bug and BEFORE you visited it.

So at first he had not included any encoding, and the page was interpreted as UTF-8!
Then he changed it to this:
<?xml version="1.0" encoding="ISO-8859-1"?>
and it didn't work.
(Then he changed it to UTF-8, which did not work, of course.)
Then he changed it back to encoding="ISO-8859-1", but it still doesn't work.

Can this bug be reopened?

Perhaps it's because there is a blank line before the XML header?
Comment 3 Allan Sandfeld 2005-03-23 20:58:15 UTC
Sure
Comment 4 Allan Sandfeld 2005-03-23 21:10:44 UTC
Okay, I found the source of the problem then. The web server claims that the encoding is "UTF-8" by sending this HTTP header:
Content-type: text/html; charset="utf-8"

So we could fix it by letting the XML-specified encoding take precedence over the HTTP-specified one. Of course, you should fix the web server in any case.

How do other browsers react? (I want to know the preferences of IE, Mozilla and Opera.)
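
For reference, the charset carried by a Content-type header like the one above can be extracted with Python's standard library, as in this minimal sketch. The helper name `http_declared_charset` is an assumption made for the example, and this is not how Konqueror parses the header.

```python
from email.message import Message
from typing import Optional


def http_declared_charset(content_type: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() strips the quotes and lower-cases the value.
    return msg.get_content_charset()


print(http_declared_charset('text/html; charset="utf-8"'))
print(http_declared_charset("text/html"))
```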
Comment 5 Sebastien 2005-03-23 21:37:02 UTC
That's a tie.

Mozilla interprets it as ISO-8859-1. Good.
Opera interprets it as UTF-8. Bad.

I don't want to reboot now to test with IE.
Anybody?
Comment 6 Thiago Macieira 2005-03-24 02:27:43 UTC
If I'm not mistaken, the recommended order is XML header -> HTTP header -> META tag.
Comment 7 Dotan Cohen 2008-07-27 21:44:53 UTC
As far as I understand, the recommended order for web browsers is the HTTP header, and that's it. You can't even read the page if the HTTP header doesn't specify what encoding it is in, and parsing for an ASCII-compatible document is not the web browser's job.

The XML header is for XML parsers that are _not_ web browsers, which could have any reason for parsing the page. The META tag is only used for pages saved locally, for which there is no HTTP header.
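
The disagreement in this report comes down to which declaration wins when several are present. A minimal Python sketch of that choice follows. The function `pick_encoding`, the source labels and the sample values are assumptions for illustration, with the values mirroring the HTTP and XML declarations quoted in the comments above.

```python
from typing import Mapping, Optional, Sequence


def pick_encoding(
    declared: Mapping[str, Optional[str]], order: Sequence[str]
) -> Optional[str]:
    """Return the first declared encoding, checking sources in the given order."""
    for source in order:
        encoding = declared.get(source)
        if encoding:
            return encoding
    return None


# Declarations as discussed in this report: the server says utf-8,
# the XML header says ISO-8859-1, and no META tag is assumed.
declared = {"http": "utf-8", "xml": "ISO-8859-1", "meta": None}

# HTTP header first.
print(pick_encoding(declared, ("http", "xml", "meta")))
# XML header first.
print(pick_encoding(declared, ("xml", "http", "meta")))
```

With the HTTP header first the page is treated as UTF-8. With the XML header first it is treated as ISO-8859-1.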
Comment 8 Dotan Cohen 2008-12-04 23:14:07 UTC
This bug is invalid, as outlined in comment #7.