Bug 108208

Summary: wrong charset encoding with html page that begins without <html>
Product: [Applications] konqueror
Reporter: GML <gmludo>
Component: khtml parsing
Assignee: Konqueror Developers <konq-bugs>
Status: RESOLVED UNMAINTAINED
Severity: normal
CC: kde, krase, maksim, rudo, sibskull
Priority: NOR    
Version: 3.4   
Target Milestone: ---   
Platform: unspecified   
OS: Linux   
Attachments: decoder more tolerant

Description GML 2005-06-27 12:24:08 UTC
Version:           3.4.0 (using KDE 3.4.0, Debian Package 4:3.4.0-0ubuntu3.2 (3.1))
Compiler:          gcc version 3.3.5 (Debian 1:3.3.5-8ubuntu2)
OS:                Linux (i686) release 2.6.10-5-686

When some characters appear before the HTML code at the top of a web page, Konqueror does not read <meta http-equiv="Content-Type" content="text/HTML; charset=UTF-8" /> and uses ISO8859-1 as the charset instead.
Firefox reads the charset correctly when some characters precede the HTML.
Comment 1 Thiago Macieira 2005-06-28 03:52:05 UTC
Can you give us a test case?

Konqueror enforces correctness when searching for the <meta> tag. It must be inside <head>, for instance, so it will stop processing if it sees a non-<head> tag.
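
For illustration, the behaviour described above can be sketched roughly as follows. This is not KHTML's code: the function name, the regular expressions and the set of "head" tags are assumptions made for the example, and the regex-based tag matching is a simplification (it does not handle comments or script contents the way a real tokenizer would).

```python
import re

# Tags allowed before the scan gives up. This set is an assumption for
# illustration, not KHTML's actual list.
HEAD_TAGS = {"html", "head", "title", "meta", "link", "script", "style", "base"}

# Very rough tag matcher: optional "/", tag name, then everything up to ">".
TAG_RE = re.compile(rb"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([a-zA-Z0-9_.:-]+)", re.I)


def strict_head_charset(data: bytes):
    """Return the charset declared by a <meta> tag, or None.

    The scan stops at the first tag that does not belong to the header.
    """
    for match in TAG_RE.finditer(data):
        closing, name, attrs = match.groups()
        name = name.lower().decode("ascii")
        if name not in HEAD_TAGS:
            return None  # displayable content has started: give up
        if name == "meta" and not closing:
            found = CHARSET_RE.search(attrs)
            if found:
                return found.group(1).decode("ascii")
    return None
```

With this sketch, a page that starts with <html><head><meta ...> yields the declared charset, while the same page with a <table> placed in front of it yields None, because the scan gives up at the first non-header tag.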
Comment 2 Sebastian Kratzert 2006-01-11 10:30:53 UTC
Every "View as HTML" link of Google can be used as test case.
One example:
http://66.102.9.104/search?q=cache:JTl6CLhE8ZcJ:www.testdaf.de/dokumente/anmeldung.pdf+test+dokument&hl=de

Sadly, they simply put a <table> before the output of their converted document, which has a correct <meta http-equiv="Content-Type"...
The table prevents Konqueror from finding the meta tags.
Comment 3 Thiago Macieira 2006-01-12 03:45:08 UTC
I can see the problem, but I'm not sure if we can fix this reasonably. 

The reasonable thing is to scan the start of the document until we can be sure we won't find it. Then we have to start showing the document to the user. What we currently implement is scanning the HTML header, which isn't shown anyway. When we see content that has to be shown, we give up.
Comment 4 Sebastian Kratzert 2006-01-12 10:37:51 UTC
Maybe we should do it like Firefox.
They search up to the first 2048 bytes of the page for a <meta> tag containing "charset", regardless of any other tags.
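
A minimal sketch of that kind of prescan, assuming the 2048-byte window mentioned above. This is not Firefox's code; the function name, the regular expressions and the `limit` parameter are hypothetical and only illustrate the idea of ignoring the surrounding tags.

```python
import re

# Any <meta ...> tag, wherever it appears in the scanned window.
META_RE = re.compile(rb"<meta\b[^>]*>", re.I)
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([a-zA-Z0-9_.:-]+)", re.I)


def prescan_charset(data: bytes, limit: int = 2048):
    """Look for a <meta> tag containing "charset" in the first `limit` bytes.

    Other tags are ignored entirely, so a leading <table> does not matter.
    """
    window = data[:limit]
    for meta in META_RE.finditer(window):
        found = CHARSET_RE.search(meta.group(0))
        if found:
            return found.group(1).decode("ascii")
    return None
```

The trade-off is that a <meta> tag located after the window is never seen, so the window size decides how much leading markup a page may carry.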
Comment 5 Tommi Tervo 2006-01-13 13:47:10 UTC
*** Bug 120036 has been marked as a duplicate of this bug. ***
Comment 6 Sebastian Kratzert 2006-01-14 19:09:21 UTC
Created attachment 14253
decoder more tolerant

This simple change makes the mentioned google pages work.
Google adds nearly 80 tags before the header, so we must allow skipping that many
tags to find the meta tag.
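
The attachment itself is not reproduced here, so the following is only a sketch of the general idea (tolerate a bounded number of tags before giving up), not the patch. The function name, the default budget of 80, the regular expressions and the set of header tags are all assumptions for illustration.

```python
import re

# Tags that do not count against the budget. This set is an assumption,
# not KHTML's actual list.
HEAD_TAGS = {"html", "head", "title", "meta", "link", "script", "style", "base"}

# Rough tag matcher: tag name, then everything up to ">".
TAG_RE = re.compile(rb"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([a-zA-Z0-9_.:-]+)", re.I)


def tolerant_charset(data: bytes, max_skipped_tags: int = 80):
    """Return the declared charset, skipping up to `max_skipped_tags`
    non-header tags (opening and closing tags both count) before giving up."""
    skipped = 0
    for match in TAG_RE.finditer(data):
        name = match.group(1).lower().decode("ascii")
        if name == "meta":
            found = CHARSET_RE.search(match.group(2))
            if found:
                return found.group(1).decode("ascii")
        elif name not in HEAD_TAGS:
            skipped += 1
            if skipped > max_skipped_tags:
                return None  # too much displayable markup: give up
    return None
```

Setting `max_skipped_tags=0` gives the strict behaviour again, which makes it easy to compare both on the same page.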
Comment 7 Rudo Thomas 2006-09-06 01:36:57 UTC
The proposed patch does not seem to help. Have you actually tested it? :)
Comment 8 Sebastian Kratzert 2006-09-06 23:43:44 UTC
Yes, I tested it and it worked. But Google seems to have changed their pages a bit since then.
The best thing is, they finally put a <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> line in front of their converter output. So at least "View as HTML" works now ;)
Comment 9 Mike Williams 2007-01-07 19:47:14 UTC
This is still broken for me. To reproduce, open the original test case and view the HTML source. There are two <meta> tags specifying the encoding: one at the top, added by Google, and one in the original <head> section. Remove the first <meta> tag at the top of the page, save the document locally, and open it again. Konqueror still fails to detect the second tag correctly and defaults to the wrong charset.
Comment 10 Mike Williams 2007-01-08 03:42:08 UTC
Reproducible. Whether the konqueror devs think this should be "fixed", I'll leave up to them.

Comment 11 Janek Bevendorff 2012-06-18 17:18:28 UTC
Message from the Bugsquad and Konqueror teams: This bug is closed as outdated, as we do not have the manpower to maintain the KDE3 version anymore. If you still can reproduce this issue with Konqueror 4.8.4 or later, please open a new report. Thank you for your understanding.