Bug 199657 - Wrong encoding detection for web-page
Summary: Wrong encoding detection for web-page
Status: RESOLVED WORKSFORME
Alias: None
Product: konqueror
Classification: Applications
Component: general
Version: unspecified
Platform: unspecified Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Konqueror Developers
URL:
Keywords:
Duplicates: 212631
Depends on:
Blocks:
 
Reported: 2009-07-10 12:52 UTC by Andrey Cherepanov
Modified: 2010-01-29 14:15 UTC
CC List: 4 users

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
incorrect encoding (81.97 KB, image/png)
2009-07-22 14:54 UTC, anton

Description Andrey Cherepanov 2009-07-10 12:52:02 UTC
Version:           4.2.4 (KDE 4.2.4) (using 4.2.4 (KDE 4.2.4), ALT Linux i586)
Compiler:          gcc
OS:                Linux (i686) release 2.6.30-std-def-alt1

Possibly a regression from 4.2.3. Encoding is set to Default. When I open a web page (e.g. http://www.linux.org.ru/view-message.jsp?msgid=3845915&lastmod=1246883309772&page=5) from a bookmark in Konqueror, it is shown in the wrong encoding (possibly the default encoding cp1251 instead of the UTF-8 given in the HTTP header). After a page reload it displays correctly.
Comment 1 Maksim Orlovich 2009-07-10 15:49:13 UTC
Hmm, works here, with a 4.2.4'ish build.
Comment 2 anton 2009-07-22 14:54:23 UTC
Created attachment 35545
incorrect encoding

Same here: version 4.2.96 (KDE 4.2.96 (KDE 4.3 RC2)) "release 142".

It was OK in the first two 4.3 betas and probably in RC1.

The problem appears only after loading recently visited pages:

1. Open google.com.
2. Type "hello" and press "search". The result looks OK.
3. Without leaving the search result page, press "search" again. The result looks like the attached screenshot.
4. Press "reload". The page becomes OK again.

This also happens on other sites.

Also, I am located in Russia, so Google returns results in Russian, which are displayed incorrectly. On an English-only web page the problem would not be visible.
Comment 3 Matt Whitlock 2009-08-01 14:44:01 UTC
This problem is also affecting 4.2.98 (4.3 RC3).

Some more information:

When visiting a certain web page, the server sends back these response headers:

HTTP/1.1 200 OK
Date: Sat, 01 Aug 2009 12:24:12 GMT
Server: Apache/2.2.3 (Red Hat)
Cache-control: must-revalidate
Content-Length: 75636
Set-Cookie: ** blanked out for privacy **
Vary: Accept-Encoding,User-Agent
Connection: close
Content-Type: text/html;charset=utf-8

Konqueror does not respect the "charset" parameter of the Content-Type header and displays the page in ISO-8859-1, which results in lots of garbage characters.  Interestingly, the "View Document Information" dialog (from the "View" menu) shows nothing under "HTTP Headers."

If the encoding is forced to "Unicode|Autodetect," the page reloads and displays correctly, and then the document information shows all the HTTP response headers as it should.

Interestingly, at this point the encoding can be set back to "Default," and the page will still display correctly, and the document information still shows the HTTP response headers.  However, navigating to another page by clicking a link results in the incorrect behavior again, and attempting to switch back to "Unicode|Autodetect" no longer corrects the problem.
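
For illustration only, here is a minimal Python sketch (not Konqueror or KIO code) of what respecting the header involves: extracting the charset parameter from a Content-Type value like the one in the response above. The function name charset_from_content_type and the ISO-8859-1 fallback are assumptions made for this example.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or the default.

    Hypothetical helper for illustration; the fallback is an assumption.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


# The header value from the response quoted above.
print(charset_from_content_type("text/html;charset=utf-8"))
# With no charset parameter (or no header at all) only the fallback is left.
print(charset_from_content_type("text/html"))
```

If the headers never reach the renderer, as the empty "HTTP Headers" field suggests, a lookup like this has nothing to work with and falls back to the default.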
Comment 4 Matt Whitlock 2009-08-01 14:56:12 UTC
Come to think of it, this is probably related to bug 200789, the problem where Konqueror does not process HTTP 30* redirects.  Maybe KIO was/is not handing off the HTTP response headers reliably?
Comment 5 Igor Strelnikoff 2009-11-25 21:03:28 UTC
*** Bug 212631 has been marked as a duplicate of this bug. ***
Comment 6 Igor Strelnikoff 2009-11-25 21:06:23 UTC
This problem is also present in 4.3.3.
Comment 7 Igor Strelnikoff 2009-11-25 21:14:36 UTC
locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=

Steps to reproduce the bug:

1. Open a tab with the URL
"http://www.linux.org.ru/view-message.jsp?msgid=4193884&lastmod=1257189936108"
Konqueror correctly sets the character set to "UTF-8".
2. Open a new tab with the same URL (as above).
Konqueror sets the character set to "ISO-8859-1".
Comment 8 RGBl 2009-12-16 23:41:30 UTC
I can reproduce the problem with the link in comment #7: sometimes it loads OK, most of the time it does not, and a reload always fixes the rendering.
This also happens with the Spanish characters on my website. For example, it shows "pingÃ¼inos" instead of "pingüinos", "FotografÃ­a" instead of "Fotografía" and a lot of "Â" here and there. As with the Russian site, sometimes it loads OK, most of the time it fails, and a reload always fixes it.
Konqueror 4.3.1 release 6 on openSUSE 11.2, 64-bit. Encoding is set to "predefined".
No problem with rekonq or Firefox.
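
The garbling described here is what results when UTF-8 bytes are decoded as ISO-8859-1. A small Python sketch reproduces it; the helper name misdecode is chosen only for illustration.

```python
def misdecode(text, wrong_charset="iso-8859-1"):
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_charset)


word = "pingüinos"
garbled = misdecode(word)

# "ü" is two bytes in UTF-8 (0xC3 0xBC); read as ISO-8859-1 they become
# two separate characters. Print code points so nothing invisible is lost.
print([hex(ord(ch)) for ch in garbled[4:6]])  # ['0xc3', '0xbc']

# Decoding the same bytes as UTF-8 restores the original text.
print(word.encode("utf-8").decode("utf-8") == word)  # True
```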
Comment 9 Andrey Cherepanov 2009-12-17 11:09:12 UTC
There is no problem with encoding in Konqueror 4.3.4. Please, check this version and I will close this bug.
Comment 10 Māris Nartišs 2010-01-29 14:15:17 UTC
(In reply to comment #9)
> There is no problem with encoding in Konqueror 4.3.4. Please, check this
> version and I will close this bug.

I'm still able to reproduce this issue with 4.3.5 (somebody with enough karma, please reopen this bug). The problem is that the second time, Konqueror takes the page from its cache. Any page whose head element lacks a META tag with the encoding gets the default encoding instead of the one provided by the web server, because the original page's HTTP header is lost.

Solution: Konqueror's cache mechanism should cache the HTTP header data as well, not only the page content.
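
A rough Python sketch of the proposed fix, not Konqueror's actual cache code: each cache entry keeps the response headers next to the body, so a page served from cache can be decoded with the charset the server originally sent. CacheEntry, PageCache and decode_page are hypothetical names, and ISO-8859-1 is assumed as the default encoding.

```python
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    """Cached response: the body plus the headers it arrived with."""
    body: bytes
    headers: dict = field(default_factory=dict)


class PageCache:
    """Toy in-memory cache keyed by URL (hypothetical, for illustration)."""

    def __init__(self):
        self._entries = {}

    def store(self, url, body, headers):
        # Keep a copy of the headers instead of discarding them.
        self._entries[url] = CacheEntry(body, dict(headers))

    def load(self, url):
        return self._entries.get(url)


def decode_page(entry, default="iso-8859-1"):
    """Decode a cached body using the charset from its cached headers."""
    charset = default
    content_type = entry.headers.get("Content-Type", "")
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            charset = value.strip('"')
    return entry.body.decode(charset, errors="replace")


cache = PageCache()
body = "Привет".encode("utf-8")

# Headers cached with the body: the second load decodes correctly.
cache.store("http://example.org/a", body,
            {"Content-Type": "text/html;charset=utf-8"})
print(decode_page(cache.load("http://example.org/a")) == "Привет")  # True

# Headers lost: the default encoding is used and the text is garbled.
cache.store("http://example.org/b", body, {})
print(decode_page(cache.load("http://example.org/b")) == "Привет")  # False
```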