Version: (using KDE 3.2.1)
Installed from: Gentoo Packages
Compiler: gcc 3.3 CXXFLAGS=CFLAGS=-O3 -march=athlon-xp
OS: Linux

This is quite important to me, because I need full Unicode support for daily computer work, especially German umlauts and Japanese characters on the same webpage or in the same text. That was exactly my situation when posting into a forum today. Unfortunately, the resulting text contained the KDE-typical question marks instead of the Kanji I entered. A quick investigation revealed that the encoding of the HTML file containing the post form was set to ISO-8859-1 (http-equiv tag). This led me to conclude that Konqueror/KHTML assumes all text entered by the user must be in the same encoding as the page. Even forcing the page to use UTF-8 (via View->Set Encoding) had no effect. When I tried the same with Mozilla 1.6, it silently converted the non-ISO-8859-1 text into Unicode entities, as it should (Mozilla was set to UTF-8 mode explicitly). I assume this feature is available in other browsers too (I haven't tried yet, though), so I think you might consider adding it to Konqueror very soon.
I can second this request, and I think it is a severe bug rather than a mere wish. At least Mozilla and Opera convert characters that do not fit into the charset of the page containing the form to Unicode entities. For example, if the page is ISO-8859-1, a euro sign becomes "&#8364;" (which, in the case of a GET submit type, gets further escaped to %26%238364%3B in the URL). This way no information is lost, and the user will most likely see exactly the character he entered on the resulting page (unless the server explicitly does something to break it). The current Konqueror behavior leads to data loss without notifying the user, and furthermore assumes some character set knowledge from the user, which it shouldn't. Users shouldn't need to know what ISO-8859-1, Unicode or UTF-8 are; they should only see things working properly.
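To make the two escaping steps described above concrete, here is a minimal sketch in plain C++ (no Qt). It is only an illustration under stated assumptions, not what Mozilla, Opera or KHTML actually run: the names escapeForLatin1 and percentEncode are hypothetical, the page charset is assumed to be strict ISO-8859-1 (code points U+0000 to U+00FF), and the percent-encoding is simplified (for instance, it does not turn spaces into "+").

```cpp
#include <string>

// Hypothetical helper: keep code points that strict ISO-8859-1 can
// represent (U+0000..U+00FF) as single bytes, and replace everything
// else with a decimal numeric character reference such as "&#8364;".
std::string escapeForLatin1(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp <= 0xFF) {
            out.push_back(static_cast<char>(cp));
        } else {
            out += "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
        }
    }
    return out;
}

// Hypothetical helper: simplified percent-encoding for a GET query value.
// Only ASCII letters, digits and "-._~" are left untouched.
std::string percentEncode(const std::string& bytes) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        bool unreserved = (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') ||
                          (b >= '0' && b <= '9') ||
                          b == '-' || b == '.' || b == '_' || b == '~';
        if (unreserved) {
            out.push_back(static_cast<char>(b));
        } else {
            out.push_back('%');
            out.push_back(hex[b >> 4]);
            out.push_back(hex[b & 0x0F]);
        }
    }
    return out;
}
```

With these helpers, a euro sign (U+20AC) first becomes the seven ASCII characters of its numeric reference, and percent-encoding that reference gives the %26%238364%3B form mentioned above.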
Sorry, forgot to say, I have KDE 3.3.
Wow, someone saw that bug report at last! I hope I can draw the KDE developers' attention to it, so we won't have to wait much longer for a fix.
Created an attachment (id=7837): Patch for ampersand-escaping characters

The attached patch should be applied to khtml/html. I have not fully tested it, but try it and see if it helps.
Works nicely for me, thanks. Just a small question... I'm not familiar with the Unicode handling of Qt, but .unicode() returning an unsigned short really shocked me, as Unicode definitely has characters above 65535. Is anything known about how this will be handled in future versions of Qt? Will a new function be introduced, or will the return value of .unicode() be extended to unsigned long? In the latter case, I'd recommend using ampersandEscape.sprintf("&#%lu;", (unsigned long)(c.unicode())); instead of ampersandEscape.sprintf("&#%hu;", c.unicode()); it doesn't hurt, but works better if Qt changes to UCS-4.
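For what it's worth, a 16-bit code unit can still carry characters above U+FFFF if the string is UTF-16, where such characters occupy two units (a surrogate pair). Below is a minimal sketch in plain C++ of how an escaping loop could combine a pair before writing the numeric reference. It is an illustration only: it does not use Qt, the name escapeUtf16 is hypothetical, and for simplicity it assumes the target charset is US-ASCII, so every non-ASCII character is escaped.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: walk a UTF-16 string, combine surrogate pairs into
// one code point, and write every non-ASCII code point as a decimal
// numeric character reference.
std::string escapeUtf16(const std::u16string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t cp = s[i];
        bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        // A high surrogate followed by a low surrogate encodes one
        // code point in the range U+10000..U+10FFFF.
        if (isHigh && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else {
            out += "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
        }
    }
    return out;
}
```

Escaping each 16-bit unit separately would instead produce two references in the surrogate range, which is not what the user typed. Whether a given Qt version stores such characters as surrogate pairs is something to check against its documentation.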
CVS commit by carewolf:

Ampersand-escape otherwise unencodable characters. Matches Gecko behavior.
FEATURE:82018

  M +6 -2 ChangeLog 1.303
  M +25 -7 html/html_formimpl.cpp 1.387 [POSSIBLY UNSAFE: printf]

--- kdelibs/khtml/ChangeLog #1.302:1.303
@@ -1,2 +1,6 @@
+2004-10-16 Allan Sandfeld Jensen <kde@carewolf.com>
+        * html/html_formimpl.cpp: Escape otherwise unencodable characters.
+        Matches the behavior of Gecko.
+
 2004-10-15 Stephan Kulow <coolo@kde.org>

@@ -4,5 +8,5 @@
         got items when we calculate a height for items (#87466)

-        * css/html4.css: changing default horizontal margins for H1-H6 from
+        * css/html4.css: changing default horizontal margins for H1-H6 from
         auto to 0 (#91327)

@@ -33,5 +37,5 @@
         * rendering/render_block.cpp (layoutBlockChildren): simpler
         implementation for compact display: do not insert the
-        compact child within the next block anymore.
+        compact child within the next block anymore.
         Solves lot of problems with host blocks having non-inline children.

--- kdelibs/khtml/html/html_formimpl.cpp #1.386:1.387
@@ -174,7 +174,25 @@ static QCString encodeCString(const QCSt
 }

+// ### This function only encodes to numeric ampersand escapes,
+// ### we could use standard ampersand values as well.
+inline static QString escapeUnencodeable(const QTextCodec* codec, const QString& s) {
+    QString enc_string;
+    int len = s.length();
+    for(int i=0; i <len; i++) {
+        QChar c = s[i];
+        if (codec->canEncode(c))
+            enc_string.append(c);
+        else {
+            QString ampersandEscape;
+            ampersandEscape.sprintf("&#%u;", c.unicode());
+            enc_string.append(ampersandEscape);
+        }
+    }
+    return enc_string;
+}
+
 inline static QCString fixUpfromUnicode(const QTextCodec* codec, const QString& s)
 {
-    QCString str = codec->fromUnicode(s);
+    QCString str = codec->fromUnicode(escapeUnencodeable(codec,s));
     str.truncate(str.length());
     return str;
Thanks a lot for the fix! I'll try it on the next KDE upgrade.