Version: (using KDE 4.3.2)
OS: Linux
Installed from: Debian testing/unstable Packages

Unicode characters whose code point has five hexadecimal digits (I only checked those in the SMP, that is, starting with 1) are not displayed correctly if specified as XML numeric character references. Only the last four hexadecimal digits are used, so a character from the BMP is selected instead, resulting in meaningless text.

Steps to reproduce:
1. Go to http://www.alanwood.net/unicode/egyptian-hieroglyphs.html
2. Instead of hieroglyphs (as should be shown if one had the Aegyptian font installed), Japanese letters and other characters are shown.
3. Go to http://en.wikipedia.org/wiki/Kana#Kana_in_Unicode and compare the symbols shown in (1) with these. It can be seen that (1)'s 13041 corresponds to (3)'s 3041, 13042 to 3042, etc.
4. Go to http://en.wikipedia.org/wiki/Gothic_alphabet . Gothic letters, which are also in the SMP, are displayed correctly (compare against the images). Looking at the page source shows that they are not XML entities but literal Unicode characters.

This only shows the bug for Egyptian hieroglyphs, but it is probably the case for all non-BMP code points. No plugins loaded.
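The mapping observed in the steps above (13041 shown as 3041, 13042 as 3042) is what one would expect if the code point were stored in a single 16-bit unit. The following is only a sketch of that arithmetic in plain C++, not KHTML code; the helper name truncateToUtf16Unit is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper (not part of KHTML or Qt): keeps only the low 16 bits
// of a code point, which is what happens when a value above 0xFFFF is
// squeezed into one 16-bit character.
constexpr char16_t truncateToUtf16Unit(std::uint32_t codePoint)
{
    return static_cast<char16_t>(codePoint & 0xFFFFu);
}

// U+13041 (in the Egyptian Hieroglyphs block) collapses to U+3041
// (HIRAGANA LETTER SMALL A), and U+13042 collapses to U+3042.
static_assert(truncateToUtf16Unit(0x13041u) == 0x3041, "13041 -> 3041");
static_assert(truncateToUtf16Unit(0x13042u) == 0x3042, "13042 -> 3042");
```

The leading hexadecimal digit is simply lost, which matches the Kana characters seen in place of the hieroglyphs.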
I believe the problem is in the file khtml/htmltokenizer.cpp , where this code is found:

    case Hexadecimal:
    {
        int uc = EntityChar.unicode();
        int ll = qMin<uint>(src.length(), 8);
        while(ll--) {
            QChar csrc(src->toLower());
            cc = csrc.cell();
            if(csrc.row() || !((cc >= '0' && cc <= '9') || (cc >= 'a' && cc <= 'f'))) {
                break;
            }
            uc = uc*16 + (cc - ( cc < 'a' ? '0' : 'a' - 10));
            cBuffer[cBufferPos++] = cc;
            ++src;
        }
        EntityChar = QChar(uc);
        Entity = SearchSemicolon;
        break;
    }
    case Decimal:
    {
        int uc = EntityChar.unicode();
        int ll = qMin(src.length(), 9-cBufferPos);
        while(ll--) {
            cc = src->cell();
            if(src->row() || !(cc >= '0' && cc <= '9')) {
                Entity = SearchSemicolon;
                break;
            }
            uc = uc * 10 + (cc - '0');
            cBuffer[cBufferPos++] = cc;
            ++src;
        }
        EntityChar = QChar(uc);
        if(cBufferPos == 9)
            Entity = SearchSemicolon;
        break;
    }

I think this code should generate two QChar in the case of unicode codepoints not in the Basic Multilingual Plane. Furthermore, I believe uc should be an unsigned int.
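To illustrate what "two QChar" would mean here: a code point above U+FFFF is represented in UTF-16 as a high and a low surrogate. The sketch below shows the standard UTF-16 arithmetic in plain C++ so that it stands alone; the function names are hypothetical and this is not a patch for htmltokenizer.cpp. As far as I know, Qt offers the equivalent static helpers QChar::highSurrogate(uint) and QChar::lowSurrogate(uint), which a real fix would more likely use.

```cpp
#include <cstdint>

// Hypothetical helpers illustrating UTF-16 surrogate encoding.
// Code points are held in an unsigned 32-bit value, as suggested above.

// True if the code point lies outside the Basic Multilingual Plane.
constexpr bool needsSurrogatePair(std::uint32_t codePoint)
{
    return codePoint >= 0x10000u && codePoint <= 0x10FFFFu;
}

// High (leading) surrogate: 0xD800 plus the top 10 bits of (codePoint - 0x10000).
constexpr char16_t highSurrogateOf(std::uint32_t codePoint)
{
    return static_cast<char16_t>(0xD800u + ((codePoint - 0x10000u) >> 10));
}

// Low (trailing) surrogate: 0xDC00 plus the bottom 10 bits of (codePoint - 0x10000).
constexpr char16_t lowSurrogateOf(std::uint32_t codePoint)
{
    return static_cast<char16_t>(0xDC00u + ((codePoint - 0x10000u) & 0x3FFu));
}

// U+13041 needs two UTF-16 units: D80C DC41.
static_assert(needsSurrogatePair(0x13041u), "U+13041 is outside the BMP");
static_assert(highSurrogateOf(0x13041u) == 0xD80C, "high surrogate of U+13041");
static_assert(lowSurrogateOf(0x13041u) == 0xDC41, "low surrogate of U+13041");
```

With something along these lines, the tokenizer would emit both units whenever needsSurrogatePair is true, instead of constructing a single QChar from the truncated value.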
I can confirm this. Testcase: http://www.yaronet.com/posts.php?s=130411 (should show a reversed B, assuming you have a font which covers Deseret, such as G. Douros's Analecta (gdouros-analecta-fonts in Fedora)). (No, I'm not interested in Mormon liturgy at all, I just picked that character because it showed up in kernel.org's April Fools joke. ;-) )
Thank you for the bug report. As this report hasn't seen any changes in 10 years or more, we ask if you can please confirm that the issue still persists. If this bug is no longer persisting or relevant please change the status to resolved.
Dear user,

KHTML (and KJS) was more or less unmaintained for a long time and was removed in KF6. Please migrate to a QWebEngine-based HTML component.

We will make no further fixes or improvements to the KF5 branches of these components besides important security fixes. For security issues, please see: https://kde.org/info/security/

Sorry that we did not fix this issue during the lifetime of KHTML.

Greetings
Christoph Cullmann