Bug 218954 - Broken interpretation of Unicode XML entities beyond BMP (SMP)
Status: CONFIRMED
Alias: None
Product: konqueror
Classification: Applications
Component: khtml
Version: unspecified
Platform: Debian testing Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Konqueror Developers
 
Reported: 2009-12-16 17:04 UTC by MD
Modified: 2021-03-21 00:25 UTC



Description MD 2009-12-16 17:04:12 UTC
Version:            (using KDE 4.3.2)
OS:                Linux
Installed from:    Debian testing/unstable Packages

Unicode code points with five hex digits (I only checked those in the
SMP, that is, starting with 1) are not displayed correctly if specified
as XML numeric entities. Only the last four hex digits are used, which
selects a character from the BMP instead and results in meaningless text.

Steps to reproduce:
1. Go to http://www.alanwood.net/unicode/egyptian-hieroglyphs.html
2. Instead of hieroglyphs (which should be shown if a font covering Egyptian
hieroglyphs is installed), Japanese kana and other characters are shown.
3. Go to http://en.wikipedia.org/wiki/Kana#Kana_in_Unicode and compare
the symbols shown in (1) with these: (1)'s 13041 corresponds to (3)'s
3041, 13042 to 3042, etc.
4. Go to http://en.wikipedia.org/wiki/Gothic_alphabet . Gothic letters,
which are also in the SMP, are displayed correctly there (compare against
the images). The page source shows that they are not XML entities but
literal Unicode characters.

This only shows the bug for Egyptian hieroglyphs, but it's probably the
case for all non-BMP code points.
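
The correspondence between 13041 and 3041 is what one would expect if the code point were squeezed into a single 16-bit code unit. A minimal sketch of that arithmetic in plain C++ follows; truncateToCodeUnit is a hypothetical helper written for illustration, not a khtml or Qt function, and it only assumes that the upper bits are discarded:

```cpp
#include <cstdint>

// Hypothetical helper (not from khtml or Qt): models what happens when a
// code point is stored in one 16-bit UTF-16 code unit. Everything above
// bit 15 is lost.
constexpr char16_t truncateToCodeUnit(std::uint32_t codePoint)
{
    return static_cast<char16_t>(codePoint & 0xFFFF);
}

// U+13041 (Egyptian hieroglyph) collapses to U+3041 (hiragana small a).
static_assert(truncateToCodeUnit(0x13041) == 0x3041, "upper bits are dropped");
// Code points that already fit in the BMP are unchanged.
static_assert(truncateToCodeUnit(0x3042) == 0x3042, "BMP value survives");
```

If this is what happens, every code point above U+FFFF would be affected, not only the hieroglyphs.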

No plugins loaded.
Comment 1 MD 2009-12-17 20:41:08 UTC
I believe the problem is in the file khtml/htmltokenizer.cpp, which contains this code:

        case Hexadecimal:
        {
            int uc = EntityChar.unicode();
            int ll = qMin<uint>(src.length(), 8);
            while(ll--) {
                QChar csrc(src->toLower());
                cc = csrc.cell();

                if(csrc.row() || !((cc >= '0' && cc <= '9') || (cc >= 'a' && cc <= 'f'))) {
                    break;
                }
                uc = uc*16 + (cc - ( cc < 'a' ? '0' : 'a' - 10));
                cBuffer[cBufferPos++] = cc;
                ++src;
            }
            EntityChar = QChar(uc);
            Entity = SearchSemicolon;
            break;
        }
        case Decimal:
        {
            int uc = EntityChar.unicode();
            int ll = qMin(src.length(), 9-cBufferPos);
            while(ll--) {
                cc = src->cell();

                if(src->row() || !(cc >= '0' && cc <= '9')) {
                    Entity = SearchSemicolon;
                    break;
                }

                uc = uc * 10 + (cc - '0');
                cBuffer[cBufferPos++] = cc;
                ++src;
            }
            EntityChar = QChar(uc);
            if(cBufferPos == 9)  Entity = SearchSemicolon;
            break;
        }

I think this code should generate two QChars (a UTF-16 surrogate pair) for Unicode code points outside the Basic Multilingual Plane. Furthermore, I believe uc should be an unsigned int.
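
As a rough sketch of what "two QChars" would mean, here is the standard UTF-16 surrogate arithmetic in plain C++, without Qt. The three function names are my own choice for illustration and this is not a patch against the tokenizer; as far as I know QChar offers static helpers for the same purpose, but I have not checked which Qt versions have them.

```cpp
#include <cstdint>

// Illustrative helpers only; the names are assumptions, not khtml code.

// True for code points that need two UTF-16 code units (U+10000..U+10FFFF).
constexpr bool needsSurrogatePair(std::uint32_t cp)
{
    return cp >= 0x10000 && cp <= 0x10FFFF;
}

// First code unit of the pair: top 10 bits of (cp - 0x10000), offset 0xD800.
// Only meaningful when needsSurrogatePair(cp) is true.
constexpr char16_t highSurrogate(std::uint32_t cp)
{
    return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}

// Second code unit of the pair: low 10 bits of (cp - 0x10000), offset 0xDC00.
// Only meaningful when needsSurrogatePair(cp) is true.
constexpr char16_t lowSurrogate(std::uint32_t cp)
{
    return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

// U+13041 from the hieroglyph page becomes the pair D80C DC41.
static_assert(needsSurrogatePair(0x13041), "outside the BMP");
static_assert(highSurrogate(0x13041) == 0xD80C, "high surrogate");
static_assert(lowSurrogate(0x13041) == 0xDC41, "low surrogate");
// BMP code points still fit in a single code unit.
static_assert(!needsSurrogatePair(0x3041), "inside the BMP");
```

The tokenizer would then emit both code units when needsSurrogatePair is true, and a single one otherwise. Values above 0x10FFFF and values in the surrogate range D800..DFFF are not valid code points and would need separate handling, which this sketch leaves out.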
Comment 2 Kevin Kofler 2010-04-02 11:54:47 UTC
I can confirm this.

Testcase: http://www.yaronet.com/posts.php?s=130411 (should show a reversed B, assuming you have a font which covers Deseret, such as G. Douros's Analecta (gdouros-analecta-fonts in Fedora)).

(No, I'm not interested in Mormon liturgy at all, I just picked that character because it showed up in kernel.org's April Fools joke. ;-) )
Comment 3 Justin Zobel 2021-03-21 00:25:11 UTC
Thank you for the bug report.

As this report hasn't seen any changes in 10 years or more, we ask that you please confirm whether the issue still persists.

If this bug no longer persists or is no longer relevant, please change the status to resolved.