Bug 473380 - XML character reference of non-BMP codepoints in SVG files are truncated
Summary: XML character reference of non-BMP codepoints in SVG files are truncated
Status: RESOLVED FIXED
Alias: None
Product: krita
Classification: Applications
Component: File formats (show other bugs)
Version: git master (please specify the git hash!)
Platform: Microsoft Windows Microsoft Windows
: NOR minor
Target Milestone: ---
Assignee: Krita Bugs
URL: https://bugreports.qt.io/browse/QTBUG...
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-14 17:51 UTC by Alvin Wong
Modified: 2023-08-18 16:14 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
test file (330 bytes, image/svg+xml)
2023-08-14 17:51 UTC, Alvin Wong
Details
Minimal reproducer code (839 bytes, text/plain)
2023-08-15 10:50 UTC, Alvin Wong
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alvin Wong 2023-08-14 17:51:28 UTC
Created attachment 160960 [details]
test file

SUMMARY
The XML loading code seems to not parse XML entities of non-BMP codepoints (e.g. `👩`) correctly, truncating the codepoint to 16-bit.

STEPS TO REPRODUCE
1. Open test file

OBSERVED RESULT
Image shows two lines of text, first line is "佔位文字 👩 Testing", the second line is the same except the emoji is replaced by the PUA character U+F469 (may appear as anything).

EXPECTED RESULT
Both lines should be "佔位文字 👩 Testing".
Comment 1 wolthera 2023-08-14 18:27:16 UTC
just a head's up, that file has the same string for me twice? Maybe firefox or something is being too helpful?

either way, it seems "⟲" will work, as will "&" and "⟳", but "👩" and " 👩" will not.

Confusing things: if I use s->text.toHtmlEscaped() when saving the text in bool KoSvgTextChunkShape::saveSvg(SvgSavingContext &context), the plain text-editor will not show me the escaped html...
Comment 2 wolthera 2023-08-14 18:32:49 UTC
Seems KoXMLWriter.cpp has its own escape function: void KoXmlWriter::addTextNode(const char* cstr),
however, the SVG parser uses QDomDocument... do we have evidence elsewhere that Qt's QDomDocument cannot parse these supplementary plane characters?
Comment 3 Alvin Wong 2023-08-14 18:45:42 UTC
(In reply to wolthera from comment #1)
> just a head's up, that file has the same string for me twice? Maybe firefox
> or something is being too helpful?

View source shows me that the second text string uses the &# entity...

(In reply to wolthera from comment #2)
> do we have evidence elsewhere
> that Qt's QDomDocument cannot parse these supplementary plane characters?

It seems some of Qt's XML parsing does handle them:
- The preview thumbnail of this SVG shows the emoji on both lines. I believe that thumbnail is generated by QtSvg.
- If you add 👩 to the SVG editor then switch to rich text, the emoji is converted correctly there, which tells us at least QXmlStreamReader is able to handle it.

But probably neither of them use QDomDocument...
Comment 4 wolthera 2023-08-14 19:00:42 UTC
Even something like &heart; leads to the xml-text-node being split O_O

It happens in KoShape *SvgParser::parseTextElement(const QDomElement &e, KoSvgTextShape *mergeIntoShape):

QDomText onlyTextChild = getTheOnlyTextChild(e);
    if (!onlyTextChild.isNull()) {
        textChunk->loadSvgTextNode(onlyTextChild, m_context);
    } else {
        QList<KoShape*> childShapes = parseContainer(e, true);
        addToGroup(childShapes, textChunk);
    }

but getTheOnlyTextChild() seems to just check the textnodes...
Comment 5 Alvin Wong 2023-08-14 19:13:50 UTC
XML only supports the entities &amp;, &lt;, &gt;, &apos;, and &quot;, I think any others are technically invalid, unless they are declared in custom DTDs or something?

Also I changed the issue subject because "character reference" is the correct name for `&#x1F469;`.
Comment 6 Alvin Wong 2023-08-15 10:46:36 UTC
It seems to be a problem that only happens when using QDomDocument::setContent  with QXmlSimpleReader and QXmlInputSource. If I use QDomDocument::setContent with QXmlStreamReader then the character reference is parsed correctly.

The former setContent overload is deprecated (and removed in Qt6), but unfortunately switching over to QXmlStreamReader is *not* an option, because we require preserving white-spaces in text nodes and QXmlStreamReader did not have an option for this until Qt 6.5. [1][2] I guess we can patch our own Qt but that means distro packages won't have the fix, unless we also get it fixed in KDE's patch collection (which we can do, I suppose).

This is where the truncation happens: https://invent.kde.org/qt/qt/qtbase/-/blob/83c69c89fb4495f2fa3d1f302c3aac509fc60313/src/xml/sax/qxml.cpp#L7464-7485

    stringAddC(QChar(tmp));

[1]: https://bugreports.qt.io/browse/QTBUG-104130
[2]: https://doc.qt.io/qt-6/xml-changes-qt6.html#spacing-only-text-nodes
Comment 7 Alvin Wong 2023-08-15 10:50:42 UTC
Created attachment 160977 [details]
Minimal reproducer code
Comment 8 Alvin Wong 2023-08-15 10:54:00 UTC
> This is where the truncation happens:
> https://invent.kde.org/qt/qt/qtbase/-/blob/
> 83c69c89fb4495f2fa3d1f302c3aac509fc60313/src/xml/sax/qxml.cpp#L7464-7485
> 
>     stringAddC(QChar(tmp));

And I think the simple fix is to change this to:

    if (Q_LIKELY(!QChar::requiresSurrogates(tmp))) {
        stringAddC(QChar(tmp));
    } else {
        stringAddC(QChar(QChar::highSurrogate(tmp)));
        stringAddC(QChar(QChar::lowSurrogate(tmp)));
    }
Comment 9 Alvin Wong 2023-08-16 08:18:43 UTC
For tracking:

- Upstream fix (merged in 2020): https://codereview.qt-project.org/c/qt/qtbase/+/298673
- Patch for our Qt fork: https://invent.kde.org/szaman/qtbase/-/merge_requests/4
- Patch for KDE's patch collection: https://invent.kde.org/qt/qt/qtbase/-/merge_requests/276