Created attachment 160960 [details] test file SUMMARY The XML loading code seems to not parse XML entities of non-BMP codepoints (e.g. `👩`) correctly, truncating the codepoint to 16-bit. STEPS TO REPRODUCE 1. Open test file OBSERVED RESULT Image shows two lines of text, first line is "佔位文字 👩 Testing", the second line is the same except the emoji is replaced by the PUA character U+F469 (may appear as anything). EXPECTED RESULT Both lines should be "佔位文字 👩 Testing".
just a head's up, that file has the same string for me twice? Maybe firefox or something is being too helpful? either way, it seems "⟲" will work, as will "&" and "⟳", but "👩" and " 👩" will not. Confusing things: if I use s->text.toHtmlEscaped() when saving the text in bool KoSvgTextChunkShape::saveSvg(SvgSavingContext &context), the plain text-editor will not show me the escaped html...
Seems KoXMLWriter.cpp has its own escape function: void KoXmlWriter::addTextNode(const char* cstr), however, the SVG parser uses QDomDocument... do we have evidence elsewhere that Qt's QDomDocument cannot parse these supplementary plane characters?
(In reply to wolthera from comment #1) > just a head's up, that file has the same string for me twice? Maybe firefox > or something is being too helpful? View source shows me that the second text string uses the &# entity... (In reply to wolthera from comment #2) > do we have evidence elsewhere > that Qt's QDomDocument cannot parse these supplementary plane characters? It seems some of Qt's XML parsing does handle them: - The preview thumbnail of this SVG shows the emoji on both lines. I believe that thumbnail is generated by QtSvg. - If you add 👩 to the SVG editor then switch to rich text, the emoji is converted correctly there, which tells us at least QXmlStreamReader is able to handle it. But probably neither of them use QDomDocument...
Even something like &heart; leads to the xml-text-node being split O_O It happens in KoShape *SvgParser::parseTextElement(const QDomElement &e, KoSvgTextShape *mergeIntoShape): QDomText onlyTextChild = getTheOnlyTextChild(e); if (!onlyTextChild.isNull()) { textChunk->loadSvgTextNode(onlyTextChild, m_context); } else { QList<KoShape*> childShapes = parseContainer(e, true); addToGroup(childShapes, textChunk); } but getTheOnlyTextChild() seems to just check the textnodes...
XML only supports the entities &, <, >, ', and ", I think any others are technically invalid, unless they are declared in custom DTDs or something? Also I changed the issue subject because "character reference" is the correct name for `👩`.
It seems to be a problem that only happens when using QDomDocument::setContent with QXmlSimpleReader and QXmlInputSource. If I use QDomDocument::setContent with QXmlStreamReader then the character reference is parsed correctly. The former setContent overload is deprecated (and removed in Qt6), but unfortunately switching over to QXmlStreamReader is *not* an option, because we require preserving white-spaces in text nodes and QXmlStreamReader did not have an option for this until Qt 6.5. [1][2] I guess we can patch our own Qt but that means distro packages won't have the fix, unless we also get it fixed in KDE's patch collection (which we can do, I suppose). This is where the truncation happens: https://invent.kde.org/qt/qt/qtbase/-/blob/83c69c89fb4495f2fa3d1f302c3aac509fc60313/src/xml/sax/qxml.cpp#L7464-7485 stringAddC(QChar(tmp)); [1]: https://bugreports.qt.io/browse/QTBUG-104130 [2]: https://doc.qt.io/qt-6/xml-changes-qt6.html#spacing-only-text-nodes
Created attachment 160977 [details] Minimal reproducer code
> This is where the truncation happens: > https://invent.kde.org/qt/qt/qtbase/-/blob/ > 83c69c89fb4495f2fa3d1f302c3aac509fc60313/src/xml/sax/qxml.cpp#L7464-7485 > > stringAddC(QChar(tmp)); And I think the simple fix is to change this to: if (Q_LIKELY(!QChar::requiresSurrogates(tmp))) { stringAddC(QChar(tmp)); } else { stringAddC(QChar(QChar::highSurrogate(tmp))); stringAddC(QChar(QChar::lowSurrogate(tmp))); }
For tracking: - Upstream fix (merged in 2020): https://codereview.qt-project.org/c/qt/qtbase/+/298673 - Patch for our Qt fork: https://invent.kde.org/szaman/qtbase/-/merge_requests/4 - Patch for KDE's patch collection: https://invent.kde.org/qt/qt/qtbase/-/merge_requests/276
Fixed by Qt patch. https://invent.kde.org/graphics/krita/-/commit/2dbc54ea58043392b4c9443006e8820a00da30ad