Created attachment 182455
Text files with emoji and other non-BMP characters inside.

SUMMARY
When loading UTF-8 files, KDiff3 replaces UTF-8 emoji and other non-BMP characters (with 4-byte UTF-16 representations) with the Unicode replacement character �. This is a problem when loading Unicode source files, or (more commonly in my case) when merging plaintext/Markdown files with emoji inside.

STEPS TO REPRODUCE
1. Run a diff or merge between files with emoji (e.g. the attached files).
2. Save the result.
3. Open in a text editor.

OBSERVED RESULT
Both in KDiff3 and when opening the saved file, all emoji and non-BMP characters are replaced with �.
In "Configure KDiff3..." > Regional Settings, all file encodings are set to "Unicode, 8 Bit (UTF-8)". Unchecking "Auto Detect" did not fix the issue.

EXPECTED RESULT
All Unicode characters are preserved.

SOFTWARE/OS VERSIONS
The bug occurs on both Linux and macOS (KDiff3 installed via Homebrew).
Operating System: Fedora Linux 42
KDE Plasma Version: 6.3.5
KDE Frameworks Version: 6.14.0
Qt Version: 6.9.1
Kernel Version: 6.14.11-300.fc42.x86_64 (64-bit)
Graphics Platform: Wayland
Processors: 4 × Intel® Core™ i5-3570K CPU @ 3.40GHz
Memory: 7.6 GiB of RAM
Graphics Processor: Intel® HD Graphics 4000
Manufacturer: M&A Technology
Product Name: DB75EN__

ADDITIONAL INFORMATION
I'll have a look, but it is very likely that this needs to be reported to the Qt maintainers, as kdiff3 relies on Qt to handle encoding properly.
I checked EncodedDataStream on the 1.12 branch (I couldn't build master on Fedora because master depends on a newer Boost than Fedora ships). The bug is that EncodedDataStream::readChar() assumes that QStringDecoder::operator()(const QByteArray &ba) will only return one UTF-16 code unit to s. But a non-BMP character is converted into two UTF-16 QChars (a surrogate pair). readChar() only takes the first, writes it into the QChar out-parameter, and returns the number of file-encoding chars consumed, so the second QChar is lost. You'd probably have to refactor this part of the code to return a Unicode/UTF-32 codepoint (unsigned/int), write to a QChar[2] (set the second to 0 if unused), resize a vector<QChar>, or something else. I don't know if there are more bugs revealed once this is fixed.
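To illustrate the failure mode described above, here is a minimal sketch in plain standard C++ rather than Qt. The helpers toUtf16() and firstUnitOnly() are hypothetical names invented for this example and are not part of KDiff3 or Qt. The sketch only shows that a code point above U+FFFF becomes two UTF-16 code units, and that keeping just the first one leaves a lone high surrogate, which would plausibly explain the � in the output.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not KDiff3 or Qt code): encode one Unicode code point
// as UTF-16. Assumes cp is a valid scalar value (not itself a surrogate).
std::u16string toUtf16(char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        // BMP character: fits in a single 16-bit code unit.
        return std::u16string(1, static_cast<char16_t>(cp));
    }
    // Non-BMP character: split the 20 remaining bits into a surrogate pair.
    cp -= 0x10000;
    std::u16string units;
    units += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
    units += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    return units;
}

// Mimics the pattern described in the comment above: only the first code
// unit is kept, so the low surrogate of a non-BMP character is dropped.
char16_t firstUnitOnly(char32_t cp) {
    return toUtf16(cp)[0];
}
```

For U+1F600 (😀) the pair is 0xD83D 0xDE00, so firstUnitOnly() yields 0xD83D alone, which is not a valid character by itself.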
I came up with a simpler fix: for an astral character, output multiple QChars over multiple calls to EncodedDataStream::readChar(). I found you also have to override atEnd() to return false when the stream is at its end but we're not done outputting characters, otherwise an emoji at the end of a file will be truncated while reading. I'm beginning to think that inheriting from QDataStream was a leaky abstraction. Saving works fine without changes, and emoji are saved properly both within and at the end of a file. I've pushed my code to https://invent.kde.org/nyanpasu/kdiff3/-/commits/fix-emoji?ref_type=heads . I didn't create a pull request since the upstream repo seems to be set up to target master, but I can't build master on my Fedora machine.
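The approach described above (one code unit per call, plus an atEnd() that accounts for a unit still waiting to be returned) could look roughly like the sketch below. This is not the code from the linked branch. It is a standalone illustration in standard C++ with a hypothetical class Utf8UnitReader standing in for EncodedDataStream, and it assumes UTF-8 input with valid lead bytes, doing its own decoding instead of using QStringDecoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for EncodedDataStream (names are assumptions).
class Utf8UnitReader {
public:
    explicit Utf8UnitReader(std::string bytes) : mBytes(std::move(bytes)) {}

    // Not at end while a low surrogate is still waiting to be handed out,
    // even if every input byte has already been consumed.
    bool atEnd() const { return !mHasPending && mPos >= mBytes.size(); }

    // Writes one UTF-16 code unit to `out` and returns the number of input
    // bytes consumed by this call (0 when it only drains the pending unit).
    std::size_t readChar(char16_t& out) {
        assert(!atEnd());
        if (mHasPending) {
            out = mPending;
            mHasPending = false;
            return 0;
        }
        const unsigned char lead = static_cast<unsigned char>(mBytes[mPos]);
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        if (mPos + len > mBytes.size()) {
            // Truncated sequence: emit U+FFFD and consume the remaining bytes.
            const std::size_t rest = mBytes.size() - mPos;
            mPos = mBytes.size();
            out = 0xFFFD;
            return rest;
        }
        // Decode the code point: payload bits of the lead byte, then 6 bits
        // from each continuation byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t i = 1; i < len; ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(mBytes[mPos + i]) & 0x3F);
        }
        mPos += len;
        if (cp < 0x10000) {
            out = static_cast<char16_t>(cp);
        } else {
            // Astral character: return the high surrogate now and keep the
            // low surrogate for the next call.
            cp -= 0x10000;
            out = static_cast<char16_t>(0xD800 + (cp >> 10));
            mPending = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
            mHasPending = true;
        }
        return len;
    }

private:
    std::string mBytes;
    std::size_t mPos = 0;
    char16_t mPending = 0;
    bool mHasPending = false;
};
```

With this shape, a caller looping on `while (!reader.atEnd())` still receives the second half of an emoji that is the last character in the file, because atEnd() stays false until the pending unit has been returned.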