Bug 505879 - KDiff3 corrupts Unicode characters/emoji not in UTF-16 BMP
Summary: KDiff3 corrupts Unicode characters/emoji not in UTF-16 BMP
Status: REPORTED
Alias: None
Product: kdiff3
Classification: Applications
Component: application (other bugs)
Version First Reported In: 1.12.3
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: michael
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-06-20 23:58 UTC by nyanpasu64
Modified: 2025-09-15 09:59 UTC
CC List: 1 user

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:


Attachments
Text files with emoji and other non-BMP characters inside. (845 bytes, application/zip)
2025-06-20 23:58 UTC, nyanpasu64

Description nyanpasu64 2025-06-20 23:58:32 UTC
Created attachment 182455
Text files with emoji and other non-BMP characters inside.

SUMMARY
When loading UTF-8 files, KDiff3 replaces emoji and other non-BMP characters (those that need a surrogate pair, i.e. 4 bytes, in UTF-16) with the Unicode replacement character �.
This is a problem when loading Unicode source files, or (more commonly in my case) when merging plaintext/Markdown files with emoji inside.
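To illustrate which characters are affected, here is a minimal standard-C++ sketch of how a code point maps to UTF-16 code units. The function name encodeUtf16 is hypothetical and is not part of KDiff3 or Qt; it only shows that anything above U+FFFF needs two units.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not KDiff3 or Qt code): encode one Unicode code point
// as UTF-16. BMP code points (<= U+FFFF) take one code unit; code points
// above that take a surrogate pair, i.e. two code units.
std::u16string encodeUtf16(char32_t cp) {
    // Precondition: a valid scalar value (in range, not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp <= 0xFFFF) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, U+1F600 (an emoji) encodes to the two units 0xD83D 0xDE00, while U+0041 encodes to a single unit.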

STEPS TO REPRODUCE
1. Run a diff or merge between files containing emoji (e.g. the attached files).
2. Save the result.
3. Open in a text editor.

OBSERVED RESULT
Both in KDiff3 and when opening the saved file, all emoji and non-BMP characters are replaced with �.

In "Configure KDiff3..." > Regional Settings, all file encodings are set to "Unicode, 8 Bit (UTF-8)". Unchecking "Auto Detect" did not fix the issue.

EXPECTED RESULT
All Unicode characters are preserved.

SOFTWARE/OS VERSIONS
The bug occurs on both Linux and macOS (the latter installed via Homebrew).
Operating System: Fedora Linux 42
KDE Plasma Version: 6.3.5
KDE Frameworks Version: 6.14.0
Qt Version: 6.9.1
Kernel Version: 6.14.11-300.fc42.x86_64 (64-bit)
Graphics Platform: Wayland
Processors: 4 × Intel® Core™ i5-3570K CPU @ 3.40GHz
Memory: 7.6 GiB of RAM
Graphics Processor: Intel® HD Graphics 4000
Manufacturer: M&A Technology
Product Name: DB75EN__

ADDITIONAL INFORMATION
Comment 1 michael 2025-07-23 16:26:24 UTC
I'll have a look, but it is very likely this needs to be reported to the Qt maintainers, as kdiff3 relies on Qt to properly handle encoding.
Comment 2 nyanpasu64 2025-08-06 23:13:14 UTC
I checked EncodedDataStream on the 1.12 branch (I couldn't build master on Fedora because master depends on a newer Boost than Fedora ships). The bug is that EncodedDataStream::readChar() assumes that QStringDecoder::operator()(const QByteArray &ba) will only return 1 UTF-16 code unit to s. But non-BMP characters are converted into two UTF-16 QChars (a surrogate pair); readChar() takes only the first, writes it into the QChar out-parameter, and returns the number of file-encoding chars consumed.

You'd probably have to refactor this part of the code to either return a Unicode/UTF-32 codepoint (unsigned/int), write to a QChar[2] (set the second to 0 if unused) or resize a vector<QChar>, or something else. I don't know if there are more bugs revealed once this is fixed.
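As a rough sketch of the first option (returning a full code point instead of a single QChar), the following standard-C++ function decodes one UTF-8 sequence and reports both the code point and the number of bytes consumed. The name decodeOne and its signature are assumptions made for illustration; this is not the actual EncodedDataStream::readChar(), and it does not use QStringDecoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not KDiff3 code): decode one UTF-8 sequence starting
// at `pos`. Writes the complete code point to `cp` and returns the number of
// bytes consumed, or 0 if the input is truncated or malformed.
// Simplification: overlong encodings and surrogate values are not rejected.
std::size_t decodeOne(const std::string& in, std::size_t pos, char32_t& cp) {
    assert(pos <= in.size());  // precondition: pos is within the buffer
    if (pos == in.size()) return 0;
    const unsigned char lead = static_cast<unsigned char>(in[pos]);
    std::size_t len = 0;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return 0;                        // stray continuation or invalid lead byte
    if (pos + len > in.size()) return 0;  // sequence cut off by end of input
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(in[pos + i]);
        if ((c & 0xC0) != 0x80) return 0; // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    return len;
}
```

Because the caller receives a char32_t, a 4-byte sequence such as F0 9F 98 80 comes back as the single value U+1F600, and the caller decides how to store it (for instance as two UTF-16 units).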
Comment 3 nyanpasu64 2025-08-06 23:52:21 UTC
I came up with a simpler fix (for an astral character, output multiple QChar over multiple calls to EncodedDataStream::readChar()). I found you also have to override atEnd() to return false when the stream is at end but we're not done outputting characters, otherwise an emoji at the end of a file will be truncated while reading.
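The idea can be sketched in standard C++ as follows. This is only an illustration under stated assumptions: the class UnitReader and its members are hypothetical stand-ins, the input is a pre-decoded list of code points rather than a QDataStream, and the code in the fix-emoji branch may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the reader (not KDiff3's EncodedDataStream).
// It hands out one UTF-16 code unit per call and keeps the low surrogate of
// an astral character pending until the next call.
class UnitReader {
public:
    explicit UnitReader(std::vector<char32_t> codePoints)
        : mInput(std::move(codePoints)) {}

    // Reports end only when the input is exhausted AND no low surrogate is
    // still waiting; otherwise a trailing emoji would lose its second half.
    bool atEnd() const { return mPos >= mInput.size() && !mHasPending; }

    // Returns the next UTF-16 code unit. Precondition: !atEnd().
    char16_t readUnit() {
        assert(!atEnd());
        if (mHasPending) {            // second call for an astral character
            mHasPending = false;
            return mPending;
        }
        const char32_t cp = mInput[mPos++];
        if (cp <= 0xFFFF) return static_cast<char16_t>(cp);
        const char32_t v = cp - 0x10000;
        mPending = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate, kept for later
        mHasPending = true;
        return static_cast<char16_t>(0xD800 + (v >> 10));        // high surrogate, returned now
    }

private:
    std::vector<char32_t> mInput;
    std::size_t mPos = 0;
    char16_t mPending = 0;
    bool mHasPending = false;
};
```

A caller looping on `while (!reader.atEnd())` then receives both halves of a surrogate pair even when the astral character is the last one in the input.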

- I'm beginning to think that inheriting from QDataStream was a leaky abstraction.

Saving works fine without changes, and emoji are saved properly both within and at the end of a file.

I've pushed my code to https://invent.kde.org/nyanpasu/kdiff3/-/commits/fix-emoji?ref_type=heads . I didn't create a pull request since the upstream repo seems to be set up to make them to master, but I can't build master on my Fedora machine.