SUMMARY
When copying text from Okular, all characters are fully decomposed, i.e. both canonical decomposition and compatibility decomposition are applied (see the Unicode Standard). While canonical decomposition is fine most of the time, compatibility decomposition is not always desired, because some formatting information is lost. This is especially a problem for Chinese punctuation: we almost always use fullwidth characters, but they are defined as compatibility-decomposable to their ASCII counterparts, which are almost never used in Chinese text (except when it is mixed with Latin script, where the ASCII forms are used only for the Latin parts).

STEPS TO REPRODUCE
1. Create any text file with the content "你好,世界!"
2. Open it with Okular
3. Copy the content

OBSERVED RESULT
The copied result is "你好,世界!", where "," (U+FF0C) is turned into "," (U+002C) and "!" (U+FF01) is turned into "!" (U+0021).

EXPECTED RESULT
There should be an option to turn off compatibility decomposition (or canonical decomposition, just in case), and the content should be copied as-is.

SOFTWARE/OS VERSIONS
Linux: Ubuntu 23.04
KDE Plasma Version: 5.27.4
KDE Frameworks Version: 5.104.0
Qt Version: 5.15.8

ADDITIONAL INFORMATION
The version provided by Ubuntu may be a little old, but that shouldn't matter.
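The character substitution described in this report can be reproduced outside Okular. The following is a minimal sketch using Python's standard `unicodedata` module; it is not Okular's code, and the helper name `codepoints` is made up for illustration. It compares NFC (canonical normalization only) with NFKC (which also applies compatibility mappings) on the sample string, written with escapes so the fullwidth characters are unambiguous.

```python
import unicodedata

# Sample text from the report, with FULLWIDTH COMMA (U+FF0C)
# and FULLWIDTH EXCLAMATION MARK (U+FF01).
text = "\u4f60\u597d\uff0c\u4e16\u754c\uff01"


def codepoints(s):
    """Hypothetical helper: render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# NFC applies canonical mappings only, so fullwidth punctuation survives.
nfc = unicodedata.normalize("NFC", text)

# NFKC additionally applies compatibility mappings, which fold the
# fullwidth forms into their ASCII counterparts.
nfkc = unicodedata.normalize("NFKC", text)

print("original:", codepoints(text))
print("NFC:     ", codepoints(nfc))
print("NFKC:    ", codepoints(nfkc))
```

In this sketch the NFC result is identical to the input, while the NFKC result contains U+002C and U+0021 in place of U+FF0C and U+FF01, which matches the observed result above.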
A possibly relevant merge request was started @ https://invent.kde.org/graphics/okular/-/merge_requests/941
Git commit 322fd2d54e4226f6dbb4fb357a86931a5c790340 by Sune Vuorela, on behalf of Wendi Gan.
Committed on 24/05/2024 at 10:02.
Pushed by sune into branch 'master'.

fix Unicode Normalization: replace NFKC to NFC

Use NFC in copy, makeWord, and export functions, and NFKC for search operations.
NFKC may alter characters when copied or exported. For example ⑥ in pdf will be pasted as 6.
So most instances are replaced with NFC.
To simplify matching during search operation, NFKC is used.
Related: bug 466521

M +12 -9 core/textpage.cpp
M +1 -1 generators/poppler/generator_pdf.cpp

https://invent.kde.org/graphics/okular/-/commit/322fd2d54e4226f6dbb4fb357a86931a5c790340
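The split described in the commit message (NFC for copied or exported text, NFKC for search matching) can be illustrated roughly as follows. This is only a sketch in Python with the standard `unicodedata` module, not the actual change in core/textpage.cpp; the function names `text_for_copy` and `search_matches` are hypothetical.

```python
import unicodedata


def text_for_copy(s):
    """Hypothetical: text handed to the clipboard keeps compatibility characters (NFC)."""
    return unicodedata.normalize("NFC", s)


def search_matches(query, page_text):
    """Hypothetical: fold both sides with NFKC so that variants match each other."""
    folded_query = unicodedata.normalize("NFKC", query)
    folded_page = unicodedata.normalize("NFKC", page_text)
    return folded_query in folded_page


# U+2465 is CIRCLED DIGIT SIX, the example character from the commit message.
page = "item \u2465"

copied = text_for_copy(page)            # the circled digit is preserved
found = search_matches("6", page)       # a plain "6" still finds it
```

With NFC the circled digit is copied unchanged, while a search for a plain "6" still matches because both the query and the page text are folded with NFKC before comparison.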