Bug 473495 - Give an option not to apply compatibility decomposition when copying
Status: REPORTED
Alias: None
Product: okular
Classification: Applications
Component: general
Version: 22.12.3
Platform: Ubuntu Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Okular developers
 
Reported: 2023-08-18 02:58 UTC by Huanyu Liu
Modified: 2024-05-24 10:52 UTC

Description Huanyu Liu 2023-08-18 02:58:36 UTC
SUMMARY
When copying from Okular, all characters are fully decomposed, i.e. both canonical decomposition and compatibility decomposition are applied (see the Unicode Standard). While canonical decomposition is fine most of the time, compatibility decomposition is not always desired, since some formatting information is lost. It is especially a problem for Chinese punctuation: we almost always use fullwidth characters, but they are defined as compatibility-decomposable to their ASCII counterparts, which are almost never used in Chinese text (except when it is mixed with Latin script, in which case the ASCII characters are used only for the Latin parts).
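
For readers unfamiliar with the distinction, the following is a minimal Python sketch (not Okular code) that inspects the decomposition mapping the Unicode Character Database assigns to a character. The helper name `decomposition_kind` is made up for this illustration. Compatibility mappings carry a tag such as `<wide>`, while canonical mappings do not.

```python
import unicodedata


def decomposition_kind(ch: str) -> str:
    """Classify the decomposition mapping of a single character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings start with a tag like <wide> or <circle>;
    # canonical mappings are a bare list of code points.
    return "compatibility" if mapping.startswith("<") else "canonical"


# Escapes are used so the fullwidth characters cannot be altered in transit:
# U+FF0C FULLWIDTH COMMA, U+FF01 FULLWIDTH EXCLAMATION MARK,
# U+00E9 LATIN SMALL LETTER E WITH ACUTE, U+4F60 (a CJK ideograph).
for ch in ("\uFF0C", "\uFF01", "\u00E9", "\u4F60"):
    print(f"U+{ord(ch):04X}", decomposition_kind(ch), unicodedata.decomposition(ch))
```

The fullwidth comma and exclamation mark are reported as compatibility-decomposable, which is why a compatibility normalization form replaces them with ASCII.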


STEPS TO REPRODUCE
1. Create any text file with the content "你好,世界!"
2. Open it with Okular
3. Copy the content

OBSERVED RESULT
The copied result is "你好,世界!", where "," (U+FF0C) is turned into "," (U+002C) and "!" (U+FF01) is turned into "!" (U+0021).
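
The same substitution can be reproduced outside Okular. This is a small Python sketch using the standard `unicodedata` module, on the assumption that the copy path applies a compatibility normalization form such as NFKC. It only illustrates the Unicode behaviour and says nothing about how Okular implements copying.

```python
import unicodedata

# The test string from the steps above, written with escapes so the
# fullwidth comma (U+FF0C) and exclamation mark (U+FF01) stay intact.
original = "\u4F60\u597D\uFF0C\u4E16\u754C\uFF01"

nfc = unicodedata.normalize("NFC", original)    # canonical forms only
nfkc = unicodedata.normalize("NFKC", original)  # also applies compatibility mappings

print([f"U+{ord(c):04X}" for c in nfc])
print([f"U+{ord(c):04X}" for c in nfkc])
```

With NFC the code points are unchanged. With NFKC the two fullwidth punctuation marks become U+002C and U+0021, matching the observed result.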

EXPECTED RESULT
There should be an option to turn off compatibility decomposition (and perhaps canonical decomposition too, just in case), so that the content is copied as-is.

SOFTWARE/OS VERSIONS
Linux: Ubuntu 23.04
KDE Plasma Version: 5.27.4
KDE Frameworks Version: 5.104.0
Qt Version: 5.15.8

ADDITIONAL INFORMATION
The version provided by Ubuntu may be a little old, but that shouldn't matter.
Comment 1 Bug Janitor Service 2024-03-05 09:58:13 UTC
A possibly relevant merge request was started @ https://invent.kde.org/graphics/okular/-/merge_requests/941
Comment 2 Sune Vuorela 2024-05-24 10:52:18 UTC
Git commit 322fd2d54e4226f6dbb4fb357a86931a5c790340 by Sune Vuorela, on behalf of Wendi Gan.
Committed on 24/05/2024 at 10:02.
Pushed by sune into branch 'master'.

fix Unicode Normalization: replace NFKC to NFC

Use NFC in copy, makeWord, and export functions, and NFKC for search operations.
NFKC may alter characters when copied or exported. For example ⑥ in pdf will be pasted as 6. So most instances are replaced with NFC.
To simplify matching during search operation, NFKC is used.
Related: bug 466521

M  +12   -9    core/textpage.cpp
M  +1    -1    generators/poppler/generator_pdf.cpp

https://invent.kde.org/graphics/okular/-/commit/322fd2d54e4226f6dbb4fb357a86931a5c790340
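
The split described in the commit message (NFC for copy and export, NFKC for search) can be sketched in Python as follows. The function names `text_for_copy`, `search_key`, and `contains` are hypothetical and do not correspond to anything in `core/textpage.cpp`; the actual change is in C++ and may differ in detail.

```python
import unicodedata


def text_for_copy(text: str) -> str:
    """Normalize text for the clipboard: NFC keeps compatibility characters."""
    return unicodedata.normalize("NFC", text)


def search_key(text: str) -> str:
    """Normalize text for matching: NFKC folds compatibility characters."""
    return unicodedata.normalize("NFKC", text)


def contains(page_text: str, query: str) -> bool:
    """Hypothetical search helper: compare both sides in NFKC form."""
    return search_key(query) in search_key(page_text)


page = "Step \u2465: done"  # U+2465 CIRCLED DIGIT SIX, the example from the commit

print(text_for_copy(page) == page)  # the circled digit survives copying
print(contains(page, "6"))          # a plain "6" still finds it when searching
```

Normalizing both the page text and the query with NFKC lets a user type an ordinary digit or ASCII punctuation to find its circled or fullwidth variant, while the copied text keeps the original characters.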