Bug 465305 - character counts are wrong when text includes emojis; each counted as two (2)
Status: RESOLVED INTENTIONAL
Alias: None
Product: frameworks-ktexteditor
Classification: Frameworks and Libraries
Component: general
Version: unspecified
Platform: Other Other
Importance: NOR normal
Target Milestone: ---
Assignee: KWrite Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-02-05 05:24 UTC by kdebugs
Modified: 2023-03-29 18:53 UTC
CC List: 3 users

See Also:
Latest Commit:
Version Fixed In:


Attachments

Description kdebugs 2023-02-05 05:24:25 UTC
SUMMARY
Product/Component unknown; sorry.  Observed in kate and konsole, so probably affects something they both depend on.

STEPS TO REPRODUCE (kate)
1. In kate, enter an emoji, e.g. 😊
2. Move the cursor back and forth between the positions before and after the emoji.
3. Look at line:column indicator at bottom.

OBSERVED RESULT (kate)
Column jumps by two for a single character.

EXPECTED RESULT (kate)
Column should increase by only one per character.

STEPS TO REPRODUCE (konsole)
1. In kate, copy and paste the emoji until you have OVER 4000 (e.g. 4001).  (Remember that the column number will say 8003 at the end of a line with 4001 emojis.)
2. Select them all and copy to clipboard.
3. In konsole, run 'python3'.  Then type:
len("""
4. Press Ctrl+Shift+V (or go to Edit, Paste; or right-click and select Paste).

OBSERVED RESULT (konsole)
It will ask you if you want to paste X number of characters (e.g. 8002) instead of the correct number (e.g. 4001).
Answer 'yes'.  Then complete the python expression with:
""")
and hit enter.
The correct number of characters (e.g. 4001) is displayed.

EXPECTED RESULT (konsole)
It should count the characters correctly, not double-count them.

SOFTWARE/OS VERSIONS
Kubuntu 22.10
KDE Plasma Version: 5.25.5
KDE Frameworks Version: 5.98.0
Qt Version: 5.15.6
Kate 22.08.2
Konsole 22.08.2

ADDITIONAL INFORMATION
For casual users, the number of characters may not really matter, but people like me who do programming or work on data projects need correct character counts, without having to wonder where X characters went or where X characters magically came from. If it's a single Unicode code point (e.g. U+1F60A), then it needs to be treated as just one character, regardless of how many bytes it takes to encode in a particular encoding. The whole point of working with text instead of bytes is that you can work with characters without worrying about how things are encoded under the hood.
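The mismatch described above can be reproduced outside Kate and Konsole. The following is a minimal sketch, not code from either application: it assumes the doubled figure comes from counting UTF-16 code units, and the helper name utf16_units is made up for illustration. Python's len() counts code points, which is why the interpreter reports 4001.

```python
def utf16_units(text: str) -> int:
    """Return how many 16-bit code units UTF-16 needs to encode text.

    Hypothetical helper for illustration; it is not part of Kate or Konsole.
    """
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    return len(text.encode("utf-16-le")) // 2


emoji = "\U0001F60A"          # the emoji from the report, a single code point
line = emoji * 4001           # the 4001 pasted emojis

code_points = len(line)       # Python strings are indexed by code point
code_units = utf16_units(line)

print(code_points)            # the count Python reports
print(code_units)             # the count a UTF-16 based tool would report
```

U+1F60A lies outside the Basic Multilingual Plane, so UTF-16 encodes it as a surrogate pair of two code units, which matches the factor of two seen in the column indicator and in the paste dialog.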
Comment 1 Nate Graham 2023-02-06 22:59:07 UTC
These are going to end up being individual bugs in each app, not something more general. Arbitrarily using this one for Kate; please file another for Konsole. Thanks!
Comment 2 Nate Graham 2023-02-06 22:59:45 UTC
Can reproduce in Kate.
Comment 3 Waqar Ahmed 2023-03-29 18:46:01 UTC
Fixing it consistently throughout KTextEditor/Kate is considered out of scope/hard to do in all the places. See https://invent.kde.org/frameworks/ktexteditor/-/merge_requests/533 for reasoning.
Comment 4 Christoph Cullmann 2023-03-29 18:53:15 UTC
Just to have some reasoning here:

We use indices into UCS2 strings (that is, 16-bit code units) as columns everywhere.

Search matches are computed with these indices, the internal API uses them, and so does e.g. --column.

It would be a major effort to alter that and I don't see that it makes sense to spend our time on that.

Cursor movement while editing is correct; if it were not, that would be a real issue. But that the column offset is not as expected for rare characters is IMHO no big issue.
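For anyone who needs code point columns in their own scripts, a column expressed in 16-bit code units can be converted after the fact. This is only a sketch under the assumption that the reported column is a 0-based UTF-16 code unit offset into the line; the function name code_point_column is hypothetical and is not part of KTextEditor.

```python
def code_point_column(line: str, utf16_column: int) -> int:
    """Convert a 0-based UTF-16 code unit offset into a 0-based code point offset.

    Hypothetical helper for illustration. An offset that falls inside a
    surrogate pair is rounded up to the position after that character.
    """
    units = 0
    for index, ch in enumerate(line):
        if units >= utf16_column:
            return index
        # Code points above U+FFFF take two UTF-16 code units, all others one.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(line)


sample = "a\U0001F60Ab"       # 'a', the emoji from the report, 'b'
print(code_point_column(sample, 3))   # offset of 'b' counted in code points
```

Note that Kate's status bar shows 1-based columns, so a displayed value would need 1 subtracted before the conversion and 1 added afterwards.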