450219 – Trailing RLM characters are not displayed.

Status:	REPORTED

Alias:	None

Product:	konsole
Classification:	Applications
Component:	font
Version:	21.03.80
Platform:	Other Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Konsole Developer

URL:
Keywords:	rtl

Depends on:
Blocks:

Reported:	2022-02-14 14:10 UTC by Dotan Cohen
Modified:	2024-01-10 21:45 UTC
CC List:	3 users

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description Dotan Cohen 2022-02-14 14:10:16 UTC

BACKGROUND:
The RLM character is a non-printing character that has RTL (Right-to-Left) directionality:
https://www.fileformat.info/info/unicode/char/200f/index.htm

Here is the Hebrew word for peace, with a period AFTER the word. The period SHOULD be on the left side of the word, but because bugs.kde.org is an LTR (Left-to-Right) website it will erroneously appear on the right of the word:
שלום.

To resolve that, one places an RLM character AFTER the period:
שלום.‏

You can't see that RLM character after the period, but because it's there the period is properly shown to the left of the word.

STEPS TO REPRODUCE
1. Print text to Konsole with a trailing RLM character

OBSERVED RESULT
The RLM character is NOT displayed. It is a non-printing character so it cannot be seen, but its absence is noted by the period being on the right of the word.

EXPECTED RESULT
The RLM character should be displayed. Its presence would be noted by the period being on the left of the word.

SOFTWARE/OS VERSIONS
KDE Frameworks 5.68.0
Qt 5.12.8 (built against 5.12.8)

ADDITIONAL INFORMATION
Here we can see that the RLM character at the end is not affecting the display of the text. The period should be on the left. Echo is echoing two Hebrew characters, then a period, then the letter e. Then sed is replacing the e with the RLM:
$ echo "אב.e" | sed "s;e;$(echo -ne '\u200f');"
אב.

We can verify that the RLM is there with hd:
$ echo "אב.e" | sed "s;e;$(echo -ne '\u200f');" | hd
00000000 d7 90 d7 91 2e e2 80 8f 0a |.........|
00000009

The "e2 80 8f" bytes are the RLM, see the page linked above, which contains this text:
UTF-8 (hex): 0xE2 0x80 0x8F (e2808f)
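
In shells where echo -ne '\u200f' is not supported, roughly the same reproduction might be sketched with printf and octal escapes. This is only a sketch: the function names rlm_sample and plain_sample are hypothetical helpers that are not part of the original report, and the octal escapes are the UTF-8 bytes shown in the hd output above.

```shell
# Hypothetical helper, not from the original report: emit the Hebrew letters
# alef and bet, a period, and a trailing RLM (U+200F) as raw UTF-8 bytes,
# using octal escapes that POSIX printf understands.
rlm_sample() {
    printf '\327\220\327\221.\342\200\217\n'
}

# Hypothetical helper: the same text without the trailing RLM, for comparison.
plain_sample() {
    printf '\327\220\327\221.\n'
}

plain_sample                # text without the RLM
rlm_sample                  # text with the trailing RLM
rlm_sample | od -An -tx1    # dump the bytes to confirm e2 80 8f is present
```

Printing both lines one after the other makes it easier to see whether the terminal treats them differently. On a terminal that honors the trailing RLM, the period in the second line would be expected on the left of the word.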

Comment 1 ninjalj 2022-02-15 22:49:16 UTC

Current konsole strips RLM, among other General_Category=Other_Format (Cf) characters.

There is a pending merge request with a commit that changes this: https://invent.kde.org/utilities/konsole/-/merge_requests/567/diffs?commit_id=24216793f573192934f0d9e9d99ac312c5693cb6

Comment 2 Dotan Cohen 2022-02-16 06:18:18 UTC

Great, thanks. I wonder what problem that displayCharacter() method is intended to solve.

Comment 3 ninjalj 2022-02-16 19:35:36 UTC

Since you asked:
 
displayCharacter() assigns characters to character cells, and originally didn't support characters with no width (https://bugs.kde.org/show_bug.cgi?id=96536).  

Then, support was added for diacritic (Mark_NonSpacing) characters, by allowing a character cell to point to a sequence of characters instead of containing a single character (https://invent.kde.org/utilities/konsole/-/commit/c335324f31e946d4e3a0c63d1fbed8c114aea987).

Later, support was added for Hangul medial and terminal Jamo, which have the Letter_Other Unicode General_Category (https://invent.kde.org/utilities/konsole/-/commit/437440978bca1bd84e70ee61ba7974f63fe0630a).

The referenced commit in the pending merge request further adds support for zero-width Other_Format controls.

Comment 4 Dotan Cohen 2022-02-17 08:59:14 UTC

Thanks. I honestly think that all characters should be displayed. Every Unicode character and code point exists because somebody, somewhere, needs it.