Bug 96536

Summary: Unicode decomposed text gets garbled in Konsole (NFD mode)
Product: [Applications] konsole Reporter: Thiago Macieira <thiago>
Component: general Assignee: Konsole Developer <konsole-devel>
Status: RESOLVED FIXED    
Severity: minor CC: aacid, ach, albbas, chrislb, dahalaishraj, ellingsw+20759, jnelson-kde, kde, maarizwan, mfabian, ott, praveen, sieburgh, yann
Priority: NOR    
Version: 1.5   
Target Milestone: ---   
Platform: unspecified   
OS: Linux   
Latest Commit: Version Fixed In: 4.8.0
Sentry Crash Report:
Attachments: WorkInProgress patch
RB patch w/o const changes and whitespaces changes.

Description Thiago Macieira 2005-01-07 17:26:46 UTC
Version:           1.5 Beta (using KDE 3.3.91 (beta1), compiled sources)
Compiler:          gcc version 3.4.3
OS:                Linux (i686) release 2.6.9

There's an open bug report dealing with generic Unicode problems in Konsole: bug #74190. This bug here is about one specific problem.

There are two forms of diacritic characters in Unicode: the composed and the decomposed form. In the first one, there's a single codepoint assigned for a given letter+diacritic. In the second one, the combination is made of two "characters": the base one, without the diacritic, and one combining-diacritic character.

For instance, the LATIN SMALL LETTER A WITH ACUTE (á) is assigned U+00E1. That's the composed, or NFC, form. But it's also possible to generate the same glyph by combining LATIN SMALL LETTER A (a) with COMBINING ACUTE ACCENT: U+0061 U+0301: á (depending on your font, it may show as an "a" with a block above). That's the decomposed, or NFD, form.
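
The two forms can be inspected with Python's standard unicodedata module. This is only a minimal sketch; the variable names are chosen for illustration:

```python
import unicodedata

composed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE (NFC form)
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT (NFD form)

# The two strings render alike but are different code point sequences.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Show the code points that make up the decomposed form.
print([unicodedata.name(c) for c in decomposed])
```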

The problem is that Konsole turns some of those combinations from NFD to NFC:

$ echo á | od -tx1
0000000 61 cc 81 0a
$ touch á
$ ls
á

Now, if you copy & paste the listed value, here's what happens:
$ echo á | od -tx1
0000000 c3 a1 0a
$ ls á
ls: á: No such file or directory
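
The byte sequences printed by od above can be reproduced with a short Python sketch (without the trailing newline byte 0a that echo adds). Since file names on typical Linux filesystems are matched as raw bytes, the two encodings name different files:

```python
import unicodedata

nfd = "a\u0301"                          # what was originally typed
nfc = unicodedata.normalize("NFC", nfd)  # what comes back after copy & paste

nfd_bytes = nfd.encode("utf-8")
nfc_bytes = nfc.encode("utf-8")

# Print the UTF-8 bytes in the same style as `od -tx1`.
print(" ".join(f"{b:02x}" for b in nfd_bytes))  # 61 cc 81
print(" ".join(f"{b:02x}" for b in nfc_bytes))  # c3 a1

# Different bytes, so a byte-wise file name lookup does not match.
print(nfd_bytes == nfc_bytes)  # False
```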

For other characters, the combining modifier is simply discarded:
$ touch d́
$ ls
á   d́

copy & paste:
$ ls d
ls: d: No such file or directory
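
A likely reason for the difference between the two cases is that Unicode has no precomposed code point for d with acute, so NFC cannot reduce it to a single character. A small sketch to check this:

```python
import unicodedata

a_acute = "a\u0301"  # has a precomposed equivalent (U+00E1)
d_acute = "d\u0301"  # has no precomposed equivalent

a_nfc = unicodedata.normalize("NFC", a_acute)
d_nfc = unicodedata.normalize("NFC", d_acute)

print(len(a_nfc))  # 1: composed into a single code point
print(len(d_nfc))  # 2: stays as base letter + combining mark
```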

Just for fun, let's try adding another combining char: COMBINING ACUTE ACCENT BELOW (U+0317): á̗

(copy & paste from Konsole:)
$ ls
á  d

For comparison:
- Konqueror works fine. No glitches.
- xterm has glitches. NFD d́ doesn't get changed, but NFD á changes to á, whereas NFD á̗ changes into a mixed form (NFC á + combining)
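
The "mixed form" mentioned for xterm matches what NFC normalization itself produces for a base letter with two combining marks. A sketch using Python's unicodedata (this only illustrates the normalization result, not what either terminal does internally):

```python
import unicodedata

# a + COMBINING ACUTE ACCENT + COMBINING ACUTE ACCENT BELOW
s = "a\u0301\u0317"

nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", s)

# NFC composes a + U+0301 into U+00E1 and keeps U+0317 as a combining mark.
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+00E1', 'U+0317']

# NFD keeps everything decomposed, with the marks in canonical order.
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0061', 'U+0317', 'U+0301']
```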
Comment 1 Waldo Bastian 2005-01-07 18:23:43 UTC
Konsole's internal representation stores one character per screen position and has no room to store the NFD form. For that reason, it tries to convert to NFC before storing the characters. When you copy & paste, you get the stored characters.
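
To illustrate why one code point per screen position is not enough, the following sketch groups a string into display cells, attaching each combining mark to the preceding base character. It is not Konsole's code; split_into_cells is a hypothetical helper, and it ignores wide characters and marks whose combining class is 0:

```python
import unicodedata

def split_into_cells(text):
    """Group each base character with the combining marks that follow it."""
    cells = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if unicodedata.combining(ch) and cells:
            cells[-1] += ch
        else:
            cells.append(ch)
    return cells

cells = split_into_cells("a\u0301d\u0301")
# Two screen positions, each needing two code points.
print([len(cell) for cell in cells])  # [2, 2]
```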
Comment 2 Thiago Macieira 2005-01-07 19:17:34 UTC
Which is wrong, because it's the wrong representation. Also, it makes some glyphs unreadable, because it will discard some combinations when doing NFC.

It's also lacking in the sense that d + acute does not render as ḋ in Konsole.
Comment 3 Thiago Macieira 2005-02-13 03:44:08 UTC
Changing priority. I have a couple of ideas I may try in the future for KDE 4.
Comment 4 Thiago Macieira 2005-04-28 13:55:21 UTC
*** Bug 104691 has been marked as a duplicate of this bug. ***
Comment 5 Christoph Feck 2010-01-06 20:28:20 UTC
*** Bug 221508 has been marked as a duplicate of this bug. ***
Comment 6 Steven Elling 2010-01-13 08:22:29 UTC
Well it's the future and this problem doesn't appear to be fixed.

I'm running KDE 4.3.1 and decomposed characters with diacritical marks are only represented as the base character.

For example, in my case ë (that's the decomposed form of e with an umlaut) is just displayed as e.
Comment 7 Jon Nelson 2010-06-17 23:26:08 UTC
Please fix!  gnome-terminal and xterm both display NFD unicode properly.
Due to konsole's inability to display NFD characters I chased a particular bug around in other software for HOURS.

Here is an easy way to test (needs python):

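#!/usr/bin/env python3
import unicodedata

# "Hämikon" with the umlaut written as a + COMBINING DIAERESIS (U+0308)
u = 'Ha\u0308mikon'
u1 = unicodedata.normalize('NFC', u)
u2 = unicodedata.normalize('NFD', u)
u3 = unicodedata.normalize('NFKD', u)
u4 = unicodedata.normalize('NFKC', u)
# All four forms should look identical when displayed
print(u1, u2, u3, u4)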

The above should print what appears to be the *same* word 4 times.
konsole is the only console tested that does not work.


I'm running KDE 4.4.3.
Comment 8 Jon Nelson 2010-08-12 04:58:42 UTC
Now on KDE 4.5.0

This is not a "NEW" bug.
It's been around for *5 years* and konsole *still* can't display unicode characters properly!

Is anybody working on a fix for this?

I certainly wouldn't consider this a minor issue.
Comment 9 Albert Astals Cid 2011-05-15 21:37:12 UTC
I'm having a look, but I'm a complete newbie in the Konsole codebase, so I can't promise anything.
Comment 10 Albert Astals Cid 2011-05-17 00:31:02 UTC
Created attachment 60067 [details]
WorkInProgress patch

This is a work-in-progress patch. With it I can show files that contain e + combining ring. If anyone is bored and gives it a try, I'd be happy to hear about the experience.
Comment 11 Albert Astals Cid 2011-06-22 00:47:51 UTC
Full working patch (as far as my testing goes) at https://git.reviewboard.kde.org/r/101721/

I would really like people to test this since, as far as I know, it works perfectly.
Comment 12 Kurt Hindenburg 2011-06-25 21:29:47 UTC
Created attachment 61327 [details]
RB patch w/o const changes and whitespaces changes.

I just removed the const and whitespace changes to get a cleaner diff.

This patch is big and I'm not familiar w/ all the code.

It does appear to fix the given issue.
Comment 13 Albert Astals Cid 2011-06-25 22:31:14 UTC
Kurt, I'm pretty confident about the code (though it has a limitation of only being able to show 65534 different composed characters at a time (a reasonable limitation if you ask me, and a sure improvement over not working :D))
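
One way a limit like this can arise is when each screen cell holds a small fixed-width index into a table of code point sequences. The sketch below is purely an assumption for illustration and is not taken from the patch; SequenceTable, intern and lookup are hypothetical names:

```python
class SequenceTable:
    """Hypothetical table mapping combining sequences to small integer ids."""

    MAX_ENTRIES = 65534  # the limit quoted in the comment above

    def __init__(self):
        self._ids = {}    # sequence -> id
        self._seqs = []   # position (id - 1) -> sequence

    def intern(self, sequence):
        """Return the id for a sequence, allocating a new one if needed."""
        if sequence in self._ids:
            return self._ids[sequence]
        if len(self._seqs) >= self.MAX_ENTRIES:
            raise OverflowError("sequence table is full")
        self._seqs.append(sequence)
        self._ids[sequence] = len(self._seqs)  # ids start at 1
        return self._ids[sequence]

    def lookup(self, ident):
        """Return the sequence stored under an id."""
        return self._seqs[ident - 1]

table = SequenceTable()
cell_value = table.intern("e\u030a")  # e + COMBINING RING ABOVE
print(table.lookup(cell_value) == "e\u030a")  # True
```

In such a design the cell itself stays fixed-size, and copy & paste can recover the original sequence through the lookup.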

Since it is quite a "big-ish" patch and you don't seem comfortable with the code, I propose we commit it to master (that will be KDE 4.8) so we have time to fix stuff if it breaks before the next release. What do you say?
Comment 14 Kurt Hindenburg 2011-06-25 23:50:20 UTC
Yes, I should have mentioned that I had planned to commit to master so people could test it.
Comment 15 Albert Astals Cid 2011-06-26 00:29:48 UTC
Any reason you want to do it yourself? I mean, it makes more sense if I do it, so people can correctly find who to blame from the log :D
Comment 16 Kurt Hindenburg 2011-06-26 14:22:11 UTC
Go ahead although I'd prefer if you'd split/pdh the patch into the const, whitespace and the real patch.
Comment 17 Albert Astals Cid 2011-06-26 15:49:59 UTC
Committed to master
Comment 18 Kurt Hindenburg 2011-07-20 02:55:15 UTC
*** Bug 255862 has been marked as a duplicate of this bug. ***
Comment 19 Jekyll Wu 2011-08-02 12:51:17 UTC
*** Bug 276301 has been marked as a duplicate of this bug. ***
Comment 20 Christoph Feck 2011-08-12 18:15:47 UTC
*** Bug 279978 has been marked as a duplicate of this bug. ***
Comment 21 Jekyll Wu 2011-08-16 05:05:37 UTC
*** Bug 217684 has been marked as a duplicate of this bug. ***
Comment 22 Jekyll Wu 2011-08-16 05:16:52 UTC
*** Bug 226024 has been marked as a duplicate of this bug. ***
Comment 23 Jekyll Wu 2011-08-16 05:31:53 UTC
*** Bug 149777 has been marked as a duplicate of this bug. ***
Comment 24 Jekyll Wu 2011-08-16 07:43:54 UTC
*** Bug 116251 has been marked as a duplicate of this bug. ***
Comment 25 Jekyll Wu 2011-08-16 07:50:02 UTC
*** Bug 156071 has been marked as a duplicate of this bug. ***
Comment 26 Kurt Hindenburg 2011-10-16 14:58:37 UTC
Albert, can you double-check that this patch causes a 'cat tests/9x15.repertoire-utf8' to crash konsole?
Comment 27 Albert Astals Cid 2011-10-16 16:53:37 UTC
Fixed that crash. I had an incorrect assumption.
Comment 28 Steven Elling 2012-11-20 06:15:43 UTC
Removed my votes for this bug as I quit using KDE a while ago due to its bloat.
Comment 29 Børre Gaup 2020-11-23 10:38:34 UTC
This problem still exists in Konsole 20.08.3.

Running the python program below printing the string Hämikon illustrates the problem.
Comment 30 Albert Astals Cid 2020-11-23 19:00:53 UTC
The program below? where?
Comment 31 Børre Gaup 2020-11-24 18:22:27 UTC
(In reply to Albert Astals Cid from comment #30)
> The program below? where?

Sorry for being unclear, I'm talking about the script in comment 7.
Comment 32 Albert Astals Cid 2020-11-24 22:54:19 UTC
Looks good to me https://i.imgur.com/ZQZU3xm.png

Anything i'm missing?
Comment 33 Børre Gaup 2020-11-25 11:46:55 UTC
The first is Konsole, the second is gnome-terminal. xterm and uxterm have the same result as gnome-terminal
https://imgur.com/a/6SxxG3j

System info:
Operating System: KDE neon 5.20
KDE Plasma Version: 5.20.3
KDE Frameworks Version: 5.76.0
Qt Version: 5.15.1
Kernel Version: 5.4.0-54-generic
OS Type: 64-bit
Processors: 4 × Intel® Core™ i7-5600U CPU @ 2.60GHz
Memory: 15.5 GiB of RAM
Graphics Processor: Mesa Intel® HD Graphics 5500
Comment 34 Børre Gaup 2020-11-25 12:30:35 UTC
Locale and such on my machine:
❯ echo $LC_ALL $LANGUAGE $LANG
se_NO.UTF-8 se:nb:en_US nb_NO.UTF-8

Running the script this way:
LC_ALL=C python3 bla.py
gave the same result.
Comment 35 Albert Astals Cid 2020-11-25 17:50:05 UTC
I don't see the problem.

Or is the problem "the font I am using in Konsole is different from the one I'm using in gnome-terminal and the ä looks weird"?
Comment 36 Børre Gaup 2020-11-26 10:42:05 UTC
It has to do with fonts, yes. 

I found a font without the problem in Konsole, and tested with the same font in gnome-terminal, both show the expected result.

I use Hack in konsole, so I tested Hack on gnome-terminal. gnome-terminal is not affected, while konsole is.

gnome-terminal on the left, konsole on the right.

https://imgur.com/a/ovQA7z2
Comment 37 Albert Astals Cid 2020-11-26 22:58:25 UTC
I'm going to say there's a bug either in Hack or in Qt. There's not much we can do if it works fine with other fonts.