Bug 396435

Summary: emoji rendering broken
Product: [Applications] konsole Reporter: Martin Sandsmark <martin.sandsmark>
Component: fontAssignee: Mariusz Glebocki <mglb>
Status: RESOLVED FIXED    
Severity: normal CC: achilleas.k, mglb, nate, textshell
Priority: NOR    
Version: master   
Target Milestone: ---   
Platform: Other   
OS: Linux   
Latest Commit: Version Fixed In: 18.12
Attachments: Diff in width calculation between xterm-335, glibc-2.28, vte-glib-2.59.0 and konsole-master
Diff in width calculation between vte-glib-2.59.0 and konsole-master

Description Martin Sandsmark 2018-07-12 11:32:43 UTC
5128781a824c26dc2746650fea0ae9f95861b9d8 broke emoji rendering, possibly rendering of more double width characters (but easiest to test with emojis).

reverting it fixes the rendering.
Comment 1 Kurt Hindenburg 2018-07-12 12:53:50 UTC
Thanks, the emojis I tested with shows only left half of emoji.

Mariusz, can you work on this before 18.08 is release later this month?
Comment 2 Mariusz Glebocki 2018-07-12 22:09:05 UTC
(In reply to Kurt Hindenburg from comment #1)
> Thanks, the emojis I tested with shows only left half of emoji.
> 
> Mariusz, can you work on this before 18.08 is release later this month?

Sure
Comment 3 Martin Sandsmark 2018-07-16 15:03:13 UTC
it's our wcwidth implementation that isn't updated for using uint32_t stuff. but now that we're using that, we can just use the platform wcwidth(), at least on linux (just putting «return wcwidth(ucs);» at the top of konsole_wcwidth() fixes this for me at least).
Comment 4 Martin Sandsmark 2018-07-16 15:06:01 UTC
this is with a modern glibc on linux, btw, should probably test on other platforms, konsole_wcwidth() might be better.
Comment 5 Mariusz Glebocki 2018-07-17 15:51:40 UTC
There was an attempt to use wcwidth() https://phabricator.kde.org/D12238

I'll create review request this week with UCD files to width tables converter (for future table updates), Unicode 10 + UTR #51 v5.0 based table, and function to handle it.
Comment 6 Martin Sandsmark 2018-07-21 11:19:40 UTC
yeah, I tried it myself earlier (using the system wcwidth), but it didn't go further than a discussion on IRC. I don't remember the exact reason I abandoned it, probably because wcwidth() varies so much between systems. what could work would be to fall back to konsole_wcwidth on platforms where it is completely broken, but that needs extensive testing.

I also started on the same as you, but for upstream Qt, but I'm not sure if I have the code anymore. basically I wanted to add a system-independent wcwidth to qlocale, as well as update to unicode 10 in Qt.

for reference, the Qt code for parsing the tables is here: https://code.woboq.org/qt5/qtbase/util/unicode/main.cpp.html
Comment 7 Martin Sandsmark 2018-07-21 11:42:17 UTC
a temporary fix: https://phabricator.kde.org/D14262

would be appreciated if someone with an older glibc (that 2.27) or non-linux system could test this;
Comment 8 Kurt Hindenburg 2018-07-25 13:11:29 UTC
https://phabricator.kde.org/D12238 failed on FreeBSD; Martin, your patch tests also fails on FreeBSD ( actual -1, expected 2).
Comment 9 Kurt Hindenburg 2018-07-26 13:02:10 UTC
RC is in a week;  I'll look at reverting for 18.08 over the weekend.
Comment 10 Kurt Hindenburg 2018-07-28 00:15:55 UTC
Git commit 9941ac6a5d6fb1c6d78121d48c6c0da1d2dfad73 by Kurt Hindenburg.
Committed on 27/07/2018 at 12:34.
Pushed by hindenburg into branch 'Applications/18.08'.

Revert "Change internal character type size to 32 bit"

This reverts commit 5128781a824c26dc2746650fea0ae9f95861b9d8.

M  +3    -3    src/Character.h
M  +4    -4    src/Emulation.cpp
M  +1    -1    src/Emulation.h
M  +16   -16   src/ExtendedCharTable.cpp
M  +8    -8    src/ExtendedCharTable.h
M  +1    -1    src/History.cpp
M  +1    -1    src/History.h
M  +10   -10   src/Screen.cpp
M  +4    -4    src/Screen.h
M  +5    -5    src/TerminalCharacterDecoder.cpp
M  +19   -21   src/TerminalDisplay.cpp
M  +2    -2    src/Vt102Emulation.cpp
M  +2    -2    src/Vt102Emulation.h
M  +24   -24   src/autotests/CharacterWidthTest.cpp
M  +7    -7    src/konsole_wcwidth.cpp
M  +1    -1    src/konsole_wcwidth.h

https://commits.kde.org/konsole/9941ac6a5d6fb1c6d78121d48c6c0da1d2dfad73
Comment 11 Kurt Hindenburg 2018-09-05 14:40:18 UTC
Mariusz, are you still working on this?  Any ETA?
Comment 12 Mariusz Glebocki 2018-09-07 15:45:36 UTC
I can finish it on the weekend, and publish next week.
Sorry for huge delay.
Comment 13 Kurt Hindenburg 2018-09-27 02:46:27 UTC
Git commit 0f33ee504bc2e7cc8e466093131a15117f3d3460 by Kurt Hindenburg, on behalf of Mariusz Glebocki.
Committed on 27/09/2018 at 02:46.
Pushed by hindenburg into branch 'master'.

Move character width functions to Character class

Summary:
Replaces direct konsole_wcwidth() calls with Character object property
and the class static function, which internally still use
konsole_wcwidth().

The change also prepares the codebase for a new code for determining
character width. This way the change will barely touch existing code.

Test Plan:
konsole_wcwidth() is still used, just in one central place. Passed
compilation and CharacterWidthTest should be enough.

Reviewers: #konsole, hindenburg

Reviewed By: #konsole, hindenburg

Subscribers: hindenburg, konsole-devel

Tags: #konsole

Differential Revision: https://phabricator.kde.org/D15756

M  +25   -0    src/Character.h
M  +0    -1    src/Filter.cpp
M  +1    -2    src/Screen.cpp
M  +1    -2    src/TerminalCharacterDecoder.cpp
M  +1    -2    src/TerminalDisplay.cpp
M  +2    -2    src/autotests/CharacterWidthTest.cpp

https://commits.kde.org/konsole/0f33ee504bc2e7cc8e466093131a15117f3d3460
Comment 14 Kurt Hindenburg 2018-09-30 16:22:42 UTC
Git commit 5f32cb3c44f2b1cb3faa2efd2371da4af80dce2e by Kurt Hindenburg, on behalf of Mariusz Glebocki.
Committed on 30/09/2018 at 16:22.
Pushed by hindenburg into branch 'master'.

Add a tool for generating character width tables

Summary:
The uni2characterwidth tool, converts Unicode Character Database files
into character width lookup tables. It uses a template file to place
the tables in a source code file together with a function for finding
the width for specified character. It also allows to generate few forms
of lists with width data for debug and test purposes, or for future use
as a replacement of Unicode files.

Set `KONSOLE_BUILD_UNI2CHARACTERWIDTH` cmake flag to build the tool.
Use `--help` argument for more detailed usage.

There is a possibility to generate separate "width" for Ambiguous
characters. It can be used to add ability to configure the characters
width in Konsole settings.

The `example.template` file contains all possible named tags, and some
additional tags to show how to use them.

Depends on D15756

Test Plan:
Download files listed below from `11.0.0` and `emoji/11.0` directories
on `https://unicode.org/Public/`. You can also directly use URLs to the
files.

* UnicodeData.txt
* EastAsianWidth.txt
* emoji-data.txt

Generate any available list except compact-ranges (e.g. `details`):

```
uni2characterwidth \
    -U UnicodeData.txt  -A EastAsianWidth.txt  -E emoji-data.txt \
    -g details  result.txt
```

The list should contain ranges for all possible widths
(-2, -1, 0, 1, 2). You can choose some characters with a width you know
and check how they were classified. -2 is a special non-standard width
for ambiguous characters, which can be overriden by adding `-a 1` or
`-a 2` parameter. With this flag, all ranges from -2 group should
disappear and become assigned to selected width (1 or 2).

Generate output using a template:

```
uni2characterwidth \
    -U UnicodeData.txt  -A EastAsianWidth.txt  -E emoji-data.txt \
    -g code,./template.example  result.txt
```

Reviewers: #konsole, hindenburg

Reviewed By: #konsole, hindenburg

Subscribers: hindenburg, konsole-devel

Tags: #konsole

Differential Revision: https://phabricator.kde.org/D15757

M  +2    -1    src/CMakeLists.txt
M  +1    -0    tools/CMakeLists.txt
A  +30   -0    tools/uni2characterwidth/CMakeLists.txt
A  +78   -0    tools/uni2characterwidth/properties.h     [License: UNKNOWN]  *
A  +404  -0    tools/uni2characterwidth/template.cpp     [License: GPL (v2+)]
A  +77   -0    tools/uni2characterwidth/template.example
A  +184  -0    tools/uni2characterwidth/template.h     [License: GPL (v2+)]
A  +1011 -0    tools/uni2characterwidth/uni2characterwidth.cpp     [License: GENERATED FILE]  *

The files marked with a * at the end have a non valid license. Please read: https://community.kde.org/Policies/Licensing_Policy and use the headers which are listed at that page.


https://commits.kde.org/konsole/5f32cb3c44f2b1cb3faa2efd2371da4af80dce2e
Comment 15 Martin Hostettler 2018-10-03 12:45:49 UTC
Could you upload the resulting generated table somewhere?

I'm currently working on documenting the used width status of different terminal implementations and would be interested in comparing this to the others in my data set. I assume having a comparison would be useful for (possible) further development too.

My current dataset includes glibc-2.24 (debian stable), glibc-2.28 (glibc latest release), konsole-v18.03.80, linux-4.17, unifont-11.0.02 (actual width of glyphs), xterm (latest) and vte (latest).
Comment 16 Kurt Hindenburg 2018-10-03 15:11:28 UTC
Git commit e74cf6c36642247f3f79194da373d01a00645d36 by Kurt Hindenburg, on behalf of Mariusz Glebocki.
Committed on 03/10/2018 at 15:11.
Pushed by hindenburg into branch 'master'.

Use new character width code based on Unicode 11

Summary:
Adds a code for getting character width togeter with LUTs generated
using uni2characterwidth from Unicode 11 lists.

Skin tone, flags, gender, and other emoji with and modifer are not
joined (you will see e.g. a skin tone square + generic yellow emoji).
I think joining them would cause problems in most editors, command line
prompts, and other programs which use character width data, as the
characters would behave as combining or emoji depending on context (like
ligatures).

Examples:
* light thumb up: 👍🏻
* dark thumb up:  👍🏿
* Polish flag:    🇵🇱

This behavior is allowed:
* https://unicode.org/reports/tr51/#Emoji_Modifiers_Display
* https://unicode.org/reports/tr51/#Emoji_ZWJ_Sequences

It is possible to add support for sequences, but those would work
only for a string width functions.

Some characters which can be presented as emoji are narrow (e.g. ✖️, ©️).
Those characters are listed without "presentation" mode, which means
they should be rendered as text by default (real presentation depends on
renderer and/or font). Noto Sans Color Emoji renders them as wide,
DejaVu Sans as narrow. Vim, bash and zsh treat them as narrow, so I made
them narrow.

https://unicode.org/reports/tr51/#Presentation_Style
Related: bug 378124, bug 392171, bug 339439

FIXED-IN: 18.12

Depends on D15757

Test Plan:
* Look at emoji_test.txt - emojis should look "normal" (two characters
width).
* Look at GLASS.txt - characters width should look correct.
* CharacterWidthTest should pass.
* perl -XCSDL -e 'print map{chr($_), " "} 1..0xffff'

Reviewers: #konsole, #vdg, hindenburg

Reviewed By: #konsole, hindenburg

Subscribers: hindenburg, broulik, ngraham, konsole-devel

Tags: #konsole

Differential Revision: https://phabricator.kde.org/D15758

D  +0    -64   COPYING.Unicode
M  +1    -1    src/CMakeLists.txt
M  +2    -2    src/Character.h
A  +159  -0    src/CharacterWidth.cpp     [License: GENERATED FILE]  *
A  +8    -0    src/CharacterWidth.h     [License: UNKNOWN]  *
A  +102  -0    src/CharacterWidth.src.cpp     [License: GPL (v2+)]
M  +1    -1    src/Filter.cpp
M  +1    -1    src/TerminalCharacterDecoder.cpp
M  +1    -1    src/TerminalDisplay.cpp
M  +6    -2    src/autotests/CharacterWidthTest.cpp
D  +0    -238  src/konsole_wcwidth.cpp
D  +0    -16   src/konsole_wcwidth.h
A  +3    -0    tools/uni2characterwidth/overrides.txt

The files marked with a * at the end have a non valid license. Please read: https://community.kde.org/Policies/Licensing_Policy and use the headers which are listed at that page.


https://commits.kde.org/konsole/e74cf6c36642247f3f79194da373d01a00645d36
Comment 17 Martin Hostettler 2018-10-03 15:52:05 UTC
Created attachment 115385 [details]
Diff in width calculation between xterm-335, glibc-2.28, vte-glib-2.59.0 and konsole-master

xterm does runtime detection which can lead to the libc wcwidth beeing used or the internal one.
Comment 18 Martin Hostettler 2018-10-03 15:53:34 UTC
Created attachment 115386 [details]
Diff in width calculation between vte-glib-2.59.0 and konsole-master
Comment 19 Martin Hostettler 2018-10-03 15:57:59 UTC
This seems to be quite close to what vte/gnome-terminal uses. But not quite the same.

vte combines hangul clusters so it's correct for them. It seems konsole does so too so maybe it should also list them as combining like vte?
Comment 20 Mariusz Glebocki 2018-10-03 17:25:52 UTC
(In reply to Martin Hostettler from comment #19)
> This seems to be quite close to what vte/gnome-terminal uses. But not quite
> the same.

Surrogates technically should be invalid; they are intended to be combined in pairs (high+low) when using UTF16 to form a valid UTF32 code point. Decoder does the conversion, so valid pairs are passed as another code point.

I made regional indicator letters wide manually because of vim, and width in noto sans color emoji and emoji one. glibc 2.27 treat them as narrow, so I'm not sure which width should we choose.

> vte combines hangul clusters so it's correct for them. It seems konsole does
> so too so maybe it should also list them as combining like vte?

I was thinking about this, also with emoji skin tones. Vim and glibc 2.27 treat them as separate and narrow, so I opted for compatibility rather than good look. I don't know Korean nor its input methods, but I read they are almost never used in normal text, as everything is already present in composed form as separate code points. I think it should not cause problems in real world use cases.
Comment 21 Martin Hostettler 2018-10-04 19:23:59 UTC
About the Surrogates: You are right. I modified my script to ignore those like C0 and C1. They should never reach this code.

About the hangul: Konsole before this change did display NFD in a way that looked correct to my untrained eye. After this change decomposed forms are displayed as 1 wide and 2 narrow characters. At least OS X uses NFD for file names, so i would expect that these decomposed forms will end up in places where konsole will encounter them.

For the regional indicator symbols: Those are a whole new bag of fun. I wonder how other terminals handle emoji joining...

glibc and glib (vte) consider them narrow. Unifont and Noto consider them wide. And what users actually want to see is two of them joined together to be a flag(supported in Noto seems to be 2 cells wide, not supported in unifont). Konsole see two of them as 2 wide characters (i.e. 4 cells). Interesting enough kitty displays a flag which is 4 cells wide. Not sure how expected that is. 

I wonder how much sense it makes to deviate from what glibc says when the actual display is really not that great anyway (for flags / regional indicator symbols).

I'm worried about proliferation of lots of slightly different width mapping functions that applications *somehow* have to know about.

tangent:
I think in the long run we need some way for applications to download the actually used width mapping from the terminal to get decent interoperability across versions and different machines in a network. I just notices that emoji joining is yet another aspect that needs to be kept in sync between terminal and the application.
Comment 22 Mariusz Glebocki 2018-10-05 20:40:10 UTC
(In reply to Martin Hostettler from comment #21)
> About the hangul: Konsole before this change did display NFD in a way that
> looked correct to my untrained eye. After this change decomposed forms are
> displayed as 1 wide and 2 narrow characters. At least OS X uses NFD for file
> names, so i would expect that these decomposed forms will end up in places
> where konsole will encounter them.

Wow, I didn't know that. I'll send a patch which will change the widths to allow characters to compose.


> For the regional indicator symbols: Those are a whole new bag of fun. I
> wonder how other terminals handle emoji joining...
> 
> glibc and glib (vte) consider them narrow. Unifont and Noto consider them
> wide. And what users actually want to see is two of them joined together to
> be a flag(supported in Noto seems to be 2 cells wide, not supported in
> unifont). Konsole see two of them as 2 wide characters (i.e. 4 cells).
> Interesting enough kitty displays a flag which is 4 cells wide. Not sure how
> expected that is. 
> I wonder how much sense it makes to deviate from what glibc says when the
> actual display is really not that great anyway (for flags / regional
> indicator symbols).

Nice idea in kitty. Looks as intended, but is still width-compatible (assuming each indicator is wide). Since glib(c) assume the indicators are narrow, it could be possible to display one wide flag and keep compatibility with wcwidth-using software (i.e. everything in ncurses, shells). It would work on FreeBSD too (wcwidth works only for 16 bit code points, higher CPs are narrow). Not sure how it would work on macOS.


> I'm worried about proliferation of lots of slightly different width mapping
> functions that applications *somehow* have to know about.
> 
> tangent:
> I think in the long run we need some way for applications to download the
> actually used width mapping from the terminal to get decent interoperability
> across versions and different machines in a network. I just notices that
> emoji joining is yet another aspect that needs to be kept in sync between
> terminal and the application.

There is already POSIX solution which could solve that - wcwidth() and wcswidth(). The latter could even calculate widths of joined emojis. Sadly, it works differently on every platform. And there are separate implementations in Go, Python, etc.
This is why terminals and console programs sometimes try to make its own table which works the same (or at least is up to date) on every platform.

Or, if someone likes hacks, LD_PRELOAD=libkonsole_wcwidth.so ;)
Comment 23 Martin Hostettler 2018-10-05 21:13:45 UTC
(In reply to Mariusz Glebocki from comment #22)
> (In reply to Martin Hostettler from comment #21)
> > tangent:
> > I think in the long run we need some way for applications to download the
> > actually used width mapping from the terminal to get decent interoperability
> > across versions and different machines in a network. I just notices that
> > emoji joining is yet another aspect that needs to be kept in sync between
> > terminal and the application.
> 
> There is already POSIX solution which could solve that - wcwidth() and
> wcswidth(). The latter could even calculate widths of joined emojis. Sadly,
> it works differently on every platform. And there are separate
> implementations in Go, Python, etc.
> This is why terminals and console programs sometimes try to make its own
> table which works the same (or at least is up to date) on every platform.
> 
> Or, if someone likes hacks, LD_PRELOAD=libkonsole_wcwidth.so ;)

Well if that would be the solution konsole would just use it, right?

But:
* There's the private use area. Only a component with access to the font can know what would look right for these characters. (Or the user could configure that, i'm sure most users don't want to do that manually)
* width mapping is a moving target as long as unicode evolves. With the current release cycle even a fast moving target.
* Rendering breaks badly when values in the terminal and the client application don't match. (that happens to every terminal with a up to date table)
* the usage of ssh with terminals means that connections will span systems with differently outdated libraries.
* LD_PRELOAD hacks are not really a sustainable solution.

For me that sounds a lot like there needs to be a single point of truth in every  terminal session. As long as things move faster than installations update and there is no standard all stakeholders can agree on (and UAX #11 explicitly disclaims being for "modern terminal emulators") there needs *some* way for applications to get this information. From the terminals i looked at only urxvt really buys in to "trust the system libc". And konsole, vte and xterm (i have to assume) intentionally diverge from what libc does. I think from the perspective of a client library the equivalent of roughly one full screen update with a decently sized terminal in data transfer would be an quite attractive trade of ;)