5128781a824c26dc2746650fea0ae9f95861b9d8 broke emoji rendering, possibly rendering of more double width characters (but easiest to test with emojis). reverting it fixes the rendering.
Thanks, the emojis I tested with shows only left half of emoji. Mariusz, can you work on this before 18.08 is release later this month?
(In reply to Kurt Hindenburg from comment #1) > Thanks, the emojis I tested with shows only left half of emoji. > > Mariusz, can you work on this before 18.08 is release later this month? Sure
it's our wcwidth implementation that isn't updated for using uint32_t stuff. but now that we're using that, we can just use the platform wcwidth(), at least on linux (just putting «return wcwidth(ucs);» at the top of konsole_wcwidth() fixes this for me at least).
this is with a modern glibc on linux, btw, should probably test on other platforms, konsole_wcwidth() might be better.
There was an attempt to use wcwidth() https://phabricator.kde.org/D12238 I'll create review request this week with UCD files to width tables converter (for future table updates), Unicode 10 + UTR #51 v5.0 based table, and function to handle it.
yeah, I tried it myself earlier (using the system wcwidth), but it didn't go further than a discussion on IRC. I don't remember the exact reason I abandoned it, probably because wcwidth() varies so much between systems. what could work would be to fall back to konsole_wcwidth on platforms where it is completely broken, but that needs extensive testing. I also started on the same as you, but for upstream Qt, but I'm not sure if I have the code anymore. basically I wanted to add a system-independent wcwidth to qlocale, as well as update to unicode 10 in Qt. for reference, the Qt code for parsing the tables is here: https://code.woboq.org/qt5/qtbase/util/unicode/main.cpp.html
a temporary fix: https://phabricator.kde.org/D14262 would be appreciated if someone with an older glibc (that 2.27) or non-linux system could test this;
https://phabricator.kde.org/D12238 failed on FreeBSD; Martin, your patch tests also fails on FreeBSD ( actual -1, expected 2).
RC is in a week; I'll look at reverting for 18.08 over the weekend.
Git commit 9941ac6a5d6fb1c6d78121d48c6c0da1d2dfad73 by Kurt Hindenburg. Committed on 27/07/2018 at 12:34. Pushed by hindenburg into branch 'Applications/18.08'. Revert "Change internal character type size to 32 bit" This reverts commit 5128781a824c26dc2746650fea0ae9f95861b9d8. M +3 -3 src/Character.h M +4 -4 src/Emulation.cpp M +1 -1 src/Emulation.h M +16 -16 src/ExtendedCharTable.cpp M +8 -8 src/ExtendedCharTable.h M +1 -1 src/History.cpp M +1 -1 src/History.h M +10 -10 src/Screen.cpp M +4 -4 src/Screen.h M +5 -5 src/TerminalCharacterDecoder.cpp M +19 -21 src/TerminalDisplay.cpp M +2 -2 src/Vt102Emulation.cpp M +2 -2 src/Vt102Emulation.h M +24 -24 src/autotests/CharacterWidthTest.cpp M +7 -7 src/konsole_wcwidth.cpp M +1 -1 src/konsole_wcwidth.h https://commits.kde.org/konsole/9941ac6a5d6fb1c6d78121d48c6c0da1d2dfad73
Mariusz, are you still working on this? Any ETA?
I can finish it on the weekend, and publish next week. Sorry for huge delay.
Git commit 0f33ee504bc2e7cc8e466093131a15117f3d3460 by Kurt Hindenburg, on behalf of Mariusz Glebocki. Committed on 27/09/2018 at 02:46. Pushed by hindenburg into branch 'master'. Move character width functions to Character class Summary: Replaces direct konsole_wcwidth() calls with Character object property and the class static function, which internally still use konsole_wcwidth(). The change also prepares the codebase for a new code for determining character width. This way the change will barely touch existing code. Test Plan: konsole_wcwidth() is still used, just in one central place. Passed compilation and CharacterWidthTest should be enough. Reviewers: #konsole, hindenburg Reviewed By: #konsole, hindenburg Subscribers: hindenburg, konsole-devel Tags: #konsole Differential Revision: https://phabricator.kde.org/D15756 M +25 -0 src/Character.h M +0 -1 src/Filter.cpp M +1 -2 src/Screen.cpp M +1 -2 src/TerminalCharacterDecoder.cpp M +1 -2 src/TerminalDisplay.cpp M +2 -2 src/autotests/CharacterWidthTest.cpp https://commits.kde.org/konsole/0f33ee504bc2e7cc8e466093131a15117f3d3460
Git commit 5f32cb3c44f2b1cb3faa2efd2371da4af80dce2e by Kurt Hindenburg, on behalf of Mariusz Glebocki. Committed on 30/09/2018 at 16:22. Pushed by hindenburg into branch 'master'. Add a tool for generating character width tables Summary: The uni2characterwidth tool, converts Unicode Character Database files into character width lookup tables. It uses a template file to place the tables in a source code file together with a function for finding the width for specified character. It also allows to generate few forms of lists with width data for debug and test purposes, or for future use as a replacement of Unicode files. Set `KONSOLE_BUILD_UNI2CHARACTERWIDTH` cmake flag to build the tool. Use `--help` argument for more detailed usage. There is a possibility to generate separate "width" for Ambiguous characters. It can be used to add ability to configure the characters width in Konsole settings. The `example.template` file contains all possible named tags, and some additional tags to show how to use them. Depends on D15756 Test Plan: Download files listed below from `11.0.0` and `emoji/11.0` directories on `https://unicode.org/Public/`. You can also directly use URLs to the files. * UnicodeData.txt * EastAsianWidth.txt * emoji-data.txt Generate any available list except compact-ranges (e.g. `details`): ``` uni2characterwidth \ -U UnicodeData.txt -A EastAsianWidth.txt -E emoji-data.txt \ -g details result.txt ``` The list should contain ranges for all possible widths (-2, -1, 0, 1, 2). You can choose some characters with a width you know and check how they were classified. -2 is a special non-standard width for ambiguous characters, which can be overriden by adding `-a 1` or `-a 2` parameter. With this flag, all ranges from -2 group should disappear and become assigned to selected width (1 or 2). Generate output using a template: ``` uni2characterwidth \ -U UnicodeData.txt -A EastAsianWidth.txt -E emoji-data.txt \ -g code,./template.example result.txt ``` Reviewers: #konsole, hindenburg Reviewed By: #konsole, hindenburg Subscribers: hindenburg, konsole-devel Tags: #konsole Differential Revision: https://phabricator.kde.org/D15757 M +2 -1 src/CMakeLists.txt M +1 -0 tools/CMakeLists.txt A +30 -0 tools/uni2characterwidth/CMakeLists.txt A +78 -0 tools/uni2characterwidth/properties.h [License: UNKNOWN] * A +404 -0 tools/uni2characterwidth/template.cpp [License: GPL (v2+)] A +77 -0 tools/uni2characterwidth/template.example A +184 -0 tools/uni2characterwidth/template.h [License: GPL (v2+)] A +1011 -0 tools/uni2characterwidth/uni2characterwidth.cpp [License: GENERATED FILE] * The files marked with a * at the end have a non valid license. Please read: https://community.kde.org/Policies/Licensing_Policy and use the headers which are listed at that page. https://commits.kde.org/konsole/5f32cb3c44f2b1cb3faa2efd2371da4af80dce2e
Could you upload the resulting generated table somewhere? I'm currently working on documenting the used width status of different terminal implementations and would be interested in comparing this to the others in my data set. I assume having a comparison would be useful for (possible) further development too. My current dataset includes glibc-2.24 (debian stable), glibc-2.28 (glibc latest release), konsole-v18.03.80, linux-4.17, unifont-11.0.02 (actual width of glyphs), xterm (latest) and vte (latest).
Git commit e74cf6c36642247f3f79194da373d01a00645d36 by Kurt Hindenburg, on behalf of Mariusz Glebocki. Committed on 03/10/2018 at 15:11. Pushed by hindenburg into branch 'master'. Use new character width code based on Unicode 11 Summary: Adds a code for getting character width togeter with LUTs generated using uni2characterwidth from Unicode 11 lists. Skin tone, flags, gender, and other emoji with and modifer are not joined (you will see e.g. a skin tone square + generic yellow emoji). I think joining them would cause problems in most editors, command line prompts, and other programs which use character width data, as the characters would behave as combining or emoji depending on context (like ligatures). Examples: * light thumb up: 👍🏻 * dark thumb up: 👍🏿 * Polish flag: 🇵🇱 This behavior is allowed: * https://unicode.org/reports/tr51/#Emoji_Modifiers_Display * https://unicode.org/reports/tr51/#Emoji_ZWJ_Sequences It is possible to add support for sequences, but those would work only for a string width functions. Some characters which can be presented as emoji are narrow (e.g. ✖️, ©️). Those characters are listed without "presentation" mode, which means they should be rendered as text by default (real presentation depends on renderer and/or font). Noto Sans Color Emoji renders them as wide, DejaVu Sans as narrow. Vim, bash and zsh treat them as narrow, so I made them narrow. https://unicode.org/reports/tr51/#Presentation_Style Related: bug 378124, bug 392171, bug 339439 FIXED-IN: 18.12 Depends on D15757 Test Plan: * Look at emoji_test.txt - emojis should look "normal" (two characters width). * Look at GLASS.txt - characters width should look correct. * CharacterWidthTest should pass. * perl -XCSDL -e 'print map{chr($_), " "} 1..0xffff' Reviewers: #konsole, #vdg, hindenburg Reviewed By: #konsole, hindenburg Subscribers: hindenburg, broulik, ngraham, konsole-devel Tags: #konsole Differential Revision: https://phabricator.kde.org/D15758 D +0 -64 COPYING.Unicode M +1 -1 src/CMakeLists.txt M +2 -2 src/Character.h A +159 -0 src/CharacterWidth.cpp [License: GENERATED FILE] * A +8 -0 src/CharacterWidth.h [License: UNKNOWN] * A +102 -0 src/CharacterWidth.src.cpp [License: GPL (v2+)] M +1 -1 src/Filter.cpp M +1 -1 src/TerminalCharacterDecoder.cpp M +1 -1 src/TerminalDisplay.cpp M +6 -2 src/autotests/CharacterWidthTest.cpp D +0 -238 src/konsole_wcwidth.cpp D +0 -16 src/konsole_wcwidth.h A +3 -0 tools/uni2characterwidth/overrides.txt The files marked with a * at the end have a non valid license. Please read: https://community.kde.org/Policies/Licensing_Policy and use the headers which are listed at that page. https://commits.kde.org/konsole/e74cf6c36642247f3f79194da373d01a00645d36
Created attachment 115385 [details] Diff in width calculation between xterm-335, glibc-2.28, vte-glib-2.59.0 and konsole-master xterm does runtime detection which can lead to the libc wcwidth beeing used or the internal one.
Created attachment 115386 [details] Diff in width calculation between vte-glib-2.59.0 and konsole-master
This seems to be quite close to what vte/gnome-terminal uses. But not quite the same. vte combines hangul clusters so it's correct for them. It seems konsole does so too so maybe it should also list them as combining like vte?
(In reply to Martin Hostettler from comment #19) > This seems to be quite close to what vte/gnome-terminal uses. But not quite > the same. Surrogates technically should be invalid; they are intended to be combined in pairs (high+low) when using UTF16 to form a valid UTF32 code point. Decoder does the conversion, so valid pairs are passed as another code point. I made regional indicator letters wide manually because of vim, and width in noto sans color emoji and emoji one. glibc 2.27 treat them as narrow, so I'm not sure which width should we choose. > vte combines hangul clusters so it's correct for them. It seems konsole does > so too so maybe it should also list them as combining like vte? I was thinking about this, also with emoji skin tones. Vim and glibc 2.27 treat them as separate and narrow, so I opted for compatibility rather than good look. I don't know Korean nor its input methods, but I read they are almost never used in normal text, as everything is already present in composed form as separate code points. I think it should not cause problems in real world use cases.
About the Surrogates: You are right. I modified my script to ignore those like C0 and C1. They should never reach this code. About the hangul: Konsole before this change did display NFD in a way that looked correct to my untrained eye. After this change decomposed forms are displayed as 1 wide and 2 narrow characters. At least OS X uses NFD for file names, so i would expect that these decomposed forms will end up in places where konsole will encounter them. For the regional indicator symbols: Those are a whole new bag of fun. I wonder how other terminals handle emoji joining... glibc and glib (vte) consider them narrow. Unifont and Noto consider them wide. And what users actually want to see is two of them joined together to be a flag(supported in Noto seems to be 2 cells wide, not supported in unifont). Konsole see two of them as 2 wide characters (i.e. 4 cells). Interesting enough kitty displays a flag which is 4 cells wide. Not sure how expected that is. I wonder how much sense it makes to deviate from what glibc says when the actual display is really not that great anyway (for flags / regional indicator symbols). I'm worried about proliferation of lots of slightly different width mapping functions that applications *somehow* have to know about. tangent: I think in the long run we need some way for applications to download the actually used width mapping from the terminal to get decent interoperability across versions and different machines in a network. I just notices that emoji joining is yet another aspect that needs to be kept in sync between terminal and the application.
(In reply to Martin Hostettler from comment #21) > About the hangul: Konsole before this change did display NFD in a way that > looked correct to my untrained eye. After this change decomposed forms are > displayed as 1 wide and 2 narrow characters. At least OS X uses NFD for file > names, so i would expect that these decomposed forms will end up in places > where konsole will encounter them. Wow, I didn't know that. I'll send a patch which will change the widths to allow characters to compose. > For the regional indicator symbols: Those are a whole new bag of fun. I > wonder how other terminals handle emoji joining... > > glibc and glib (vte) consider them narrow. Unifont and Noto consider them > wide. And what users actually want to see is two of them joined together to > be a flag(supported in Noto seems to be 2 cells wide, not supported in > unifont). Konsole see two of them as 2 wide characters (i.e. 4 cells). > Interesting enough kitty displays a flag which is 4 cells wide. Not sure how > expected that is. > I wonder how much sense it makes to deviate from what glibc says when the > actual display is really not that great anyway (for flags / regional > indicator symbols). Nice idea in kitty. Looks as intended, but is still width-compatible (assuming each indicator is wide). Since glib(c) assume the indicators are narrow, it could be possible to display one wide flag and keep compatibility with wcwidth-using software (i.e. everything in ncurses, shells). It would work on FreeBSD too (wcwidth works only for 16 bit code points, higher CPs are narrow). Not sure how it would work on macOS. > I'm worried about proliferation of lots of slightly different width mapping > functions that applications *somehow* have to know about. > > tangent: > I think in the long run we need some way for applications to download the > actually used width mapping from the terminal to get decent interoperability > across versions and different machines in a network. I just notices that > emoji joining is yet another aspect that needs to be kept in sync between > terminal and the application. There is already POSIX solution which could solve that - wcwidth() and wcswidth(). The latter could even calculate widths of joined emojis. Sadly, it works differently on every platform. And there are separate implementations in Go, Python, etc. This is why terminals and console programs sometimes try to make its own table which works the same (or at least is up to date) on every platform. Or, if someone likes hacks, LD_PRELOAD=libkonsole_wcwidth.so ;)
(In reply to Mariusz Glebocki from comment #22) > (In reply to Martin Hostettler from comment #21) > > tangent: > > I think in the long run we need some way for applications to download the > > actually used width mapping from the terminal to get decent interoperability > > across versions and different machines in a network. I just notices that > > emoji joining is yet another aspect that needs to be kept in sync between > > terminal and the application. > > There is already POSIX solution which could solve that - wcwidth() and > wcswidth(). The latter could even calculate widths of joined emojis. Sadly, > it works differently on every platform. And there are separate > implementations in Go, Python, etc. > This is why terminals and console programs sometimes try to make its own > table which works the same (or at least is up to date) on every platform. > > Or, if someone likes hacks, LD_PRELOAD=libkonsole_wcwidth.so ;) Well if that would be the solution konsole would just use it, right? But: * There's the private use area. Only a component with access to the font can know what would look right for these characters. (Or the user could configure that, i'm sure most users don't want to do that manually) * width mapping is a moving target as long as unicode evolves. With the current release cycle even a fast moving target. * Rendering breaks badly when values in the terminal and the client application don't match. (that happens to every terminal with a up to date table) * the usage of ssh with terminals means that connections will span systems with differently outdated libraries. * LD_PRELOAD hacks are not really a sustainable solution. For me that sounds a lot like there needs to be a single point of truth in every terminal session. As long as things move faster than installations update and there is no standard all stakeholders can agree on (and UAX #11 explicitly disclaims being for "modern terminal emulators") there needs *some* way for applications to get this information. From the terminals i looked at only urxvt really buys in to "trust the system libc". And konsole, vte and xterm (i have to assume) intentionally diverge from what libc does. I think from the perspective of a client library the equivalent of roughly one full screen update with a decently sized terminal in data transfer would be an quite attractive trade of ;)