Bug 395171 - Remove UTF-16 and other non ASCII compatible encodings
Summary: Remove UTF-16 and other non ASCII compatible encodings
Status: REPORTED
Alias: None
Product: konsole
Classification: Applications
Component: general
Version: unspecified
Platform: Other Linux
Importance: NOR minor
Target Milestone: ---
Assignee: Konsole Developer
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-06-09 09:52 UTC by Egmont Koblinger
Modified: 2021-05-02 20:02 UTC

See Also:
Latest Commit:
Version Fixed In:


Description Egmont Koblinger 2018-06-09 09:52:36 UTC
Non-ASCII-compatible encodings (UTF-16, UCS-2; not sure about UTF-7) should be removed from the list of offered encodings.

In these encodings, any valid character might easily include a byte in the 0x00 - 0x1F range. Such a byte triggers a special action according to the kernel's line discipline, such as sending an interrupt to the foreground process, sending EOF, inserting a newline, wiping out the buffer etc., or is simply echoed back in a way that's broken in UTF-16.
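The point about control bytes can be illustrated with a small sketch. It assumes little-endian UTF-16 and the usual Linux default stty special characters (^C for interrupt, ^D for EOF, ^U for kill); the helper name `control_bytes` and the sample characters are made up for illustration and are not taken from Konsole.

```python
# Control bytes the tty line discipline commonly acts on.
# Assumption: default stty settings (intr=^C, eof=^D, kill=^U).
SPECIAL = {
    0x03: "VINTR (Ctrl-C, interrupt)",
    0x04: "VEOF (Ctrl-D, end of file)",
    0x0A: "newline",
    0x15: "VKILL (Ctrl-U, wipe the line)",
}

def control_bytes(text, encoding="utf-16-le"):
    """Return (offset, byte) pairs for every byte in 0x00-0x1F of the encoded text."""
    return [(i, b) for i, b in enumerate(text.encode(encoding)) if b <= 0x1F]

if __name__ == "__main__":
    # 'A', Latin a-breve, Latin C-dot-above, Cyrillic IE: all ordinary characters.
    for ch in ["A", "\u0103", "\u010a", "\u0415"]:
        for offset, byte in control_bytes(ch):
            meaning = SPECIAL.get(byte, "other control byte")
            print(f"U+{ord(ch):04X} byte {offset}: 0x{byte:02X} {meaning}")
```

The same characters encoded as UTF-8 contain no byte below 0x20, which is what "ASCII compatible" buys here.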

That is, the behavior is bound to be broken big time, there's nothing Konsole could do to fix this.

The kernel expects the data on the terminal lines to be ASCII compatible.

I'm almost certain the Linux kernel doesn't support UTF-16 here, and I don't think other Unixes do either. (If some do, offering this option should be limited to those platforms.)

Not to mention that Konsole's UTF-16 mode inserts a BOM in front of every chunk of input, which is also broken.
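A hedged illustration of how a per-chunk BOM can arise: this is not Konsole's code, only a sketch of the general effect using Python's `codecs` module. Encoding each chunk independently with a BOM-writing codec prefixes every chunk, whereas a stateful encoder writes the BOM once at the start of the stream. The function names are hypothetical.

```python
import codecs

BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

def encode_per_chunk(chunks):
    # Each chunk is converted on its own, so each one gets its own BOM.
    return [chunk.encode("utf-16") for chunk in chunks]

def encode_as_stream(chunks):
    # A stateful encoder remembers that the BOM has already been written.
    encoder = codecs.getincrementalencoder("utf-16")()
    return [encoder.encode(chunk) for chunk in chunks]

def bom_count(encoded_chunks):
    """Count how many encoded chunks begin with a UTF-16 byte order mark."""
    return sum(1 for data in encoded_chunks if data.startswith(BOMS))

if __name__ == "__main__":
    keystrokes = ["l", "s", "\n"]
    print("per chunk:", bom_count(encode_per_chunk(keystrokes)), "BOMs")
    print("as stream:", bom_count(encode_as_stream(keystrokes)), "BOM")
```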

See https://unix.stackexchange.com/questions/448745/strange-konsole-character-encoding-behavior/448774 for a related question and my more detailed answer.

(Bug 115113 might be relevant, UTF-16 support _may_ have been added as part of that bug. Note however that while the summary of the bug mentions both UTF-8 and UTF-16, the description doesn't give any reason whatsoever why UTF-16 was asked for.)
Comment 1 Egmont Koblinger 2018-06-09 11:02:35 UTC
(Forget the last paragraph. I didn't realize it wasn't a konsole bug.)
Comment 2 Justin Zobel 2020-11-03 02:17:36 UTC
Thanks for the detailed report, Egmont. Can you please confirm whether this issue still occurs in recent Konsole versions? I couldn't find anything about encodings in Konsole.
Comment 3 Bug Janitor Service 2020-11-18 04:33:49 UTC
Dear Bug Submitter,

This bug has been in NEEDSINFO status with no change for at least
15 days. Please provide the requested information as soon as
possible and set the bug status as REPORTED. Due to regular bug
tracker maintenance, if the bug is still in NEEDSINFO status with
no change in 30 days the bug will be closed as RESOLVED > WORKSFORME
due to lack of needed information.

For more information about our bug triaging procedures please read the
wiki located here:
https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging

If you have already provided the requested information, please
mark the bug as REPORTED so that the KDE team knows that the bug is
ready to be confirmed.

Thank you for helping us make KDE software even better for everyone!
Comment 4 Egmont Koblinger 2020-11-18 08:24:01 UTC
> Can you please confirm this issue still occurs in recent konsole versions.

I have version 19.12.3.

It's Settings -> Edit Current Profile -> Advanced -> Default character encoding.

Also right-click on the terminal -> Set Encoding.

I assume you're a Konsole developer. I'm pretty certain that you can come up with a definite answer, that is, either locate this feature in newest Konsole (even if the menus were rearranged), or find the commit which removed it, in no more time than it would take for me to test the newest version. Honestly, I don't quite understand why you needed this feedback from me at all. If the said menus are no longer there, could you please do the research yourself? Thanks!
Comment 5 Justin Zobel 2020-11-18 22:26:23 UTC
(In reply to Egmont Koblinger from comment #4)

Thank you for the update Egmont. I am not a konsole developer, I am part of the KDE Bug Triage team and we are working to confirm bugs that have been reported so that the developers can work on the fixes.

I've asked one of the developers to look in on this bug as it's a bit above my level of knowledge.
Comment 6 Kurt Hindenburg 2021-02-09 05:10:47 UTC
Konsole uses KCodecAction which uses KCodecs/KCharsets.  I'm not sure it is even possible to ask for a certain sub-set or how much extra work would be required.

Leave this BKO open; perhaps someone will have the time to research it.
Comment 7 Jayadevan 2021-04-30 05:58:56 UTC
Please reject such proposals, as those are discriminatory. UTF-8 is Anglo-centric. UTF-16 treats each writing system more fairly.

Since KDE Internally uses UTF-16, UTF-16 should be supported. Also, UTF-16 is used by KDE, QT, C/C++ (From ICU), Java, Windows, JavaScript, Android, DartVM, Dart Language, and modern frameworks like Flutter. UTF-16 should get first class native support.
Comment 8 Egmont Koblinger 2021-04-30 21:49:45 UTC
(In reply to Jayadevan from comment #7)

I stopped working on terminal emulation about a year ago. Yet, I'm making a single exception here to respond (i.e. I most likely won't follow up, so please don't write back expecting a response from me).


> Please reject such proposals, as those are discriminatory.

I firmly refute this claim.

There is nothing discriminatory in the proposal whatsoever.

The reason behind this request – and this should be obvious to everyone who takes time to really _understand_ the post and the linked article – is that UTF-16 (and a few friends) as the _I/O_ encoding *does not work*, *never worked* and even more importantly, *cannot be fixed to work*.

More precisely, you can write a terminal emulator that speaks this encoding, but when placed in its context (i.e. surrounded by a Unix kernel, libc, higher level libraries, tools, apps, tmux-likes, other computers to ssh to/from, etc.) it won't do anything that makes sense, since all the surrounding infrastructure only support ASCII-compatible encodings for the communication with the terminal.

In order to support UTF-16 as the _I/O_ encoding, in a way that you actually get a working ecosystem around the terminal with this encoding, you'd need modifications to the kernel's tty handling (line discipline, stty special characters etc.), the kernel's tty-accessing API (to enforce UTF-16, or at least an even number of bytes on all operations that write to / read from a tty, or work with 16-bit units instead of 8-bit ones, in order to exclude the possibility of going out of sync, causing permanent breakages), accompanied by the corresponding changes in standards (e.g. POSIX); you'd need these changes in libc too; you'd need heavy modifications in all the apps (e.g. change from '\0'-terminated byte strings to wide strings or whatnot); you'd need to throw out any shell script that contains even an "echo foo" (in an ASCII-compatible encoding) because that would outright break the terminal if sent out as-is; you'd need to rethink "cat" (how to transfer a potentially odd number of bytes into a channel that expects even numbers); you'd need to add UTF-16 locales; and so on and so forth... I just sketched a tiny subset of the problems. You'd need to essentially rethink and adjust all the APIs, libraries, every single tool or application inside the terminal, literally everything. All this in order to create a system that's utterly incompatible with what we already have, and whose user-visible outcome is not one bit better. It's clearly not going to happen, and even if it happened, it would be clearly harmful.
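The "going out of sync" problem can be sketched as follows. This is a minimal illustration, assuming a terminal that decodes its input as little-endian UTF-16 and substitutes a replacement character for undecodable input; the function name is hypothetical. A single stray byte, or plain ASCII output such as that of "echo foo", shifts or corrupts every 16-bit unit that follows.

```python
def terminal_decode(data):
    """Decode bytes the way a hypothetical UTF-16-LE terminal would see them."""
    return data.decode("utf-16-le", errors="replace")

if __name__ == "__main__":
    well_formed = "echo foo\n".encode("utf-16-le")

    # Case 1: a correctly encoded stream decodes as expected.
    print(repr(terminal_decode(well_formed)))

    # Case 2: one stray byte in front (an odd-length write by some tool)
    # shifts every following 16-bit unit by one byte.
    print(repr(terminal_decode(b"x" + well_formed)))

    # Case 3: a program that simply writes ASCII, like a shell script's "echo foo".
    print(repr(terminal_decode(b"echo foo\n")))
```

Nothing in a byte-oriented tty API prevents cases 2 and 3, which is why the enforcement would have to happen in the kernel and in every program that writes to the terminal.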

There is no politics or discrimination at all here, this is purely technical.


> UTF-8 is Anglo-centric. UTF-16 treats each writing system more fairly.

UTF-8 can represent the exact same things as UTF-16. They support all writing systems to the very same extent.

The only sense in which one can perhaps claim that UTF-8 is Anglo-centric, is that it uses 1 byte for English letters vs. 3 bytes for CJK (Chinese, Japanese, Korean) symbols; whereas UTF-16 uses 2 for both. Given that an English letter represents, well, a single letter of a word, whereas a CJK symbol represents a syllable or an entire word, I actually do think UTF-8's 1:3 split is a way more fair system. (Let alone that the typical work happening inside a terminal is usually English-centric.)

By the way: who cares? With today's network speeds, combined with the tiny amount of terminal data compared to any other activity you do over any network, the difference in the byte count just simply does not matter at all.


> Since KDE Internally uses UTF-16, UTF-16 should be supported.

Trying to make a connection between the _internal_ encoding and the _I/O_ encoding is not justified at all.

As an occasional user of Konsole I don't have the slightest idea what encoding it uses _internally_, and it should be this way. Users shouldn't care, users shouldn't need to care. If users needed to care, it would mean that the developers did a terrible job. The internal encoding is subject to change by the developers at any time, without any user noticing it.

What _I/O_ encodings Konsole supports (or, in this case: incorrectly claims to support) is an utterly independent story.


> Also, UTF-16 is used by KDE, QT, C/C++ (From ICU), Java, Windows,
> JavaScript, Android, DartVM, Dart Language, and modern frameworks
> like Flutter.

You see: they made a choice. They don't offer alternatives, they decided on one encoding.

The same goes for terminals. They decided on UTF-8; unsurprisingly, since for millions of technical reasons, the encoding needs to be ASCII-compatible, whereas there's a natural need to encode any text.

Many modern terminal emulators only support UTF-8 encoding and nothing else. Many other terminal emulators support some legacy deprecated ones for backwards compatibility, back from the days when the world hadn't settled on UTF-8, but those at least work. And then there's Konsole offering some choices that never worked, don't work, will never work due to millions of technical issues.

The direction is not to offer alternatives senselessly. Especially not if such an alternative would require to redesign and rewrite pretty much every single component of the ecosystem. The direction is one single mode of operation that is perfect for everybody. As for the terminals' _I/O_ encoding, this is UTF-8.

No culture, no language, no writing system, no human being was discriminated against by this choice. The current UTF-8 approach supports everything that the UTF-16 approach could support if it were reasonable and feasible to implement, which it is not.

Choosing one technical solution over the other – even if the other was viable too, which is not the case here – is not discrimination. It is proper engineering.

The current bugreport is about the removal of a claimed feature that doesn't work, never worked, and cannot be made to work.
Comment 9 Jayadevan 2021-05-02 11:47:29 UTC
You said you won't respond, but for the sake of clarity for others, I have to reply.


(1) All strings should be sanitised, so that they will be perfectly safe, and will not break anything.
(2) It is racist to suggest that all non-English people are Chinese (or Japanese or Korean). Most scripts in the world are given only 3 byte encodings per character in UTF-8, and not a code point per spoken word, as you say. That is a lie.
(3) The world has still not settled on UTF-16. But modern languages and platforms tend to do so. Java, Dotnet, ICU, KDE, QT, Windows NT, JavaScript, Dart, Flutter...

In today's world, support for both the modern UTF-16 and the legacy UTF-8 is needed.
Comment 10 Jayadevan 2021-05-02 11:48:46 UTC
(In reply to Jayadevan from comment #9)

(The above comment is in response to Egmont Koblinger.)
Comment 11 tcanabrava 2021-05-02 12:01:26 UTC
This thread is now under Community Working Group supervision.

> (1) All strings should be sanitised, so that they will be perfectly safe, and will not break anything.

You are clearly ignoring the issues pointed out by Egmont; sanitization has nothing to do with this.

> (2) It is racist to suggest that all non-English people are Chinese (or Japanese or Korean).

Please take a look at the KDE Code of Conduct. We will not tolerate accusations of racism against what was meant to be an explanation based on an example. Whether more scripts than CJK use more bytes per encoded character is irrelevant.

> Most scripts in the world are given only 3 byte encodings per character in UTF-8, and not a code point per spoken word, as you say. That is a lie.
>
> (3) The world has still not settled on UTF-16. But modern languages and platforms tend to do so. Java, Dotnet, ICU, KDE, QT, Windows NT, JavaScript, Dart, Flutter...
> In today's world, support for both the modern UTF-16 and the legacy UTF-8 is needed.

Patches welcome. I won't spend time working on this until the *base software* (bash, zsh, etc.) supports it.
Comment 12 Egmont Koblinger 2021-05-02 12:16:23 UTC
tcanabrava,

Thanks a lot for stepping in!

As anyone can see here, Jayadevan accused my proposal of being discriminatory, and then, when I provided purely technical arguments to disprove this claim (by the way, those technical arguments are backed up by 20+ years of understanding terminal emulators and their surrounding infrastructure; including 6+ years of being a developer of a terminal emulator (not Konsole)), the said person, obviously without even attempting to understand the technical arguments, accused my words of being racist.

This behavior is utterly outrageous and unacceptable, and I believe that the said person already deserves to be banned. I was about to report this behavior, but as you say you've already taken action – thanks again for that –, I think there's nothing more I could or should do here.
Comment 13 Jayadevan 2021-05-02 19:26:03 UTC
(In reply to tcanabrava from comment #11)


He stated that scripts other than English use one code point to stand for one "syllable or an entire word", with CJK as the example. That is a cherry-picked example used to prove a wrong point. The conclusion was that one code point can take 3 bytes for non-Latin scripts, as they have one word per code point.

Most scripts, such as Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Ethiopic, Cherokee, Unified Canadian Aboriginal Syllabics, Khmer, and many others, which are used by billions of people, take 3 bytes per code point in UTF-8 and have only one phoneme per code point, contrary to what he said.

He cherry-picked examples to prove a wrong point. He said: "The only sense in which one can perhaps claim that UTF-8 is Anglo-centric, is that it uses 1 byte for English letters vs. 3 bytes for CJK (Chinese, Japanese, Korean) symbols; whereas UTF-16 uses 2 for both. Given that an English letter represents, well, a single letter of a word, whereas a CJK symbol represents a syllable or an entire word, I actually do think UTF-8's 1:3 split is a way more fair system." The implication is clearly that, other than English (or Latin), the only scripts that matter are CJK. That is clearly inappropriate towards people from South Asia, SE Asia, the Cherokee, Canadian Aboriginals etc.

The scripts of South Asia, SE Asia, Cherokee, Canadian Aboriginals etc. deserve equal status as English. These scripts are used by billions of people. Claiming that "The only sense in which one can perhaps claim that UTF-8 is Anglo-centric, is that it uses 1 byte for English letters vs. 3 bytes for CJK" ignores the importance of scripts used by billions of humans. It is a factually wrong statement, and not just a case of using a bad example.
Comment 14 tcanabrava 2021-05-02 19:55:09 UTC
Jayadevan, 

As a Konsole developer, I prefer to have things based on facts as explained by Egmont. *Even* if it was just *one* issue with a language it's enough reason to not use UTF16.

*I don't care* if UTF16 makes things less anglocentric, I care that "cat" understands what's a newline and I care that grep understands what's a "\t", not to mention pipe, <<, and other special chars that are handled by the shell running inside of konsole.


You stepped out of line as soon as you used the 'racist' card, and *I will not* tolerate this kind of accusation. You had one warning, I urge you to not have three.
Comment 15 Egmont Koblinger 2021-05-02 20:02:24 UTC
I did indeed make a technical mistake.

For some mysterious reason, what I incorrectly had in my mind was that the boundary beyond BMP (i.e. at U+FFFF) is where UTF-8 increases from 2 bytes to 3. This is indeed not correct, this is where it increases from 3 bytes to 4. The increase from 2 bytes to 3 happens at a much lower codepoint. Therefore, indeed, there are many letter-based scripts that use 3 bytes per letter in UTF-8.
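The boundaries described in this correction can be checked directly. The sketch below only counts encoded bytes per code point; the helper name `sizes` and the sample code points are chosen for illustration.

```python
def sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single code point."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

SAMPLES = {
    "U+007F (last 1-byte UTF-8 code point)": "\u007f",
    "U+0080 (first 2-byte)": "\u0080",
    "U+07FF (last 2-byte)": "\u07ff",
    "U+0800 (first 3-byte)": "\u0800",
    "U+0905 (Devanagari letter A)": "\u0905",
    "U+4E2D (CJK ideograph)": "\u4e2d",
    "U+FFFF (end of the BMP, last 3-byte)": "\uffff",
    "U+10000 (first beyond the BMP)": "\U00010000",
}

if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        utf8, utf16 = sizes(ch)
        print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

So the 2-to-3 byte step in UTF-8 sits at U+0800, and the 3-to-4 byte step sits at U+10000, which is also where UTF-16 goes from 2 to 4 bytes.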

I wrote the CJK stuff with this mistake in mind. I was technically incorrect, and for this I do apologize to everyone.

* * *

That being said:

- Jayadevan still does not understand, and apparently still refuses to even try to understand, that UTF-16 cannot be made to work in terminals for a plethora of reasons, including, but not limited to the fact that all components of the system that interact with terminals only support ASCII-compatible encodings there;

- ignores that UTF-16 mode never worked; that is, firmly speaks up against removing something that, again, *never* *worked*, *still* *does* *not* *work* and *can* *not* *be* *fixed*;

- ignores that the work happening inside terminal emulators is typically English-centric by its nature;

- ignores that the difference in the byte count simply does not matter at all;

- ignores that choosing one technical solution over the other, even if that technical solution results in a higher network traffic for some languages vs. a lower one for some others, is no discrimination whatsoever (and even if it was, the right place to complain would be the Unicode Consortium, and not Konsole);

- ignores that even if terminals and their surrounding infrastructure could implement UTF-16 support (which, again, they cannot reasonably do), this would mean switching from 1 standard to 2 incompatible ones being used concurrently, which would obviously bring plenty of problems (the very exact problem that Unicode was meant to fix) and would not solve anything;

- in comments 7 & 9 used wording that is at the very least borderline unacceptable.