471483 – Problems with C1 control codes (U+0080 through U+009F)

Status:	REPORTED

Alias:	None

Product:	konsole
Classification:	Applications
Component:	emulation (other bugs)
Version First Reported In:	22.12.3
Platform:	Debian stable Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Konsole Bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-06-26 21:20 UTC by Frank Heckenbach
Modified:	2023-11-21 17:59 UTC
CC List:	1 user

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:

Attachments
Patch (556 bytes, patch) 2023-11-21 17:59 UTC, Frank Heckenbach

Description Frank Heckenbach 2023-06-26 21:20:47 UTC

Konsole recently (apparently between versions 20 and 22) added support for 8-bit C1 control codes (U+0080 through U+009F). While formally correct, in practice this seems to cause more problems than it solves:

On the one hand, I don't know of any application that actually outputs these characters. Wikipedia (https://en.wikipedia.org/wiki/C0_and_C1_control_codes) seems to agree: "the 8-bit forms of these codes are almost never used. CSI, DCS and OSC are used to control text terminals and terminal emulators, but almost always by using their 7-bit escape code representations."

On the other hand, they can actively cause problems (which contributed to their not being used much). In the past, there were issues in environments that were not 8-bit clean; these days the issues are rather with UTF-8. To quote Wikipedia again, "the UTF-8 encodings of their corresponding codepoints are two bytes long like their escape code forms (for instance, CSI at U+009B is encoded as the bytes 0xC2, 0x9B in UTF-8), so there is no advantage to using them rather than the equivalent two-byte escape sequence. When these codes appear in modern documents, web pages, e-mail messages, etc., they are usually intended to be printing characters at that position in a proprietary encoding such as Windows-1252 or Mac OS Roman that use the C1 codes to provide additional graphic characters."
... or, I'd like to add, mojibake. For example, the German letter "ß" is U+00DF, with the UTF-8 encoding 0xC3 0x9F. I had a long-running program (with UTF-8 output) in a Konsole window that was accidentally set to ISO-8859-1. From the first occurrence of that letter, Konsole waited for the end of the supposed APC sequence (0x9F is the 8-bit APC control code), which never came, so it swallowed all further output, probably including some important messages from the program. Sure, mojibake is not nice in general, but for languages with few non-ASCII characters, such as German, it is quite tolerable. Swallowing all output makes matters much worse.
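
The byte-level picture can be checked with a short Python sketch (assuming Python 3.8 or later; this is only an illustration and has nothing to do with Konsole's code). It encodes the example word as UTF-8, misreads the bytes as ISO-8859-1, and shows that the second byte of "ß" lands on U+009F (APC). It also compares the length of 8-bit CSI under UTF-8 with the 7-bit form ESC [.

```python
# Illustration only: "ß" (U+00DF) encoded as UTF-8, then misread as ISO-8859-1.
utf8_bytes = "Groß\n".encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")

APC = "\u009f"  # C1 "Application Program Command"
CSI = "\u009b"  # C1 "Control Sequence Introducer"

print(utf8_bytes.hex(" "))                    # 47 72 6f c3 9f 0a
print([f"U+{ord(c):04X}" for c in misread])   # ..., 'U+00C3', 'U+009F', 'U+000A'
print(APC in misread)                         # True

# Under UTF-8, 8-bit CSI takes two bytes, the same as the 7-bit form ESC [.
print(len(CSI.encode("utf-8")), len(b"\x1b["))  # 2 2
```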

So I'd suggest adding at least an option to disable their handling.

STEPS TO REPRODUCE
1. Set encoding to ISO-8859-1 in Konsole window
2. Run the following in that window (this should be independent of shell and locale settings, though a UTF-8 locale must be installed):
LC_ALL=C.UTF-8 /usr/bin/printf 'Gro\u00df\n'; echo Good

OBSERVED RESULT
GroÃ

(Output is cut off and the window appears "dead", though it may be revived by control characters in the shell prompt.)

EXPECTED RESULT
GroÃ?
Good
%

(Mojibake in first line, but second line correct.)

Comment 1 Frank Heckenbach 2023-11-21 17:59:08 UTC

Created attachment 163346
Patch

The attached patch disables 8-bit C1 character handling entirely. As stated in my original report, this is unlikely to cause any problems, but if deemed necessary, it can be made optional. For that purpose, I introduced a variable handle8bitC1, which so far is constant but can be made configurable. (I'd still suggest that the default be false.)
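
The effect of such a switch can be sketched with a toy model in Python. This is not the attached patch and not Konsole's parser: the function `render` is hypothetical, only the 8-bit APC start is modelled, and showing unhandled C1 characters as "?" is an assumption chosen to match the expected result above. Only the name handle8bitC1 is taken from the patch description.

```python
ESC = "\x1b"
APC = "\u009f"  # 8-bit C1 "Application Program Command"
ST = "\u009c"   # 8-bit C1 "String Terminator"

def render(text: str, handle8bitC1: bool) -> str:
    """Toy model (hypothetical): return the characters that would be shown."""
    shown = []
    in_string = False  # True while inside an unterminated APC string
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            # Everything is swallowed until ST (8-bit) or ESC \ (7-bit).
            if ch == ST:
                in_string = False
            elif ch == ESC and text[i + 1:i + 2] == "\\":
                in_string = False
                i += 1
        elif handle8bitC1 and ch == APC:
            in_string = True
        elif "\u0080" <= ch <= "\u009f":
            shown.append("?")  # assumption: unhandled C1 shown as a placeholder
        else:
            shown.append(ch)
        i += 1
    return "".join(shown)

# UTF-8 output misread as ISO-8859-1, followed by a second line.
stream = "Groß\n".encode("utf-8").decode("iso-8859-1") + "Good\n"
print(repr(render(stream, handle8bitC1=True)))   # 'GroÃ'
print(repr(render(stream, handle8bitC1=False)))  # 'GroÃ?\nGood\n'
```

With the flag set to true, the toy model reproduces the observed result: everything after the stray APC is swallowed because no terminator ever arrives. With the flag set to false, the mojibake remains but the following line survives.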