Bug 463848

Summary: KDE Text Encoding for Korean (applies to KWrite and SubtitleComposer in Flatpaks)
Product: [I don't know] kde
Reporter: Jonathan Joseph Chiarella <j_j_chiarella>
Component: general
Assignee: Unassigned bugs mailing-list <unassigned-bugs>
Status: CONFIRMED ---
Severity: normal
CC: hein, kde, nate, nicolas.fella
Priority: NOR
Version: unspecified
Target Milestone: ---
Platform: Ubuntu
OS: Linux
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description Jonathan Joseph Chiarella 2023-01-04 20:10:26 UTC
SUMMARY

The text encoding for Korean is broken/wrong on KDE software like KWrite and SubtitleComposer. The "Save As with Encoding ... EUC-KR" is actually not EUC-KR, but Unified Hangul Code/Windows-949/CP 949. The "Save As with Encoding ... CP 949" just corrupts every single non-ASCII character.

STEPS TO REPRODUCE
1.  Create a text file in Unicode (UTF-8), which is the default.
2.  Insert Korean Hangul text like `로씨써쑤쪼뢔쌰쎼쓔쬬`
3a. Save As with Encoding ... EUC-KR
or
3b. Save As with Encoding ... CP 949

OBSERVED RESULT

With EUC-KR, all of the characters `로씨써쑤쪼뢔쌰쎼쓔쬬` are present in the file. (`로씨써쑤쪼` are in EUC-KR; `뢔쌰쎼쓔쬬` are theoretically possible blocks but are *not* in EUC-KR.)

With CP 949 all of the characters `로씨써쑤쪼뢔쌰쎼쓔쬬` become `??????????`. (`로씨써쑤쪼뢔쌰쎼쓔쬬` *are* all in Windows-949/CP 949/UHC.)

EXPECTED RESULT

With EUC-KR, the characters `로씨써쑤쪼뢔쌰쎼쓔쬬` should become `로씨써쑤쪼?????` or `로씨써쑤쪼` because `로씨써쑤쪼` *are* in EUC-KR, but `뢔쌰쎼쓔쬬` are *not* in EUC-KR, despite being theoretically possible arrangements of letters into pre-composed blocks.

With CP 949, the characters `로씨써쑤쪼뢔쌰쎼쓔쬬` should all be preserved as `로씨써쑤쪼뢔쌰쎼쓔쬬` because *all* are in CP 949/Windows-949/UHC.
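
The expected behavior above can be sketched in a few lines of Python. This is only an illustration, not how KWrite or KCodecs implement saving. The helper names `in_euc_kr`, `save_as_euc_kr`, and `save_as_cp949` are hypothetical. The sketch assumes that Python's `cp949` codec covers all of UHC and that a Hangul block belongs to EUC-KR exactly when both of its CP 949 bytes are in the range 0xA1-0xFE.

```python
def in_euc_kr(ch: str) -> bool:
    """Return True if ch is ASCII or its CP 949 bytes lie in the EUC-KR range.

    Assumption: a two-byte CP 949 sequence with both bytes >= 0xA1 is also
    a valid EUC-KR (KS X 1001) sequence; anything else is a UHC extension.
    """
    if ord(ch) < 0x80:
        return True
    try:
        raw = ch.encode("cp949")
    except UnicodeEncodeError:
        return False
    return len(raw) == 2 and raw[0] >= 0xA1 and raw[1] >= 0xA1


def save_as_euc_kr(text: str) -> bytes:
    """Strict EUC-KR: characters outside the EUC-KR range become '?'."""
    out = bytearray()
    for ch in text:
        out += ch.encode("cp949") if in_euc_kr(ch) else b"?"
    return bytes(out)


def save_as_cp949(text: str) -> bytes:
    """CP 949 / UHC: every pre-composed Hangul block is preserved."""
    return text.encode("cp949", errors="replace")


if __name__ == "__main__":
    sample = "로씨써쑤쪼뢔쌰쎼쓔쬬"
    # Decoding with cp949 is fine here because EUC-KR is a subset of it.
    print(save_as_euc_kr(sample).decode("cp949"))
    print(save_as_cp949(sample).decode("cp949"))
```

The byte-range test is used instead of a library `euc_kr` codec because some implementations accept extra characters under that name, which is the very confusion this report is about.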

SOFTWARE/OS VERSIONS

Latest Flatpak as of 2022-12-31, running on Linux (Ubuntu)

ADDITIONAL INFORMATION

EUC-KR *does* have `로씨써쑤쪼`, but it does *not* have `뢔쌰쎼쓔쬬` or `낥` or several other theoretical possibilities. In Korean, one types letters to form blocks. `낥` is theoretically possible. One just types `ㄴ` and `ㅏ` and `ㄹ` and `ㅌ`.  Then, the IME assembles these into the block `낥` and the computer saves this block as a pre-composed block in Unicode.

However, this syllable `낥` never occurs in any native or borrowed words. It is *not* in EUC-KR. The English/Latin script equivalent is writing "igloo" as "ig" and "loo" in pre-composed blocks. Korean usually uses its own alphabet, but with letters arranged into monospaced blocks by morpho-phonemic syllable. (Unicode also does have combining individual letters. It can store `낥` as four code points: `combining ㄴ` and `combining ㅏ` and `combining ㄹ` and `combining ㅌ`. However, Unicode included pre-composed blocks for the sake of round-trip conversion, and no IME has ever moved away from pre-composed blocks. In other words, you will always see the pre-composed blocks in real-life text.)
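
For readers who want to see how the letters map to a pre-composed block, here is a small sketch of the Hangul syllable arithmetic from the Unicode Standard. The function names are made up for illustration. Note one detail: in Unicode's canonical decomposition the final cluster `ㄹㅌ` is a single trailing letter (`ㄾ`, trailing index 13), so `낥` decomposes canonically into three conjoining code points.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28


def compose(lead: int, vowel: int, trail: int = 0) -> str:
    """Build a pre-composed block from zero-based jamo indices.

    trail == 0 means the block has no final consonant.
    """
    return chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + trail)


def decompose(block: str) -> str:
    """Split a pre-composed block into its conjoining jamo."""
    s_index = ord(block) - S_BASE
    lead, rest = divmod(s_index, V_COUNT * T_COUNT)
    vowel, trail = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if trail:
        jamo += chr(T_BASE + trail)
    return jamo


if __name__ == "__main__":
    rare = compose(2, 0, 13)    # lead ㄴ, vowel ㅏ, final ㄾ: the block 낥
    common = compose(2, 0, 8)   # lead ㄴ, vowel ㅏ, final ㄹ: the block 날
    print(rare, hex(ord(rare)), common, hex(ord(common)))
    # The arithmetic agrees with the library's canonical decomposition.
    print(decompose(rare) == unicodedata.normalize("NFD", rare))
```

The formula reaches every combination of 19 leads, 21 vowels, and 28 finals, which is why Unicode contains blocks like `낥` that no word uses.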

To correct this deficiency, Microsoft added *all* possible pre-composed Hangul blocks to a new encoding style. The cost was sacrificing true ASCII compatibility. This encoding, like Shift JIS and others, can have an ASCII byte (0xxxxxxx) as a sole byte (an ASCII character) or as the trailing byte in a two-byte character. Microsoft called its new encoding "Windows-949" or "Code Page 949" or "Unified Hangul Code (UHC)." This price was worth it to ensure that a typo character (`낥` instead of `날`) would not be lost. UTF-8 everywhere is the way to go, of course. Still, many of us need to work with the legacy encodings, especially with smart TVs. (Smart TVs and players only seem to support some form of ISO-8859-# or a variable 1-2-byte encoding.)
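
The trade-off described above can be checked with a rough sketch that walks every pre-composed Hangul block and looks at its CP 949 bytes. It assumes Python's `cp949` codec encodes all of these blocks as two bytes, and it treats "both bytes at or above 0xA1" as the EUC-KR range. The function name is hypothetical.

```python
def classify_hangul_blocks():
    """Count Hangul blocks by where their CP 949 bytes fall.

    Returns (in_euc_kr, extension_only, ascii_trail), where ascii_trail
    counts extension blocks whose second byte is below 0x80.
    """
    in_euc_kr = 0
    extension_only = 0
    ascii_trail = 0
    for cp in range(0xAC00, 0xD7A4):  # U+AC00 through U+D7A3
        lead, trail = chr(cp).encode("cp949")
        if lead >= 0xA1 and trail >= 0xA1:
            in_euc_kr += 1
        else:
            extension_only += 1
            if trail < 0x80:
                ascii_trail += 1
    return in_euc_kr, extension_only, ascii_trail


if __name__ == "__main__":
    euc, ext, ascii_tail = classify_hangul_blocks()
    print("blocks in the EUC-KR range:", euc)
    print("blocks only in the extension:", ext)
    print("extension blocks with an ASCII trailing byte:", ascii_tail)
```

Any block counted in the last category contains a byte that a naive ASCII scanner would mistake for a letter, which is the compatibility cost mentioned above.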

KDE's KWrite and SubtitleComposer as of now do use Windows-949/CP 949/UHC, but the menu option is erroneously titled `EUC-KR`. There is a menu option for CP 949 that does not work at all. This is confusing.

SUGGESTION

1. Change the behavior of the menu entry that says `EUC-KR` so that it behaves as expected and rejects characters like `낥`.
2. Make the menu entry that says `CP 949` just do what the menu entry called `EUC-KR` does right now.

OR ...

1. Change the menu entry that currently and erroneously says `EUC-KR` so that it will say `EUC-KR (Windows)` or `CP 949` or `Windows-949` or `UHC`.
2. Remove the broken menu entry that currently and erroneously claims to support CP 949.
3. Forget about true `EUC-KR` support on saving.
Comment 1 Nicolas Fella 2023-01-07 12:53:43 UTC
The problem with CP 949 is that KDE knows it as "cp 949", but Qt (which is backed by ICU) doesn't know that name. It only knows "ms949", "windows-949", and "windows-949-2000", which seem to all refer to the same codec?

This is further complicated by the fact that CP 949 doesn't seem to be registered with IANA, which means it has no MIB, so any attempt to process codecs by their MIB will fail.
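
One way to picture the naming gap is a lookup that folds the various labels onto the single name the backend understands. The following is a hypothetical sketch only: the table and the function `canonical_name` are invented for illustration and are not taken from KCodecs, Qt, or ICU. The choice of "windows-949" as the target is an assumption based on the names listed in this comment.

```python
# Hypothetical alias table: labels a UI might show, folded onto one backend name.
ALIASES = {
    "cp 949": "windows-949",
    "cp949": "windows-949",
    "ms949": "windows-949",
    "windows-949-2000": "windows-949",
}


def canonical_name(label: str) -> str:
    """Normalize a user-visible encoding label to a backend name.

    Unknown labels are passed through in lowercase so the caller can
    still try them against the backend.
    """
    key = label.strip().lower()
    return ALIASES.get(key, key)


if __name__ == "__main__":
    for label in ("CP 949", "ms949", "windows-949", "UTF-8"):
        print(label, "->", canonical_name(label))
```

Looking codecs up by name in this way sidesteps the missing MIB, at the cost of maintaining the alias table by hand.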
Comment 2 Shinjo Park 2023-01-07 13:18:41 UTC
Since I have not stored any Korean text in CP949 and have used UTF-8 for everything for decades, even I was not fully aware of the situation...

> This price was worth it to ensure that a typo character (`낥` instead of `날`) would not be lost.
Not only for typos or characters still being composed, but also for some proper names and newly invented words.

> It only knows "ms949", "windows-949", and "windows-949-2000", which seem to all refer to the same codec?
According to https://icu4c-demos.unicode.org/icu-bin/convexp?conv=windows-949 it seems true. To make things worse, one of the aliases (KSC_5601) is shared between CP949 and EUC-KR internally in ICU (see https://icu4c-demos.unicode.org/icu-bin/convexp?conv=euc-kr). As CP949 is a superset of EUC-KR, most Korean implementations use them interchangeably; this is somewhat reflected even in the WHATWG (https://encoding.spec.whatwg.org/#names-and-labels). So I think it is just a matter of naming.
Comment 3 Bug Janitor Service 2023-01-07 14:13:43 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/kcodecs/-/merge_requests/29
Comment 4 Nicolas Fella 2023-01-07 14:15:10 UTC
(In reply to Bug Janitor Service from comment #3)
> A possibly relevant merge request was started @
> https://invent.kde.org/frameworks/kcodecs/-/merge_requests/29

This makes saving as CP 949 work. As a side effect, the UI now says windows-949. Unfortunately, there's no separation between internal name and user-visible name here.
Comment 5 Jonathan Joseph Chiarella 2023-01-07 18:09:42 UTC
From Nicolas Fella:
>> It only knows "ms949", "windows-949", and "windows-949-2000", which seem to all refer to the same codec?

From Shinjo Park:
> According to https://icu4c-demos.unicode.org/icu-bin/convexp?conv=windows-949 it seems true. To make things worse, one of the aliases (KSC_5601) is shared between CP949 and EUC-KR internally in ICU (see https://icu4c-demos.unicode.org/icu-bin/convexp?conv=euc-kr).

For what it is worth, the WHATWG erroneously calls the Windows tweak of Shift JIS just `Shift JIS`, when the encoding in question is Windows-932/Code Page 932/Windows-31J.

However, the IANA *does* get Japanese right: <https://www.iana.org/assignments/character-sets/character-sets.xhtml>

Thank you for all that you at KDE do. It must be very difficult managing legacy encodings and dealing with conflicting standards and old developers' choices locked into place.
Comment 6 Nicolas Fella 2023-01-08 00:25:58 UTC
Git commit 33f044ed60caac45f34fbfbb0d7a07363dac0648 by Nicolas Fella.
Committed on 07/01/2023 at 14:04.
Pushed by nicolasfella into branch 'master'.

Fix name for CP 949 in KCharsets::encodingsByScript

We know it as 'cp 949', but Qt/ICU doesn't. They know it as "windows-949", "windows-949-2000", or "ms-949".

Use windows-949 as the canonical name we use to make it compatible with QTextCodec/QStringConverter

M  +5    -5    src/kcharsets.cpp

https://invent.kde.org/frameworks/kcodecs/commit/33f044ed60caac45f34fbfbb0d7a07363dac0648
Comment 7 Nicolas Fella 2023-01-08 00:27:51 UTC
Git commit f2d5dbdb6174dc43edd631e07b9feacbe9851050 by Nicolas Fella.
Committed on 08/01/2023 at 00:27.
Pushed by nicolasfella into branch 'kf5'.

Fix name for CP 949 in KCharsets::encodingsByScript

We know it as 'cp 949', but Qt/ICU doesn't. They know it as "windows-949", "windows-949-2000", or "ms-949".

Use windows-949 as the canonical name we use to make it compatible with QTextCodec/QStringConverter
(cherry picked from commit 33f044ed60caac45f34fbfbb0d7a07363dac0648)

M  +5    -5    src/kcharsets.cpp

https://invent.kde.org/frameworks/kcodecs/commit/f2d5dbdb6174dc43edd631e07b9feacbe9851050