Summary: | KDE Text Encoding for Korean (applies to KWrite and SubtitleComposer in Flatpaks) | ||
---|---|---|---|
Product: | [I don't know] kde | Reporter: | Jonathan Joseph Chiarella <j_j_chiarella> |
Component: | general | Assignee: | Unassigned bugs mailing-list <unassigned-bugs> |
Status: | CONFIRMED --- | ||
Severity: | normal | CC: | hein, kde, nate, nicolas.fella |
Priority: | NOR | ||
Version: | unspecified | ||
Target Milestone: | --- | ||
Platform: | Ubuntu | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: |
Description
Jonathan Joseph Chiarella
2023-01-04 20:10:26 UTC
The problem with CP 949 is that KDE knows it as "cp 949", but Qt (which is backed by ICU) doesn't know that name. It only knows "ms949", "windows-949", and "windows-949-2000", which seem to all refer to the same codec? This is further complicated by the fact that CP 949 doesn't seem to be registered with IANA, which means it has no MIB, so any attempt to process codecs by their MIB will fail.

As I haven't stored any Korean text in CP949 and have used UTF-8 for everything for decades, even I was not fully aware of the situation...

> This price was worth it to ensure that a typo character (`낥` instead of `날`) would not be lost.

Not only for typos or characters in composition, but also for some proper names and newly invented words.

> It only knows "ms949", "windows-949", and "windows-949-2000", which seem to all refer to the same codec?

According to https://icu4c-demos.unicode.org/icu-bin/convexp?conv=windows-949 it seems true. To make things worse, one of the aliases (KSC_5601) is shared between CP949 and EUC-KR internally in ICU (see https://icu4c-demos.unicode.org/icu-bin/convexp?conv=euc-kr). As CP949 is a superset of EUC-KR, most Korean implementations use them interchangeably; this is somewhat reflected even in the WHATWG Encoding Standard (https://encoding.spec.whatwg.org/#names-and-labels). So I think it is just a matter of naming.

A possibly relevant merge request was started @ https://invent.kde.org/frameworks/kcodecs/-/merge_requests/29

(In reply to Bug Janitor Service from comment #3)
> A possibly relevant merge request was started @
> https://invent.kde.org/frameworks/kcodecs/-/merge_requests/29

This makes saving as CP 949 work. As a side effect, the UI now says windows-949; unfortunately, there's no separation between the internal name and the user-visible name here.

From Nicolas Fella:
>> It only knows "ms949", "windows-949", and "windows-949-2000", which seem to all refer to the same codec?
From Shinjo Park:
> According to https://icu4c-demos.unicode.org/icu-bin/convexp?conv=windows-949 it seems true. To make things worse, one of the aliases (KSC_5601) is shared between CP949 and EUC-KR internally in ICU (see https://icu4c-demos.unicode.org/icu-bin/convexp?conv=euc-kr).

For what it is worth, the WHATWG erroneously calls the Windows tweak of Shift JIS just `Shift JIS`, when the encoding in question is Windows-932/Code Page 932/Windows-31J. However, the IANA *does* get Japanese right: <https://www.iana.org/assignments/character-sets/character-sets.xhtml>

Thank you for all that you at KDE do. It must be very difficult managing legacy encodings and dealing with conflicting standards and old developers' choices locked into place.

Git commit 33f044ed60caac45f34fbfbb0d7a07363dac0648 by Nicolas Fella.
Committed on 07/01/2023 at 14:04.
Pushed by nicolasfella into branch 'master'.

Fix name for CP 949 in KCharsets::encodingsByScript

We know it as 'cp 949', but Qt/ICU doesn't. They know it as "windows-949", "windows-949-2000", or "ms-949".

Use windows-949 as the canonical name we use to make it compatible with QTextCodec/QStringConverter

M +5 -5 src/kcharsets.cpp

https://invent.kde.org/frameworks/kcodecs/commit/33f044ed60caac45f34fbfbb0d7a07363dac0648

Git commit f2d5dbdb6174dc43edd631e07b9feacbe9851050 by Nicolas Fella.
Committed on 08/01/2023 at 00:27.
Pushed by nicolasfella into branch 'kf5'.

Fix name for CP 949 in KCharsets::encodingsByScript

We know it as 'cp 949', but Qt/ICU doesn't. They know it as "windows-949", "windows-949-2000", or "ms-949".

Use windows-949 as the canonical name we use to make it compatible with QTextCodec/QStringConverter

(cherry picked from commit 33f044ed60caac45f34fbfbb0d7a07363dac0648)

M +5 -5 src/kcharsets.cpp

https://invent.kde.org/frameworks/kcodecs/commit/f2d5dbdb6174dc43edd631e07b9feacbe9851050 |
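
For anyone following along, the naming mismatch discussed in this report comes down to translating a user-visible label into a name the conversion backend recognizes. Below is a minimal sketch of that idea in Python, not the C++ used in kcodecs. The `LABEL_TO_BACKEND` table and the `resolve_codec` helper are hypothetical names made up for illustration, and the backend here is Python's own codec registry (which calls this codec `cp949`), not Qt/ICU, so the right-hand names differ from the ones in the commits.

```python
import codecs

# Hypothetical table: labels seen in this report -> the name this backend accepts.
# Assumption: the backend is Python's codec registry, which uses "cp949";
# with Qt/ICU the target name would be "windows-949" instead.
LABEL_TO_BACKEND = {
    "cp 949": "cp949",
    "ms949": "cp949",
    "windows-949": "cp949",
    "windows-949-2000": "cp949",
}


def resolve_codec(label: str) -> codecs.CodecInfo:
    """Resolve a user-visible encoding label to a codec.

    The label is trimmed and lowercased, translated through the table when
    it has an entry, and otherwise passed to the registry unchanged.
    Raises LookupError if the backend does not know the resulting name.
    """
    key = label.strip().lower()
    return codecs.lookup(LABEL_TO_BACKEND.get(key, key))


if __name__ == "__main__":
    codec = resolve_codec("cp 949")
    # Round-trip the syllable mentioned in the report through the resolved codec.
    raw = "낥".encode(codec.name)
    print(codec.name, len(raw), raw.decode(codec.name))
```

Keeping the table separate from the label shown in the UI is one possible way to leave the display name as "cp 949" while handing the backend whatever name it expects; whether that fits KCharsets is not something this sketch claims.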