| Summary: | KEncodingProber misdetects short UTF-8 text as Shift_JIS or gb18030 with high confidence=0.99 | ||
|---|---|---|---|
| Product: | [Frameworks and Libraries] frameworks-kcodecs | Reporter: | Igor Kushnir <igorkuo> |
| Component: | general | Assignee: | kdelibs bugs <kdelibs-bugs-null> |
| Status: | REPORTED --- | ||
| Severity: | normal | CC: | fanzhuyifan |
| Priority: | NOR | ||
| Version First Reported In: | 6.5.0 | ||
| Target Milestone: | --- | ||
| Platform: | Compiled Sources | ||
| OS: | Linux | ||
| Latest Commit: | Version Fixed/Implemented In: | ||
| Sentry Crash Report: | |||
| Attachments: | Patch for kencodingprobertest.cpp that demonstrates the bug | ||
Description
Igor Kushnir
2024-08-17 12:50:05 UTC

OBSERVED RESULT

Step 2:
QDEBUG : KEncodingProberTest::testProbe() Text: "Этот"
QDEBUG : KEncodingProberTest::testProbe() state: 2 confidence: 0.99 encoding: "Shift_JIS"
XFAIL  : KEncodingProberTest::testProbe() KEncodingProber misdetects short UTF-8 text as Shift_JIS or gb18030
   Loc: [kcodecs/autotests/kencodingprobertest.cpp(54)]
QDEBUG : KEncodingProberTest::testProbe() state: 2 confidence: 0.277778 encoding: "gb18030"
XFAIL  : KEncodingProberTest::testProbe() KEncodingProber::reset() leaves behind earlier fed data, so this is detected as gb18030 now
   Loc: [kcodecs/autotests/kencodingprobertest.cpp(60)]
QDEBUG : KEncodingProberTest::testProbe() state: 2 confidence: 0.444444 encoding: "gb18030"
XFAIL  : KEncodingProberTest::testProbe() KEncodingProber::reset() leaves behind earlier fed data, so the confidence is lower now
   Loc: [kcodecs/autotests/kencodingprobertest.cpp(66)]

Step 4: same as step 2, except that "Shift_JIS" is replaced by "gb18030" at the end of the second line of the posted output.

EXPECTED RESULT

A. KEncodingProber detects the encoding of the short UTF-8 text as UTF-8. Or as some other encoding but with a confidence much lower than 0.99.
B. The documented and actual behavior of KEncodingProber::reset() match.
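As a rough illustration of expected result A, the sketch below checks that the sample text from the test output, "Этот", is well-formed UTF-8 (bytes D0 AD D1 82 D0 BE D1 82). It is a standalone example and not part of KCodecs: the function name isWellFormedUtf8 is hypothetical, and nothing here claims to reflect how KEncodingProber works internally or how a fix should be implemented.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper (not a KCodecs API): returns true if `bytes` is
// well-formed UTF-8. Rejects truncated sequences, stray continuation bytes,
// overlong forms, surrogates, and code points above U+10FFFF.
bool isWellFormedUtf8(std::string_view bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const auto b0 = static_cast<std::uint8_t>(bytes[i]);
        if (b0 < 0x80) { // single-byte (ASCII) character
            ++i;
            continue;
        }

        std::size_t len = 0;       // total length of the sequence in bytes
        std::uint32_t cp = 0;      // decoded code point
        std::uint32_t minCp = 0;   // smallest code point allowed for this length
        if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = b0 & 0x1F; minCp = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = b0 & 0x0F; minCp = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = b0 & 0x07; minCp = 0x10000;
        } else {
            return false; // stray continuation byte or invalid lead byte
        }

        if (n - i < len) {
            return false; // sequence is cut off at the end of the input
        }
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<std::uint8_t>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) {
                return false; // expected a continuation byte (10xxxxxx)
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false; // overlong encoding, out of range, or surrogate
        }
        i += len;
    }
    return true;
}
```

Calling isWellFormedUtf8("\xD0\xAD\xD1\x82\xD0\xBE\xD1\x82") returns true, so the input from step 2 is at least a valid UTF-8 byte sequence. Validity alone does not prove the text is UTF-8, but it is consistent with the expectation that a confidence of 0.99 for Shift_JIS is too high for this input.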