491824 – KEncodingProber misdetects short UTF-8 text as Shift_JIS or gb18030 with high confidence=0.99


Status:	REPORTED

Alias:	None

Product:	frameworks-kcodecs
Classification:	Frameworks and Libraries
Component:	general (other bugs)
Version First Reported In:	6.5.0
Platform:	Compiled Sources Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	kdelibs bugs

URL:
Keywords:

Depends on:
Blocks:

Reported:	2024-08-17 12:50 UTC by Igor Kushnir
Modified:	2024-08-17 17:54 UTC
CC List:	1 user

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:

Attachments
Patch for kencodingprobertest.cpp that demonstrates the bug (1.92 KB, patch) 2024-08-17 12:50 UTC, Igor Kushnir	Details


Description Igor Kushnir 2024-08-17 12:50:05 UTC

Created attachment 172704 [details]
Patch for kencodingprobertest.cpp that demonstrates the bug

SUMMARY
KEncodingProber detects four UTF-8-encoded Russian characters (8 bytes) as Shift_JIS with confidence=0.99. Appending the end-of-line character '\n' to these 8 bytes makes KEncodingProber detect the short text as gb18030 instead, with the same high confidence of 0.99. This issue was discovered while testing KDevelop's single use of KEncodingProber: https://invent.kde.org/kdevelop/kdevelop/-/issues/71#note_1007105
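
To illustrate why such a short input is ambiguous, here is a minimal sketch in plain C++ that checks the 8 bytes of "Этот" against simplified byte-structure rules for UTF-8, Shift_JIS and GB18030. This is not KEncodingProber's algorithm. The names Scan, inRange, scanUtf8, scanShiftJis, scanGb18030 and kSample are hypothetical, and the byte ranges are simplified assumptions (for example, the UTF-8 check does not reject every overlong or surrogate form).

```cpp
#include <cstddef>
#include <string>

// Result of a structural scan (hypothetical type, for illustration only).
enum class Scan { Valid, Incomplete, Invalid };

static bool inRange(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

// Simplified UTF-8 check: lead byte selects the length, the rest must be 0x80..0xBF.
Scan scanUtf8(const std::string &s)
{
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = s[i];
        std::size_t len = 0;
        if (b <= 0x7F) len = 1;
        else if (inRange(b, 0xC2, 0xDF)) len = 2;
        else if (inRange(b, 0xE0, 0xEF)) len = 3;
        else if (inRange(b, 0xF0, 0xF4)) len = 4;
        else return Scan::Invalid;
        for (std::size_t k = 1; k < len; ++k) {
            if (i + k >= s.size()) return Scan::Incomplete;
            if (!inRange(s[i + k], 0x80, 0xBF)) return Scan::Invalid;
        }
        i += len;
    }
    return Scan::Valid;
}

// Simplified Shift_JIS check: ASCII and 0xA1..0xDF are single bytes;
// 0x81..0x9F and 0xE0..0xFC start a two-byte character.
Scan scanShiftJis(const std::string &s)
{
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = s[i];
        if (b <= 0x7F || inRange(b, 0xA1, 0xDF)) { ++i; continue; }
        if (!inRange(b, 0x81, 0x9F) && !inRange(b, 0xE0, 0xFC)) return Scan::Invalid;
        if (i + 1 >= s.size()) return Scan::Incomplete;
        const unsigned char t = s[i + 1];
        if (!inRange(t, 0x40, 0x7E) && !inRange(t, 0x80, 0xFC)) return Scan::Invalid;
        i += 2;
    }
    return Scan::Valid;
}

// Simplified GB18030 check: ASCII, two-byte, or four-byte sequences.
Scan scanGb18030(const std::string &s)
{
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = s[i];
        if (b <= 0x7F) { ++i; continue; }
        if (!inRange(b, 0x81, 0xFE)) return Scan::Invalid;
        if (i + 1 >= s.size()) return Scan::Incomplete;
        const unsigned char t = s[i + 1];
        if (inRange(t, 0x40, 0x7E) || inRange(t, 0x80, 0xFE)) { i += 2; continue; }
        if (!inRange(t, 0x30, 0x39)) return Scan::Invalid;
        if (i + 2 >= s.size()) return Scan::Incomplete;
        if (!inRange(s[i + 2], 0x81, 0xFE)) return Scan::Invalid;
        if (i + 3 >= s.size()) return Scan::Incomplete;
        if (!inRange(s[i + 3], 0x30, 0x39)) return Scan::Invalid;
        i += 4;
    }
    return Scan::Valid;
}

// The 8 UTF-8 bytes of "Этот": D0 AD D1 82 D0 BE D1 82.
const std::string kSample = "\xD0\xAD\xD1\x82\xD0\xBE\xD1\x82";
```

Under these simplified rules, kSample is complete UTF-8 and complete GB18030, while the Shift_JIS scan ends on a lead byte that is still waiting for its trail byte. Appending '\n' makes the Shift_JIS reading invalid and leaves the other two valid, which is at least consistent with the switch from Shift_JIS to gb18030 described above. Whether this is what drives KEncodingProber's result has not been verified.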

Also, KEncodingProber::reset() leaves behind previously fed data, although it is documented as "reset the prober's internal state and data." So either reset()'s behavior is wrong or the documentation is misleading.

STEPS TO REPRODUCE
1. Download the attached patch, apply it to kcodecs and build.
2. Run the following command from the build directory of kcodecs: QT_LOGGING_RULES='default.debug=true' ./bin/kencodingprobertest
3. Replace `#if 1` with `#if 0` in the code added by the patch and rebuild kcodecs.
4. Repeat step 2.
5. Read and compare the two test run outputs and the patch itself.

Comment 1 Igor Kushnir 2024-08-17 16:10:03 UTC

OBSERVED RESULT
Step 2:
QDEBUG : KEncodingProberTest::testProbe() Text: "Этот"
QDEBUG : KEncodingProberTest::testProbe() state: 2 confidence: 0.99 encoding: "Shift_JIS"
XFAIL  : KEncodingProberTest::testProbe() KEncodingProber misdetects short UTF-8 text as Shift_JIS or gb18030
   Loc: [kcodecs/autotests/kencodingprobertest.cpp(54)]
QDEBUG : KEncodingProberTest::testProbe() state: 2 confidence: 0.277778 encoding: "gb18030"
XFAIL  : KEncodingProberTest::testProbe() KEncodingProber::reset() leaves behind earlier fed data, so this is detected as gb18030 now
   Loc: [kcodecs/autotests/kencodingprobertest.cpp(60)]
QDEBUG : KEncodingProberTest::testProbe() state: 2 confidence: 0.444444 encoding: "gb18030"
XFAIL  : KEncodingProberTest::testProbe() KEncodingProber::reset() leaves behind earlier fed data, so the confidence is lower now
   Loc: [kcodecs/autotests/kencodingprobertest.cpp(66)]

Step 4: same as step 2, except that "Shift_JIS" is replaced by "gb18030" at the end of the second line of the posted output.

EXPECTED RESULT
A. KEncodingProber detects the encoding of the short UTF-8 text as UTF-8, or as some other encoding but with a confidence much lower than 0.99.
B. The documented and actual behavior of KEncodingProber::reset() match.