Bug 449076

Summary: BOM-less UTF8 cannot be detected
Product: [Applications] kdiff3 Reporter: 石庭豐 <lapsap7+kde>
Component: application Assignee: michael <reeves.87>
Status: RESOLVED FIXED    
Severity: normal    
Priority: NOR    
Version: 1.9.4   
Target Milestone: ---   
Platform: Microsoft Windows   
OS: Microsoft Windows   
Latest Commit: Version Fixed In: 1.9.70
Sentry Crash Report:
Attachments: Observed result -- some gibberish characters
If UTF-8 is manually chosen, no more gibberish character

Description 石庭豐 2022-01-24 14:00:53 UTC
Created attachment 145855 [details]
Observed result -- some gibberish characters

SUMMARY
By default, the "Auto Detect Unicode" option only works for UTF-8 files that have a BOM. The Help should state this more precisely and tell users to either change the encoding manually or add a BOM.
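For illustration, BOM-based detection in general amounts to comparing the first bytes of a file against a few fixed signatures. The following is a minimal C++ sketch of that idea, not KDiff3's actual code; the names `Bom`, `startsWithBytes` and `detectBom` are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// True if `data` begins with the `len` bytes stored at `prefix`.
static bool startsWithBytes(const std::string& data, const char* prefix, std::size_t len) {
    return data.size() >= len && data.compare(0, len, prefix, len) == 0;
}

// Classifies the byte order mark at the start of `data`, if any.
Bom detectBom(const std::string& data) {
    // UTF-32 is tested before UTF-16 because the UTF-32LE mark (FF FE 00 00)
    // begins with the UTF-16LE mark (FF FE).
    if (startsWithBytes(data, "\xFF\xFE\x00\x00", 4)) return Bom::Utf32LE;
    if (startsWithBytes(data, "\x00\x00\xFE\xFF", 4)) return Bom::Utf32BE;
    if (startsWithBytes(data, "\xEF\xBB\xBF", 3)) return Bom::Utf8;
    if (startsWithBytes(data, "\xFF\xFE", 2)) return Bom::Utf16LE;
    if (startsWithBytes(data, "\xFE\xFF", 2)) return Bom::Utf16BE;
    return Bom::None;
}
```

A UTF-8 file saved without a BOM starts directly with its text, so a check of this kind returns `Bom::None` and any BOM-only detection has nothing to go on.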

PS: I have not had time to test with UTF-16 or UTF-32, so I do not know whether they are affected.

STEPS TO REPRODUCE
1. In Options > Regional Settings, make sure "Auto Detect Unicode" is checked.
2. Use files that are UTF-8 encoded but have no BOM
3. Compare the files

OBSERVED RESULT
Characters outside 7-bit ASCII are displayed incorrectly. See my attached image (kdiff3-bomless-utf3-observed-result.png), in which each of those characters is displayed as two characters, a sign that the UTF-8 text files are not detected correctly.
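The "one character shown as two" symptom is what one would expect if a two-byte UTF-8 sequence is decoded with a single-byte encoding such as Latin-1. That the fallback here is a single-byte encoding is an assumption, not something confirmed in this report. A minimal C++ sketch with hypothetical helper names:

```cpp
#include <cstddef>
#include <string>

// Number of characters a decoder sees if it treats every byte as one
// character, as a single-byte encoding such as Latin-1 does.
std::size_t countAsSingleByte(const std::string& bytes) {
    return bytes.size();
}

// Number of code points if the same bytes are decoded as UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx and do not start a new
// code point, so they are not counted. Assumes well-formed UTF-8 input.
std::size_t countAsUtf8(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, U+00E9 is stored in UTF-8 as the two bytes C3 A9: `countAsUtf8` reports one code point for them, while `countAsSingleByte` reports two characters.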

EXPECTED RESULT
Correct characters are displayed. My other attached image (kdiff3-bomless-utf3-expected-result.png) shows this result when UTF-8 is specified manually instead of relying on the "Auto Detect Unicode" option.

SOFTWARE/OS VERSIONS
Windows: Windows 11 (but this is irrelevant, IMO)
KDE Frameworks Version: 5.88.0
Qt Version: 

ADDITIONAL INFORMATION
This bug was previously reported in:
https://sourceforge.net/p/kdiff3/discussion/197499/thread/78e8dcc2/?limit=25#0a95
and in:
https://sourceforge.net/p/kdiff3/bugs/197/
Comment 1 石庭豐 2022-01-24 14:01:49 UTC
Created attachment 145856 [details]
If UTF-8 is manually chosen, no more gibberish character

If UTF-8 is manually chosen, no more gibberish character
Comment 2 michael 2022-01-25 18:42:09 UTC
Git commit fc59f1005f41940ca8b62d152b63b4cdf822a5c3 by Michael Reeves.
Committed on 25/01/2022 at 18:38.
Pushed by mreeves into branch 'master'.

Document "Auto Detect Unicode".
FIXED-IN:1.9.70

M  +2    -0    doc/en/index.docbook

https://invent.kde.org/sdk/kdiff3/commit/fc59f1005f41940ca8b62d152b63b4cdf822a5c3
Comment 3 michael 2022-01-25 18:42:15 UTC
Git commit b96f5d7d36bccddea5a1bfa500a0d7436c2dbf1e by Michael Reeves.
Committed on 24/01/2022 at 23:51.
Pushed by mreeves into branch 'master'.

fix: Attempt to autodetect non-BOM UTF-8

This is not foolproof and cannot be, but it is better than not checking at all.
Basically, anything that can be a UTF-8 file will be interpreted as such by default when auto detection is used.
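
The heuristic described above can be sketched roughly as follows. This is an illustration only and not the code from the commit: `looksLikeUtf8`, `GuessedEncoding` and `guessEncoding` are hypothetical names, and the sketch assumes the whole file content is available as a byte string.

```cpp
#include <cstddef>
#include <string>

// Returns true if `data` is a well-formed UTF-8 byte sequence. Overlong
// encodings, surrogates (U+D800..U+DFFF) and values above U+10FFFF are rejected.
bool looksLikeUtf8(const std::string& data) {
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        if (lead < 0x80) {  // plain ASCII byte
            ++i;
            continue;
        }
        std::size_t extra = 0;  // continuation bytes expected after `lead`
        char32_t cp = 0;        // decoded code point
        char32_t minCp = 0;     // smallest value allowed for this length
        if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minCp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (n - i <= extra) return false;  // sequence is cut off at end of data
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += extra + 1;
    }
    return true;
}

// Hypothetical decision step: prefer UTF-8 whenever the bytes allow it.
enum class GuessedEncoding { Utf8, Fallback };

GuessedEncoding guessEncoding(const std::string& data) {
    return looksLikeUtf8(data) ? GuessedEncoding::Utf8 : GuessedEncoding::Fallback;
}
```

The bytes C3 A9 illustrate why such a check cannot be foolproof: they are valid UTF-8 for U+00E9, but they are also valid Latin-1 text made of two characters, and the heuristic always picks UTF-8. A lone byte E9 (U+00E9 in Latin-1) is not valid UTF-8, so it would fall through to the fallback encoding.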

M  +15   -1    src/SourceData.cpp
M  +1    -0    src/SourceData.h

https://invent.kde.org/sdk/kdiff3/commit/b96f5d7d36bccddea5a1bfa500a0d7436c2dbf1e
Comment 4 michael 2022-01-25 18:42:44 UTC
Git commit 5ee349ee95d7e1473f6fdc9edf02d0cdc3213836 by Michael Reeves.
Committed on 24/01/2022 at 23:56.
Pushed by mreeves into branch '1.9'.

fix: Attempt to autodetect non-BOM UTF-8

This is not foolproof and cannot be, but it is better than not checking at all.
Basically, anything that can be a UTF-8 file will be interpreted as such by default when auto detection is used.
(cherry picked from commit b96f5d7d36bccddea5a1bfa500a0d7436c2dbf1e)

M  +15   -1    src/SourceData.cpp
M  +1    -0    src/SourceData.h

https://invent.kde.org/sdk/kdiff3/commit/5ee349ee95d7e1473f6fdc9edf02d0cdc3213836