Bug 449076 - BOM-less UTF8 cannot be detected
Summary: BOM-less UTF8 cannot be detected
Status: RESOLVED FIXED
Alias: None
Product: kdiff3
Classification: Applications
Component: application
Version: 1.9.4
Platform: Microsoft Windows
Importance: NOR normal
Target Milestone: ---
Assignee: michael
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-01-24 14:00 UTC by 石庭豐
Modified: 2022-01-25 18:42 UTC
0 users

See Also:
Latest Commit:
Version Fixed In: 1.9.70
Sentry Crash Report:


Attachments
Observed result -- some gibberish characters (11.91 KB, image/png)
2022-01-24 14:00 UTC, 石庭豐
If UTF-8 is manually chosen, no more gibberish character (76.72 KB, image/png)
2022-01-24 14:01 UTC, 石庭豐

Description 石庭豐 2022-01-24 14:00:53 UTC
Created attachment 145855
Observed result -- some gibberish characters

SUMMARY
By default, "Auto Detect Unicode" only works for UTF-8 files that have a BOM. The Help should say this explicitly and tell users to either select the encoding manually or add a BOM.

PS: I haven't had time to test with UTF-16 or UTF-32, so I don't know whether they are affected.

STEPS TO REPRODUCE
1. In Options > Regional Settings, make sure "Auto Detect Unicode" is checked.
2. Use files that are UTF-8 but have no BOM.
3. Compare the files
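
To reproduce with known inputs, the two kinds of file can be generated programmatically. The following is a minimal Python sketch and not part of KDiff3; the file names and the sample text are made up for illustration. It relies on Python's `utf-8-sig` codec, which writes the three-byte BOM `EF BB BF`, whereas the plain `utf-8` codec does not.

```python
import codecs
import os
import tempfile

SAMPLE = "café naïve\n"  # hypothetical sample text with non-ASCII characters


def write_samples(directory):
    """Write the same text twice: once with a UTF-8 BOM, once without."""
    with_bom = os.path.join(directory, "with_bom.txt")
    without_bom = os.path.join(directory, "without_bom.txt")
    with open(with_bom, "w", encoding="utf-8-sig", newline="") as f:
        f.write(SAMPLE)
    with open(without_bom, "w", encoding="utf-8", newline="") as f:
        f.write(SAMPLE)
    return with_bom, without_bom


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        a, b = write_samples(tmp)
        print(has_utf8_bom(a), has_utf8_bom(b))
```

Comparing the BOM-less file against any other file in an affected version should trigger the problem described under OBSERVED RESULT.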

OBSERVED RESULT
Characters outside 7-bit ASCII are displayed incorrectly. See the attached image (kdiff3-bomless-utf3-observed-result.png), in which every ONE of those characters is displayed as TWO characters, a sign that the UTF-8 text files are not detected correctly.
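
The doubling is what one would expect when two-byte UTF-8 sequences are decoded with a single-byte code page. The Python sketch below illustrates the effect; Windows-1252 (`cp1252`) is an assumption here, since the report does not say which fallback encoding KDiff3 used, and `misread` is a hypothetical helper name.

```python
def misread(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


original = "é"  # U+00E9, two bytes in UTF-8: C3 A9
garbled = misread(original)
print(len(original), len(garbled))  # 1 2
print(garbled)                      # Ã©
```

Characters that need three or four bytes in UTF-8 would show up as three or four wrong characters under the same assumption.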

EXPECTED RESULT
Correct characters are displayed. This is shown in my other attached image (kdiff3-bomless-utf3-expected-result.png), which is what we get IF we specify UTF-8 manually instead of relying on the "Auto Detect Unicode" option.

SOFTWARE/OS VERSIONS
Windows: Windows 11 (but this is irrelevant, IMO)
KDE Frameworks Version: 5.88.0
Qt Version: 

ADDITIONAL INFORMATION
This bug was previously reported in:
https://sourceforge.net/p/kdiff3/discussion/197499/thread/78e8dcc2/?limit=25#0a95
and in:
https://sourceforge.net/p/kdiff3/bugs/197/
Comment 1 石庭豐 2022-01-24 14:01:49 UTC
Created attachment 145856
If UTF-8 is manually chosen, no more gibberish character

Comment 2 michael 2022-01-25 18:42:09 UTC
Git commit fc59f1005f41940ca8b62d152b63b4cdf822a5c3 by Michael Reeves.
Committed on 25/01/2022 at 18:38.
Pushed by mreeves into branch 'master'.

Document "Auto Detect Unicode".
FIXED-IN:1.9.70

M  +2    -0    doc/en/index.docbook

https://invent.kde.org/sdk/kdiff3/commit/fc59f1005f41940ca8b62d152b63b4cdf822a5c3
Comment 3 michael 2022-01-25 18:42:15 UTC
Git commit b96f5d7d36bccddea5a1bfa500a0d7436c2dbf1e by Michael Reeves.
Committed on 24/01/2022 at 23:51.
Pushed by mreeves into branch 'master'.

fix: Attempt to autodetect non-BOM UTF-8

This is not foolproof and can't be, but it's better than not checking at all.
Basically, anything that can be a UTF-8 file will be interpreted as such by default if using auto detection.

M  +15   -1    src/SourceData.cpp
M  +1    -0    src/SourceData.h

https://invent.kde.org/sdk/kdiff3/commit/b96f5d7d36bccddea5a1bfa500a0d7436c2dbf1e
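
The heuristic described in this commit message can be illustrated with a short Python sketch. This is not the code from src/SourceData.cpp; the function name `guess_encoding` and the `cp1252` fallback are assumptions made for illustration. The idea is to honor a BOM if one is present, otherwise attempt a strict UTF-8 decode and use the fallback only if that fails.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess a text encoding: BOM first, then strict UTF-8, then fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

The sketch also shows why such a check cannot be foolproof: the bytes `C3 A9` are valid UTF-8 for "é" and equally valid Windows-1252 for "Ã©", so a legacy-encoded file that happens to form valid UTF-8 sequences will be classified as UTF-8. Pure ASCII input is reported as UTF-8 too, which is harmless because the two encodings agree on those bytes.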
Comment 4 michael 2022-01-25 18:42:44 UTC
Git commit 5ee349ee95d7e1473f6fdc9edf02d0cdc3213836 by Michael Reeves.
Committed on 24/01/2022 at 23:56.
Pushed by mreeves into branch '1.9'.

fix: Attempt to autodetect non-BOM UTF-8

This is not foolproof and can't be, but it's better than not checking at all.
Basically, anything that can be a UTF-8 file will be interpreted as such by default if using auto detection.
(cherry picked from commit b96f5d7d36bccddea5a1bfa500a0d7436c2dbf1e)

M  +15   -1    src/SourceData.cpp
M  +1    -0    src/SourceData.h

https://invent.kde.org/sdk/kdiff3/commit/5ee349ee95d7e1473f6fdc9edf02d0cdc3213836