Summary: | Binary data with UTF BOM misdetected as plain text | ||
---|---|---|---|
Product: | [Frameworks and Libraries] frameworks-kfilemetadata | Reporter: | Malte S. Stretz <mss> |
Component: | general | Assignee: | Pinak Ahuja <pinak.ahuja> |
Status: | RESOLVED FIXED | ||
Severity: | crash | CC: | antonioni.rocha, stefan.bruens, tagwerk19 |
Priority: | NOR | Keywords: | drkonqi |
Version First Reported In: | 6.15.0 | ||
Target Milestone: | --- | ||
Platform: | Neon | ||
OS: | Linux | ||
Latest Commit: | https://invent.kde.org/frameworks/kfilemetadata/-/commit/0d63429c72a3839c36ec2134205977a44711b433 | Version Fixed In: | |
Sentry Crash Report: | |||
Attachments: |
Test case
Smaller test case |
Description
Malte S. Stretz
2025-07-04 08:14:33 UTC
According to the journal the crash is caused by this assertion:
> Jul 04 10:10:56 localhost baloo_file_extractor[10785]: ASSERT: "!term.isEmpty()" in file ./src/engine/document.cpp, line 23
(In reply to Malte S. Stretz from comment #0)
> #15 0x00007a8e93b100d1 in qt_assert (assertion=assertion@entry=0x7a8e9471b263 "!term.isEmpty()", file=file@entry=0x7a8e9471b231 "./src/engine/document.cpp", line=line@entry=23) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qassert.cpp:105
> #16 0x00007a8e946f15e2 in Baloo::Document::addPositionTerm (this=<optimized out>, term=..., position=<optimized out>) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/document.cpp:21
> #17 Baloo::Document::addPositionTerm (this=<optimized out>, term=..., position=<optimized out>) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/document.cpp:21
> #18 0x00007a8e947075e6 in Baloo::TermGenerator::indexText (this=0x7ffc29741968, text=..., prefix=...) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/termgenerator.cpp:110

Possibly an instance of Bug 506516 / Bug 506187, it looks suspiciously similar....

I'll cut and paste from https://bugs.kde.org/show_bug.cgi?id=506516#c1, it would be interesting if you've got the root cause:

    ... need to find out which file is causing trouble. In 506187 a UTF-16 file was the cause (a UTF-16 file that contained Chinese/Japanese scripts but that may not be a "necessary condition").

    You should be able to follow what's being indexed by running "balooctl6 monitor" or by enabling logging and following the journal. You can set up logging by creating a "~/.config/QtProject/qtlogging.ini" file containing:

        [rules]
        kf.baloo=true
        kf.baloo.*=true
        kf.kfilemetadata=true

    and then restart Baloo, you might need:

        $ pkill baloo_file
        $ systemctl start --user kde-baloo

If you are seeing the same issue, the good news is that it's not there in Neon Testing or Unstable.

Thanks for the pointers.
I can reproduce the issue via
```
find "$HOME/Downloads/Sensus Update SPA (Europe)" -type f -print0 | xargs -0 balooctl6 clear
find "$HOME/Downloads/Sensus Update SPA (Europe)" -type f -print0 | xargs -0 balooctl6 check
```
Unfortunately, `balooctl6 monitor` is not very helpful in telling me which file had the issue, but since it happens quite quickly I think I will be able to identify the file.

(In reply to Malte S. Stretz from comment #3)
> Unfortunately, `balooctl6 monitor` is not very helpful in telling me which
> file had the issue, but since it happens quite quickly I think I will be able
> to identify the file.

You might get an "Indexing: /home/whoever/whereever/whatever" followed immediately by an "Idle". If the indexing succeeds you'd get an "Ok" when it is done.

You can try "balooctl6 status" and "balooctl6 failed" to see whether Baloo has recognised the crash and flagged the failure in its index.

(In reply to tagwerk19 from comment #4)
> (In reply to Malte S. Stretz from comment #3)
> > Unfortunately, `balooctl6 monitor` is not very helpful in telling me which
> > file had the issue, but since it happens quite quickly I think I will be able
> > to identify the file.
> You might get an "Indexing: /home/whoever/whereever/whatever" followed
> immediately by an "Idle". If the indexing succeeds you'd get an "Ok" when it
> is done.
>
> You can try "balooctl6 status" and "balooctl6 failed" to see whether Baloo
> has recognised the crash and flagged the failure in its index.

Ok, the trick is to look for lines which just do not end with an `Ok` in the monitor output. I first thought they were all Ok while baloo was crashing left and right.

I will attach one of the files. `file` (and I guess thus baloo, too) thinks they are UTF-32 text files but they are binary files which just happen to start with a BOM (some speech synthesis stuff, this is some quite outdated update for a Volvo navigation system).
```
# file ASYNTH/SPEECHVF/ES_SPA/FEMALE/COMPS/DEPES_MO.DAT
ASYNTH/SPEECHVF/ES_SPA/FEMALE/COMPS/DEPES_MO.DAT: Unicode text, UTF-32, little-endian
# hexdump -C ASYNTH/SPEECHVF/ES_SPA/FEMALE/COMPS/DEPES_MO.DAT | head
00000000  ff fe 00 00 53 43 41 4e  53 4f 46 54 64 65 70 65  |....SCANSOFTdepe|
00000010  73 00 00 00 5c 00 00 00  31 2e 30 30 00 00 00 00  |s...\...1.00....|
00000020  73 70 65 00 6d 6f 6e 69  63 61 00 00 00 00 00 00  |spe.monica......|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 00 00 01 00 00 00  66 65 5f 64 65 70 65 73  |........fe_depes|
00000050  00 00 00 00 5c 00 00 00  d8 27 00 00 0e 00 00 00  |....\....'......|
00000060  67 6c 6f 62 61 6c 00 00  02 00 00 00 04 09 00 00  |global..........|
00000070  05 00 00 00 09 00 00 00  14 00 00 00 06 00 00 00  |................|
00000080  21 00 00 00 05 00 00 00  a0 00 00 00 7c 0b 00 00  |!...........|...|
00000090  00 00 02 02 02 02 02 02  02 00 00 00 00 00 00 00  |................|
```

Created attachment 182960 [details]
Test case
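As an illustration of why a BOM alone is a weak signal, here is a minimal Python sketch of a BOM-only classifier. It is not how `file` or the MIME database are implemented; the table `BOMS` and the function `guess_encoding_from_bom` are hypothetical names made up for this example. Fed the first 16 bytes from the hexdump above, it reports UTF-32 little-endian even though the bytes that follow are clearly not text.

```python
# Byte-order marks, longest first so that UTF-32 LE (ff fe 00 00)
# is not mistaken for UTF-16 LE (ff fe).
BOMS = [
    (b"\xff\xfe\x00\x00", "UTF-32, little-endian"),
    (b"\x00\x00\xfe\xff", "UTF-32, big-endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16, little-endian"),
    (b"\xfe\xff", "UTF-16, big-endian"),
]


def guess_encoding_from_bom(head: bytes):
    """Return an encoding name based only on the leading BOM, or None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


# First 16 bytes of DEPES_MO.DAT, copied from the hexdump in this report.
HEAD = bytes.fromhex("fffe0000 5343414e 534f4654 64657065")

print(guess_encoding_from_bom(HEAD))  # UTF-32, little-endian
```

Any check that stops at the BOM will accept this file, so the content behind the BOM has to be looked at as well.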
(In reply to Malte S. Stretz from comment #6)
> Created attachment 182960 [details]
> Test case

I can confirm, this crashes on a new Neon User.

For some reason, I don't get a crash on Neon Testing but do on Neon Unstable.

(In reply to Malte S. Stretz from comment #5)
> I will attach one of the files. `file` (and I guess thus baloo, too) thinks
> they are UTF-32 text files but they are binary files which just happen to
> start with a BOM (some speech synthesis stuff, this is some quite outdated
> update for a Volvo navigation system).

iconv doesn't like it:

    $ file DEPES_MO.DAT
    DEPES_MO.DAT: Unicode text, UTF-32, little-endian
    $ iconv -f UTF-32 -t UTF-8 DEPES_MO.DAT
    iconv: illegal input sequence at position 4

The question remains how this triggers an assert...

(In reply to tagwerk19 from comment #7)
> For some reason, I don't get a crash on Neon Testing but do on Neon Unstable.

Correction... I do get a crash on Neon Testing

It seems likely that this issue is the same as that for UTF-16 files. The root cause of that one has been identified and described here:

    https://bugs.kde.org/show_bug.cgi?id=506187#c9

I'll close this as a duplicate, if you discover anything new, please reopen!

*** This bug has been marked as a duplicate of bug 506187 ***

Requires a more specific content detection.

(In reply to tagwerk19 from comment #10)
> It seems likely that this issue is the same as that for UTF-16 files...

Wrong on that one, seemingly a separate issue. There's a fix in for the UTF-16 problem which should get to Neon User soon (it's there on Neon Unstable) but that fix doesn't help here.

Created attachment 183031 [details]
Smaller test case
Here is a more minimalistic test case which triggers the crash.
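The `iconv` error at position 4 reported earlier in this thread can be checked by hand. The sketch below (the helper `first_invalid_utf32le` is a hypothetical name, and reading the iconv "position" as a byte offset is an assumption) walks the data in 4-byte little-endian units and reports the first unit that is not a valid Unicode scalar value. For the original test case the second unit is the ASCII string "SCAN" read as the integer 0x4E414353, far above the Unicode maximum of 0x10FFFF.

```python
import struct


def first_invalid_utf32le(data: bytes):
    """Return the byte offset of the first invalid UTF-32 LE unit, or None."""
    usable = len(data) - len(data) % 4
    for offset in range(0, usable, 4):
        (value,) = struct.unpack_from("<I", data, offset)
        # Valid scalar values are at most 0x10FFFF and exclude surrogates.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            return offset
    if usable != len(data):
        return usable  # trailing bytes do not form a whole code unit
    return None


# First 16 bytes of DEPES_MO.DAT, copied from the hexdump in this report.
HEAD = bytes.fromhex("fffe0000 5343414e 534f4654 64657065")

print(first_invalid_utf32le(HEAD))  # 4
```

The offset 4 is consistent with the iconv message, and it shows that the data stops being valid UTF-32 immediately after the BOM.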
This probably won't completely fix this issue, but I think a small sanity check could be added here:
https://invent.kde.org/frameworks/kfilemetadata/-/blob/master/src/extractors/plaintextextractor.cpp?ref_type=heads#L72

* The size of a file encoded in UTF-16 has to be divisible by 2
* The size of a file encoded in UTF-32 has to be divisible by 4

(In reply to Malte S. Stretz from comment #14)
> * The size of a file encoded in UTF-16 has to be divisible by 2
> * The size of a file encoded in UTF-32 has to be divisible by 4

Neat :-)

Git commit 0d63429c72a3839c36ec2134205977a44711b433 by Stefan Brüns.
Committed on 06/07/2025 at 18:16.
Pushed by bruns into branch 'master'.

[PlaintextExtractor] Verify decoded text contains printable characters

In case the decoded text mostly contains control characters and similar, but hardly any letters, number or punctuation, it is very likely the file actually contains fairly arbitrary binary data. This mostly happens when a file starts with a BOM, as it will be detected as text/plain by the mime database.

M +33 -0 src/extractors/plaintextextractor.cpp

https://invent.kde.org/frameworks/kfilemetadata/-/commit/0d63429c72a3839c36ec2134205977a44711b433

A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/241

Git commit d97f3f832f31a89f5ca4ee058043003bc1474223 by Stefan Brüns.
Committed on 14/07/2025 at 12:13.
Pushed by bruns into branch 'master'.

[TermGenerator] Check input text validity

In case the supplied text contains invalid surrogates (i.e. single low surrogates or without preceding high surrogate), the text is not valid unicode. This can also cause QString::toUtf8() to return an empty QByteArray.

Related: bug 506187

M +43 -0 autotests/unit/engine/termgeneratortest.cpp
M +12 -3 src/engine/termgenerator.cpp
A +50 -0 src/engine/termgenerator_p.h [License: LGPL(v2.1+)]

https://invent.kde.org/frameworks/baloo/-/commit/d97f3f832f31a89f5ca4ee058043003bc1474223
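The two ideas discussed above, the size check from comment #14 and the printable-character check described in the kfilemetadata commit message, can be sketched in Python as follows. This is not the code from the commit. The function names, the use of Unicode general categories, and the `threshold` of 0.5 are all assumptions made for illustration only.

```python
import unicodedata

# Bytes per code unit for the fixed-width encodings named in comment #14.
CODE_UNIT_SIZE = {"utf-16": 2, "utf-32": 4}


def size_is_plausible(byte_count: int, encoding: str) -> bool:
    """A UTF-16 or UTF-32 file must consist of whole code units."""
    return byte_count % CODE_UNIT_SIZE[encoding] == 0


def printable_ratio(text: str) -> float:
    """Share of characters that are letters, numbers or punctuation."""
    if not text:
        return 0.0
    # General categories starting with L, N or P cover those three groups.
    printable = sum(1 for ch in text if unicodedata.category(ch)[0] in "LNP")
    return printable / len(text)


def looks_like_text(text: str, threshold: float = 0.5) -> bool:
    # The threshold is a hypothetical value chosen for this sketch.
    return printable_ratio(text) >= threshold


print(size_is_plausible(1001, "utf-32"))                    # False
print(looks_like_text("Binary data with a BOM"))            # True
print(looks_like_text("\x00\x00\x01\x00\\\x00\x02\x02"))    # False
```

The size check is cheap but only rejects files whose length is not a multiple of the code unit size, which matches the remark that it would not completely fix the issue. The ratio check looks at the decoded content itself, so it can also reject binary data that happens to have a plausible length.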