Bug 506187 - baloo_file_extractor crashes on attempting to index specific files and spams tray notifications
Status: RESOLVED FIXED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: Baloo File Daemon (other bugs)
Version First Reported In: unspecified
Platform: Neon Linux
Importance: NOR crash
Target Milestone: ---
Assignee: baloo-bugs-null
URL:
Keywords:
Duplicates: 506516 506598 506608
Depends on:
Blocks:
 
Reported: 2025-06-25 22:45 UTC by Garirry
Modified: 2025-07-14 20:15 UTC

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:


Attachments
One CSV file that causes the crash (4.42 KB, text/csv)
2025-06-25 22:45 UTC, Garirry

Description Garirry 2025-06-25 22:45:43 UTC
Created attachment 182669
One CSV file that causes the crash

Running the up-to-date KDE neon User edition.
KDE Plasma Version: 6.4.1
KDE Frameworks Version: 6.15.0
Qt Version: 6.9.0

Baloo crashes on a specific set of files. These are .csv files. I have attached one for reference. The error that the monitor gives is the following:

ASSERT: "!term.isEmpty()" in file ./src/engine/document.cpp, line 23
KCrash: Application 'baloo_file_extractor' crashing... crashRecursionCounter = 2
kf.baloo: Extractor crashed

The backtrace is the following:
                Module libgomp.so.1 from deb gcc-14-14.2.0-4ubuntu2~24.04.amd64
                Module libzstd.so.1 from deb libzstd-1.5.5+dfsg2-2build1.1.amd64
                Module libsystemd.so.0 from deb systemd-255.4-1ubuntu8.8.amd64
                Module libgcc_s.so.1 from deb gcc-14-14.2.0-4ubuntu2~24.04.amd64
                Module libudev.so.1 from deb systemd-255.4-1ubuntu8.8.amd64
                Module libstdc++.so.6 from deb gcc-14-14.2.0-4ubuntu2~24.04.amd64
                Stack trace of thread 36272:
                #0  0x0000750c0d89eb2c __pthread_kill_implementation (libc.so.6 + 0x9eb2c)
                #1  0x0000750c0d84527e __GI_raise (libc.so.6 + 0x4527e)
                #2  0x0000750c0f06c33e _ZN6KCrash19defaultCrashHandlerEi (libKF6Crash.so.6 + 0x933e)
                #3  0x0000750c0d845330 __restore_rt (libc.so.6 + 0x45330)
                #4  0x0000750c0d89eb2c __pthread_kill_implementation (libc.so.6 + 0x9eb2c)
                #5  0x0000750c0d84527e __GI_raise (libc.so.6 + 0x4527e)
                #6  0x0000750c0d8288ff __GI_abort (libc.so.6 + 0x288ff)
                #7  0x0000750c0e5247d9 n/a (libQt6Core.so.6 + 0x5247d9)
                #8  0x0000750c0e525ac7 _ZNK14QMessageLogger5fatalEPKcz (libQt6Core.so.6 + 0x525ac7)
                #9  0x0000750c0e5100d1 _Z9qt_assertPKcS0_i (libQt6Core.so.6 + 0x5100d1)
                #10 0x0000750c0f0805e2 n/a (libKF6BalooEngine.so.6 + 0xa5e2)
                #11 0x0000750c0f0965e6 _ZN5Baloo13TermGenerator9indexTextERK7QStringRK10QByteArray (libKF6BalooEngine.so.6 + 0x205e6)
                #12 0x0000750c0f0966d4 _ZN5Baloo13TermGenerator9indexTextERK7QString (libKF6BalooEngine.so.6 + 0x206d4)
                #13 0x0000750c0a24e1f9 n/a (kfilemetadata_plaintextextractor.so + 0x31f9)
                #14 0x000056952e1d446f n/a (baloo_file_extractor + 0x1746f)
                #15 0x000056952e1d5e6d n/a (baloo_file_extractor + 0x18e6d)
                #16 0x0000750c0e477a99 n/a (libQt6Core.so.6 + 0x477a99)
                #17 0x0000750c0e417d4d n/a (libQt6Core.so.6 + 0x417d4d)
                #18 0x0000750c0e400ae6 _ZN7QObject5eventEP6QEvent (libQt6Core.so.6 + 0x400ae6)
                #19 0x0000750c0e4b0dd0 _ZN16QCoreApplication15notifyInternal2EP7QObjectP6QEvent (libQt6Core.so.6 + 0x4b0dd0)
                #20 0x0000750c0e38f087 _ZN14QTimerInfoList14activateTimersEv (libQt6Core.so.6 + 0x38f087)
                #21 0x0000750c0e261d99 n/a (libQt6Core.so.6 + 0x261d99)
                #22 0x0000750c0d5d75c5 n/a (libglib-2.0.so.0 + 0x5d5c5)
                #23 0x0000750c0d636737 n/a (libglib-2.0.so.0 + 0xbc737)
                #24 0x0000750c0d5d6a63 g_main_context_iteration (libglib-2.0.so.0 + 0x5ca63)
                #25 0x0000750c0e260b3f _ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE (libQt6Core.so.6 + 0x260b3f)
                #26 0x0000750c0e4bb4bb _ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE (libQt6Core.so.6 + 0x4bb4bb)
                #27 0x0000750c0e4b405f _ZN16QCoreApplication4execEv (libQt6Core.so.6 + 0x4b405f)
                #28 0x000056952e1cac94 n/a (baloo_file_extractor + 0xdc94)
                #29 0x0000750c0d82a1ca __libc_start_call_main (libc.so.6 + 0x2a1ca)
                #30 0x0000750c0d82a28b __libc_start_main_impl (libc.so.6 + 0x2a28b)
                #31 0x000056952e1cadd5 n/a (baloo_file_extractor + 0xddd5)
                
                Stack trace of thread 36277:
                #0  0x0000750c0d91b4cd __GI___poll (libc.so.6 + 0x11b4cd)
                #1  0x0000750c0d63668e n/a (libglib-2.0.so.0 + 0xbc68e)
                #2  0x0000750c0d5d6a63 g_main_context_iteration (libglib-2.0.so.0 + 0x5ca63)
                #3  0x0000750c0e260b3f _ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE (libQt6Core.so.6 + 0x260b3f)
                #4  0x0000750c0e4bb4bb _ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE (libQt6Core.so.6 + 0x4bb4bb)
                #5  0x0000750c0e3c9627 _ZN7QThread4execEv (libQt6Core.so.6 + 0x3c9627)
                #6  0x0000750c0dbe1b3d n/a (libQt6DBus.so.6 + 0x9bb3d)
                #7  0x0000750c0e38fa39 n/a (libQt6Core.so.6 + 0x38fa39)
                #8  0x0000750c0d89caa4 start_thread (libc.so.6 + 0x9caa4)
                #9  0x0000750c0d929c3c __clone3 (libc.so.6 + 0x129c3c)
                
                Stack trace of thread 36278:
                #0  0x0000750c0d91b4cd __GI___poll (libc.so.6 + 0x11b4cd)
                #1  0x0000750c0c9398ca n/a (libxcb.so.1 + 0xc8ca)
                #2  0x0000750c0c93b28c xcb_wait_for_event (libxcb.so.1 + 0xe28c)
                #3  0x0000750c0a20b056 n/a (libQt6XcbQpa.so.6 + 0x64056)
                #4  0x0000750c0e38fa39 n/a (libQt6Core.so.6 + 0x38fa39)
                #5  0x0000750c0d89caa4 start_thread (libc.so.6 + 0x9caa4)
                #6  0x0000750c0d929c3c __clone3 (libc.so.6 + 0x129c3c)
                ELF object binary architecture: AMD x86-64

In addition, since there are multiple such files, each scan attempt spams my tray with crashed-process notifications, which is really annoying. It now also does this every time I log in.

This issue just started a few days ago, although the files themselves have existed in the same location for years and I do not recall this particular crash.
Comment 1 Garirry 2025-06-26 05:20:21 UTC
After further rebuilding the entire index, I can now add that those .csv files are not specifically the culprit, as there are many more that cause the exact same type of crash. The only thing that they have in common is that they all have text encoded as UTF-16.
Comment 2 tagwerk19 2025-06-26 08:04:06 UTC
(In reply to Garirry from comment #0)
>  #13 0x0000750c0a24e1f9 n/a (kfilemetadata_plaintextextractor.so + 0x31f9)
If I run the same file on a more-or-less scratch system (Neon Unstable), I see Baloo deciding to use the plain text extractor (using inherited mimetype...) and then successfully indexing the file. It is an empty index though.

> ASSERT: "!term.isEmpty()" in file ./src/engine/document.cpp, line 23
OK... that's fairly clear

Does the same happen if you have the file in a folder of its own and just index that folder? (You can close down baloo and rename the .local/share/baloo/index file to keep it safe)
Comment 3 Garirry 2025-06-26 19:15:22 UTC
(In reply to tagwerk19 from comment #2)
> Does the same happen if you have the file in a folder of its own and just
> index that folder? (You can close down baloo and rename the
> .local/share/baloo/index file to keep it safe)

Yes, the exact same error occurs. 

The files which cause the crash do so consistently; if I don't exclude them, baloo scans them again on system boot and crashes for each file.
Comment 4 tagwerk19 2025-06-28 20:52:34 UTC
(In reply to Garirry from comment #3)
> The files which cause the crash do so consistently, if I don't exclude them
> then baloo scans them again on system boot and crashes for each file.
I'm afraid I don't really have an idea here... You are using CJK - Chinese? I apologise for not being familiar.

I don't get a crash but I do find that if I index the file and check with "balooshow6 -x home.csv", I get a completely different set of plain text terms on a Neon User (dodgy) compared to a Neon Unstable (more sensible)

As a marker, we've also had a recent Bug 505968 where there is some strange behaviour with CJK. https://bugs.kde.org/show_bug.cgi?id=505968#c2
Comment 5 Garirry 2025-06-28 22:44:13 UTC
(In reply to tagwerk19 from comment #4)
> I'm afraid I don't really have an idea here... You are using CJK - Chinese? I
> apologise for not being familiar.
No, I don't generally use CJK. Speaking of that, though: many if not all of the affected files display CJK characters when opened in an editor like KWrite, which detects the encoding as UTF-16. If it would help, I can upload more sample files.
Comment 6 tagwerk19 2025-06-29 07:43:40 UTC
(In reply to Garirry from comment #5)
> No I don't use CJK generally. Although speaking of that, I do know that many
> if not all of the files that are affected, if opened in an editor like
> KWrite display CJK characters, and it detects an encoding of UTF-16.
If I look at your uploaded "home.csv" (in LibreOffice Calc) it looks like a set of translations - 10 languages that include Japanese and Chinese scripts (plus English, German, French, etc.)

(In reply to tagwerk19 from comment #4)
> I don't get a crash ....
... I've just tried a clean install of Neon User. I now see a crash.

> ... a completely different set of plain text terms on a Neon User (dodgy) ...
Best discard that result, it was on a system with a custom locale (it had LC_TIME=en_SE.UTF-8 to get ISO format short dates - maybe that's too weird...)

So, I can flag this as "Confirmed", but I don't really know where it goes from here, given that I don't get the crash on Neon Unstable or Neon Testing. Summarising what I see...

Neon User:
    Plasma: 6.4.1
    Frameworks: 6.15.0
    Qt: 6.9.0
    Wayland
Crashes

Neon Testing:
    Plasma: 6.4.1
    Frameworks: 6.16.0
    Qt: 6.9.0
    Wayland
Seems OK

Neon Unstable:
    Plasma: 6.4.80
    Frameworks: 6.16.0
    Qt: 6.9.0
    Wayland
Seems OK
Comment 7 tagwerk19 2025-06-30 11:03:48 UTC
To tidy up the UTF-16 loose end: after converting the file from UTF-16 to UTF-8 with

    $  iconv -f UTF-16 -t UTF-8 home.csv > home2.csv

Baloo can read and index it.
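
For anyone without iconv to hand, the same conversion can be sketched in Python. This is only a rough counterpart of the command above: the file names are the ones from this comment, the function name is hypothetical, and it assumes the input really is UTF-16 (the codec picks the byte order from the BOM when one is present).

```python
# Rough Python counterpart of the iconv call above.
# The function name is hypothetical; file names are taken from the comment.
from pathlib import Path


def convert_utf16_to_utf8(src: Path, dst: Path) -> None:
    # Read raw bytes so that no newline translation touches the data,
    # decode the whole buffer as UTF-16, then write it back out as UTF-8.
    text = src.read_bytes().decode("utf-16")
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    source = Path("home.csv")
    if source.exists():
        convert_utf16_to_utf8(source, Path("home2.csv"))
```

Reading and writing in binary mode is deliberate here, so line endings in the file pass through unchanged.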
Comment 8 tagwerk19 2025-07-04 10:13:25 UTC
*** Bug 506516 has been marked as a duplicate of this bug. ***
Comment 9 Stefan Brüns 2025-07-05 18:33:25 UTC
This seems to be a cascade of bugs/implementation errors, finally triggering the assert.

- The KFileMetaData plaintext extractor uses QIODevice::readLine(), although this is not supported for 16-bit encodings (see https://bugreports.qt.io/browse/QTBUG-121812)
- The split code returns a term QString which only contains invalid Unicode code points
- QString::toUtf8() returns an empty QByteArray
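
The first step of the cascade above can be illustrated outside Qt. The following is a minimal Python sketch, not the KFileMetaData code: it imitates an 8-bit line reader by splitting at the 0x0A byte, and both the sample text and the helper name are assumptions made for illustration.

```python
# Illustration only: a byte-oriented line splitter applied to UTF-16-LE data.
# The sample text and the helper name are hypothetical.

def split_lines_bytewise(data: bytes) -> list[bytes]:
    """Split at every 0x0A byte, as an 8-bit line reader would."""
    return data.split(b"\n")


sample = "ab\ncd".encode("utf-16-le")   # b'a\x00b\x00\n\x00c\x00d\x00'
first, second = split_lines_bytewise(sample)

# The first chunk is still aligned on 16-bit code units ...
print(first.decode("utf-16-le"))        # ab

# ... but the second chunk starts with the high byte of the newline,
# so it has an odd length and every code unit is shifted by one byte.
print(len(second) % 2)                  # 1
print(second.decode("utf-16-le", errors="replace"))
```

In this sample the shifted code units decode to code points in the CJK Unified Ideographs block (U+6300, U+6400) followed by a replacement character, so the text handed to later stages no longer resembles the file's content.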
Comment 10 tagwerk19 2025-07-05 19:50:39 UTC
(In reply to Stefan Brüns from comment #9)
> This seems to be a cascade of bugs/implementation errors, finally triggering
> the assert.
I find a couple of things confusing:

*  This has suddenly started happening, with several very similar bugs. 
*  All appear on Neon User, the test case we have works on Neon Testing and Unstable.

> - The KFileMetaData plaintext extractor uses QIODevice::readline, although
> this is not supported for 16bit encodings (see
> https://bugreports.qt.io/browse/QTBUG-121812)
> - The split code returns a term QString which only contains invalid unicode
> code points
> - QString::toUtf8() returns an empty QByteArray
That would explain why Baloo is happy once you convert the file to UTF-8 with iconv.
Comment 11 tagwerk19 2025-07-06 11:58:13 UTC
*** Bug 506598 has been marked as a duplicate of this bug. ***
Comment 12 tagwerk19 2025-07-06 12:01:32 UTC
*** Bug 506608 has been marked as a duplicate of this bug. ***
Comment 13 tagwerk19 2025-07-06 12:50:00 UTC
*** Bug 506570 has been marked as a duplicate of this bug. ***
Comment 14 tagwerk19 2025-07-07 08:29:52 UTC
There's a fix for the UTF-16 issue here:
    https://invent.kde.org/frameworks/kfilemetadata/-/merge_requests/193
Thank you Stefan!

That has just landed on Neon Unstable, so it will arrive on Neon User in "due course", though I don't know how long "due course" is :-)

This doesn't address Bug 506570 (a binary file that claims to be UTF-32); that seems to be a different issue.
Comment 15 Stefan Brüns 2025-07-07 15:06:56 UTC
Git commit 9fa1aaaf4a841224161e791cb8ffd366485dc7e3 by Stefan Brüns.
Committed on 06/07/2025 at 18:16.
Pushed by bruns into branch 'master'.

[PlaintextExtractor] Fix various issues with UTF-16

Read the file in binary mode, feed the complete data into QStringDecoder
with the detected encoding, and split the lines last.

Opening a file with open mode "QIODevice::Text" mangles Carriage Return
sequences, and the UTF16-LE sequence "\r\0\n\0" ends up as "\0\n\0", i.e.
an invalid sequence.

QIODevice::readline() only supports 8 bit encodings (see QTBUG 121812),
and the fixup attempts here were not working in general.

Unfortunately, QTextStream::setEncoding only supports UTF encodings,
but none of the legacy ISO-8859 or Windows encodings or e.g. GB18030.

M  +0    -2    autotests/indexerextractortests.cpp
M  +53   -25   src/extractors/plaintextextractor.cpp

https://invent.kde.org/frameworks/kfilemetadata/-/commit/9fa1aaaf4a841224161e791cb8ffd366485dc7e3
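
The order of operations described in this commit message (binary read, decode the complete buffer, split lines last) can be sketched in Python. This is a simulation under stated assumptions, not the extractor's code: the function names are hypothetical, and the text-mode behaviour is imitated simply by dropping every 0x0D byte.

```python
# Sketch of the order of operations described in the commit message:
# decode the complete byte buffer first, split into lines last.
# Function names are hypothetical; this is not the KFileMetaData code.

def strip_cr_bytes(data: bytes) -> bytes:
    """Imitate a text-mode read that drops every 0x0D byte."""
    return data.replace(b"\r", b"")


def decode_then_split(data: bytes, encoding: str) -> list[str]:
    """Decode the whole buffer, then split on line boundaries."""
    return data.decode(encoding).splitlines()


raw = "one\r\ntwo".encode("utf-16-le")

# Text-mode style mangling: "\r\0\n\0" becomes "\0\n\0", leaving an odd
# number of bytes that no longer decodes as UTF-16.
mangled = strip_cr_bytes(raw)
print(len(raw) % 2, len(mangled) % 2)          # 0 1

# Binary read + full decode + split last keeps both lines intact.
print(decode_then_split(raw, "utf-16-le"))     # ['one', 'two']
```

Decoding before splitting means the line separator is recognised as a character rather than as a single byte, which is why the encoding no longer matters at the splitting step.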
Comment 16 Bug Janitor Service 2025-07-13 12:25:07 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/241
Comment 17 Stefan Brüns 2025-07-14 20:15:49 UTC
Git commit d97f3f832f31a89f5ca4ee058043003bc1474223 by Stefan Brüns.
Committed on 14/07/2025 at 12:13.
Pushed by bruns into branch 'master'.

[TermGenerator] Check input text validity

In case the supplied text contains invalid surrogates (i.e. single
low surrogates or without preceding high surrogate), the text is not
valid unicode. This can also cause QString::toUtf8() to return an
empty QByteArray.
Related: bug 506570

M  +43   -0    autotests/unit/engine/termgeneratortest.cpp
M  +12   -3    src/engine/termgenerator.cpp
A  +50   -0    src/engine/termgenerator_p.h     [License: LGPL(v2.1+)]

https://invent.kde.org/frameworks/baloo/-/commit/d97f3f832f31a89f5ca4ee058043003bc1474223
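
To make the notion of "invalid surrogates" in this commit message concrete, here is a minimal Python sketch of a surrogate-pairing check over UTF-16 code units. It is a generic validity test written for illustration; the helper name is hypothetical and nothing here is taken from the Baloo change itself.

```python
# Hypothetical helper, not the Baloo implementation: check that a sequence of
# UTF-16 code units contains only correctly paired surrogates.

def has_valid_surrogates(units: list[int]) -> bool:
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate without a preceding high surrogate.
            return False
        else:
            i += 1
    return True


print(has_valid_surrogates([0x0041, 0xD83D, 0xDE00]))  # True  (A + U+1F600)
print(has_valid_surrogates([0xDC00, 0x0041]))          # False (lone low surrogate)
print(has_valid_surrogates([0xD83D]))                  # False (unpaired high surrogate)
```

Text that fails a check of this kind is not valid Unicode, which is the situation the commit describes as a possible cause of an empty QByteArray from QString::toUtf8().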