Created attachment 182669 [details]
One CSV file that causes the crash

Running the up-to-date version of KDE neon user.

KDE Plasma Version: 6.4.1
KDE Frameworks Version: 6.15.0
Qt Version: 6.9.0

Baloo crashes on a specific set of files. These are .csv files. I have attached one for reference. The error that the monitor gives is the following:

ASSERT: "!term.isEmpty()" in file ./src/engine/document.cpp, line 23
KCrash: Application 'baloo_file_extractor' crashing... crashRecursionCounter = 2
kf.baloo: Extractor crashed

Backtrace is the following:

Module libgomp.so.1 from deb gcc-14-14.2.0-4ubuntu2~24.04.amd64
Module libzstd.so.1 from deb libzstd-1.5.5+dfsg2-2build1.1.amd64
Module libsystemd.so.0 from deb systemd-255.4-1ubuntu8.8.amd64
Module libgcc_s.so.1 from deb gcc-14-14.2.0-4ubuntu2~24.04.amd64
Module libudev.so.1 from deb systemd-255.4-1ubuntu8.8.amd64
Module libstdc++.so.6 from deb gcc-14-14.2.0-4ubuntu2~24.04.amd64
Stack trace of thread 36272:
#0  0x0000750c0d89eb2c __pthread_kill_implementation (libc.so.6 + 0x9eb2c)
#1  0x0000750c0d84527e __GI_raise (libc.so.6 + 0x4527e)
#2  0x0000750c0f06c33e _ZN6KCrash19defaultCrashHandlerEi (libKF6Crash.so.6 + 0x933e)
#3  0x0000750c0d845330 __restore_rt (libc.so.6 + 0x45330)
#4  0x0000750c0d89eb2c __pthread_kill_implementation (libc.so.6 + 0x9eb2c)
#5  0x0000750c0d84527e __GI_raise (libc.so.6 + 0x4527e)
#6  0x0000750c0d8288ff __GI_abort (libc.so.6 + 0x288ff)
#7  0x0000750c0e5247d9 n/a (libQt6Core.so.6 + 0x5247d9)
#8  0x0000750c0e525ac7 _ZNK14QMessageLogger5fatalEPKcz (libQt6Core.so.6 + 0x525ac7)
#9  0x0000750c0e5100d1 _Z9qt_assertPKcS0_i (libQt6Core.so.6 + 0x5100d1)
#10 0x0000750c0f0805e2 n/a (libKF6BalooEngine.so.6 + 0xa5e2)
#11 0x0000750c0f0965e6 _ZN5Baloo13TermGenerator9indexTextERK7QStringRK10QByteArray (libKF6BalooEngine.so.6 + 0x205e6)
#12 0x0000750c0f0966d4 _ZN5Baloo13TermGenerator9indexTextERK7QString (libKF6BalooEngine.so.6 + 0x206d4)
#13 0x0000750c0a24e1f9 n/a (kfilemetadata_plaintextextractor.so + 0x31f9)
#14 0x000056952e1d446f n/a (baloo_file_extractor + 0x1746f)
#15 0x000056952e1d5e6d n/a (baloo_file_extractor + 0x18e6d)
#16 0x0000750c0e477a99 n/a (libQt6Core.so.6 + 0x477a99)
#17 0x0000750c0e417d4d n/a (libQt6Core.so.6 + 0x417d4d)
#18 0x0000750c0e400ae6 _ZN7QObject5eventEP6QEvent (libQt6Core.so.6 + 0x400ae6)
#19 0x0000750c0e4b0dd0 _ZN16QCoreApplication15notifyInternal2EP7QObjectP6QEvent (libQt6Core.so.6 + 0x4b0dd0)
#20 0x0000750c0e38f087 _ZN14QTimerInfoList14activateTimersEv (libQt6Core.so.6 + 0x38f087)
#21 0x0000750c0e261d99 n/a (libQt6Core.so.6 + 0x261d99)
#22 0x0000750c0d5d75c5 n/a (libglib-2.0.so.0 + 0x5d5c5)
#23 0x0000750c0d636737 n/a (libglib-2.0.so.0 + 0xbc737)
#24 0x0000750c0d5d6a63 g_main_context_iteration (libglib-2.0.so.0 + 0x5ca63)
#25 0x0000750c0e260b3f _ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE (libQt6Core.so.6 + 0x260b3f)
#26 0x0000750c0e4bb4bb _ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE (libQt6Core.so.6 + 0x4bb4bb)
#27 0x0000750c0e4b405f _ZN16QCoreApplication4execEv (libQt6Core.so.6 + 0x4b405f)
#28 0x000056952e1cac94 n/a (baloo_file_extractor + 0xdc94)
#29 0x0000750c0d82a1ca __libc_start_call_main (libc.so.6 + 0x2a1ca)
#30 0x0000750c0d82a28b __libc_start_main_impl (libc.so.6 + 0x2a28b)
#31 0x000056952e1cadd5 n/a (baloo_file_extractor + 0xddd5)

Stack trace of thread 36277:
#0  0x0000750c0d91b4cd __GI___poll (libc.so.6 + 0x11b4cd)
#1  0x0000750c0d63668e n/a (libglib-2.0.so.0 + 0xbc68e)
#2  0x0000750c0d5d6a63 g_main_context_iteration (libglib-2.0.so.0 + 0x5ca63)
#3  0x0000750c0e260b3f _ZN20QEventDispatcherGlib13processEventsE6QFlagsIN10QEventLoop17ProcessEventsFlagEE (libQt6Core.so.6 + 0x260b3f)
#4  0x0000750c0e4bb4bb _ZN10QEventLoop4execE6QFlagsINS_17ProcessEventsFlagEE (libQt6Core.so.6 + 0x4bb4bb)
#5  0x0000750c0e3c9627 _ZN7QThread4execEv (libQt6Core.so.6 + 0x3c9627)
#6  0x0000750c0dbe1b3d n/a (libQt6DBus.so.6 + 0x9bb3d)
#7  0x0000750c0e38fa39 n/a (libQt6Core.so.6 + 0x38fa39)
#8  0x0000750c0d89caa4 start_thread (libc.so.6 + 0x9caa4)
#9  0x0000750c0d929c3c __clone3 (libc.so.6 + 0x129c3c)

Stack trace of thread 36278:
#0  0x0000750c0d91b4cd __GI___poll (libc.so.6 + 0x11b4cd)
#1  0x0000750c0c9398ca n/a (libxcb.so.1 + 0xc8ca)
#2  0x0000750c0c93b28c xcb_wait_for_event (libxcb.so.1 + 0xe28c)
#3  0x0000750c0a20b056 n/a (libQt6XcbQpa.so.6 + 0x64056)
#4  0x0000750c0e38fa39 n/a (libQt6Core.so.6 + 0x38fa39)
#5  0x0000750c0d89caa4 start_thread (libc.so.6 + 0x9caa4)
#6  0x0000750c0d929c3c __clone3 (libc.so.6 + 0x129c3c)
ELF object binary architecture: AMD x86-64

In addition, as there are multiple files, when it attempts to scan each one it spams my tray with crashed process messages, which is really annoying. It now does this every time I log in as well. This issue just started a few days ago, although the files themselves have existed in the same location for years and I do not recall this particular crash.
After further rebuilding the entire index, I can now add that those .csv files are not specifically the culprit, as there are many more that cause the exact same type of crash. The only thing that they have in common is that they all have text encoded as UTF-16.
(In reply to Garirry from comment #0)
> #13 0x0000750c0a24e1f9 n/a (kfilemetadata_plaintextextractor.so + 0x31f9)

If I run the same file on a more-or-less scratch system (Neon Unstable), I see Baloo deciding to use the plain text extractor (using inherited mimetype...) and then successfully indexing the file. It is an empty index though.

> ASSERT: "!term.isEmpty()" in file ./src/engine/document.cpp, line 23

OK... that's fairly clear.

Does the same happen if you have the file in a folder of its own and just index that folder? (You can close down baloo and rename the .local/share/baloo/index file to keep it safe.)
(In reply to tagwerk19 from comment #2)
> Does the same happen if you have the file in a folder of its own and just
> index that folder? (You can close down baloo and rename the
> .local/share/baloo/index file to keep it safe)

Yes, the exact same error occurs. The files which cause the crash do so consistently; if I don't exclude them, baloo scans them again on system boot and crashes for each file.
(In reply to Garirry from comment #3)
> The files which cause the crash do so consistently; if I don't exclude them,
> baloo scans them again on system boot and crashes for each file.

I'm afraid I don't really have an idea here... You are using CJK - Chinese? I apologise for not being familiar.

I don't get a crash, but I do find that if I index the file and check with "balooshow6 -x home.csv", I get a completely different set of plain text terms on a Neon User (dodgy) compared to a Neon Unstable (more sensible).

As a marker, we've also had a recent Bug 505968 where there is some strange behaviour with CJK.
https://bugs.kde.org/show_bug.cgi?id=505968#c2
(In reply to tagwerk19 from comment #4)
> I'm afraid I don't really have an idea here... You are using CJK - Chinese? I
> apologise for not being familiar.

No, I don't use CJK generally. Although speaking of that, I do know that many if not all of the files that are affected, if opened in an editor like KWrite, display CJK characters, and it detects an encoding of UTF-16. If that would help, I could upload more file samples.
(In reply to Garirry from comment #5)
> No, I don't use CJK generally. Although speaking of that, I do know that many
> if not all of the files that are affected, if opened in an editor like
> KWrite, display CJK characters, and it detects an encoding of UTF-16.

If I look at your uploaded "home.csv" (in LibreOffice Calc), it looks like a set of translations: 10 languages that include Japanese and Chinese scripts (plus English, German, French, etc.)

(In reply to tagwerk19 from comment #4)
> I don't get a crash ....

... I've just tried a clean install of Neon User. I now see a crash.

> ... a completely different set of plain text terms on a Neon User (dodgy) ...

Best discard that result; it was on a system with a custom locale (it had LC_TIME=en_SE.UTF-8 to get ISO format short dates - maybe that's too weird...)

So, I can flag "Confirmed" but don't really know where it goes from here (on the basis that I don't get the crash on Neon Unstable or Neon Testing).

Summarising what I see...

    Neon User:
        Plasma: 6.4.1
        Frameworks: 6.15.0
        Qt: 6.9.0
        Wayland
        Crashes

    Neon Testing:
        Plasma: 6.4.1
        Frameworks: 6.16.0
        Qt: 6.9.0
        Wayland
        Seems OK

    Neon Unstable:
        Plasma: 6.4.80
        Frameworks: 6.16.0
        Qt: 6.9.0
        Wayland
        Seems OK
To tidy the UTF-16 loose end: after converting the file from UTF-16 to UTF-8 with

    $ iconv -f UTF-16 -t UTF-8 home.csv > home2.csv

Baloo can read and index it.
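For anyone who wants to script the same workaround over many files, here is a minimal Python sketch of the iconv step. It is only an illustration: both function names are made up for this example, and it assumes the affected files start with a UTF-16 byte order mark, which has not been checked for every file in this report.

```python
import codecs
from pathlib import Path


def looks_like_utf16(path: Path) -> bool:
    """Heuristic only: True if the file starts with a UTF-16 byte order mark.

    Note that a UTF-32 LE BOM (FF FE 00 00) also starts with FF FE, so this
    check cannot tell the two apart.
    """
    head = path.read_bytes()[:2]
    return head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def convert_utf16_to_utf8(src: Path, dst: Path) -> None:
    """Decode src as UTF-16 (byte order taken from the BOM) and write UTF-8."""
    text = src.read_bytes().decode("utf-16")
    # Write bytes rather than text so line endings are left exactly as they were.
    dst.write_bytes(text.encode("utf-8"))
```

Calling convert_utf16_to_utf8(Path("home.csv"), Path("home2.csv")) is intended to be equivalent to the iconv command above for a file with a BOM.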
*** Bug 506516 has been marked as a duplicate of this bug. ***
This seems to be a cascade of bugs/implementation errors, finally triggering the assert.
- The KFileMetaData plaintext extractor uses QIODevice::readLine, although this is not supported for 16-bit encodings (see https://bugreports.qt.io/browse/QTBUG-121812)
- The split code returns a term QString which only contains invalid Unicode code points
- QString::toUtf8() returns an empty QByteArray
(In reply to Stefan Brüns from comment #9)
> This seems to be a cascade of bugs/implementation errors, finally triggering
> the assert.

I find a couple of things confusing:
* This has suddenly started happening, with several very similar bugs.
* All appear on Neon User; the test case we have works on Neon Testing and Unstable.

> - The KFileMetaData plaintext extractor uses QIODevice::readLine, although
> this is not supported for 16-bit encodings (see
> https://bugreports.qt.io/browse/QTBUG-121812)
> - The split code returns a term QString which only contains invalid Unicode
> code points
> - QString::toUtf8() returns an empty QByteArray

That would explain why Baloo is happy if you convert the file to UTF-8 with iconv.
*** Bug 506598 has been marked as a duplicate of this bug. ***
*** Bug 506608 has been marked as a duplicate of this bug. ***
*** Bug 506570 has been marked as a duplicate of this bug. ***
There's a fix for the UTF-16 issue here:
https://invent.kde.org/frameworks/kfilemetadata/-/merge_requests/193

Thank you Stefan!

That's just landed on Neon Unstable. I don't know how long "due course" is, but if it's on Neon Unstable it will arrive on Neon User in "due course" :-)

This doesn't address Bug 506570, a binary file that says it's UTF-32; that seems to be a different issue.
Git commit 9fa1aaaf4a841224161e791cb8ffd366485dc7e3 by Stefan Brüns.
Committed on 06/07/2025 at 18:16.
Pushed by bruns into branch 'master'.

[PlaintextExtractor] Fix various issues with UTF-16

Read the file in binary mode, feed the complete data into QStringDecoder
with the detected encoding, and split the lines last.

Opening a file with open mode "QIODevice::Text" mangles Carriage Return
sequences, and the UTF16-LE sequence "\r\0\n\0" ends up as "\0\n\0", i.e.
an invalid sequence.

QIODevice::readline() only supports 8 bit encodings (see QTBUG 121812),
and the fixup attempts here were not working in general.

Unfortunately, QTextStream::setEncoding only supports UTF encodings, but
none of the legacy ISO-8859 or Windows encodings or e.g. GB18030.

M  +0    -2    autotests/indexerextractortests.cpp
M  +53   -25   src/extractors/plaintextextractor.cpp

https://invent.kde.org/frameworks/kfilemetadata/-/commit/9fa1aaaf4a841224161e791cb8ffd366485dc7e3
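The mangling described in the commit message can be reproduced outside Qt. The following is a minimal Python sketch, not the KFileMetaData code. simulate_text_mode_read is a hypothetical stand-in that assumes a text-mode read simply drops every 0x0D byte (consistent with the "\r\0\n\0" to "\0\n\0" example above), and decode_then_split mirrors the order of operations in the fix: take the raw bytes, decode the whole buffer, split lines last.

```python
def simulate_text_mode_read(raw: bytes) -> bytes:
    """Hypothetical model of a byte-level text-mode read: drop every CR byte."""
    return raw.replace(b"\r", b"")


def decode_then_split(raw: bytes, encoding: str = "utf-16-le") -> list[str]:
    """Decode the complete buffer first, then split it into lines."""
    return raw.decode(encoding).splitlines()


raw = "a,b\r\nc,d\r\n".encode("utf-16-le")

# Byte-level CR stripping removes only the low byte of each U+000D code unit,
# so the 16-bit units that follow a line break are shifted by one byte.
mangled = simulate_text_mode_read(raw)
garbage = mangled.decode("utf-16-le", errors="replace")

print(decode_then_split(raw))            # lines from the untouched bytes
print([hex(ord(ch)) for ch in garbage])  # code points after the byte shift
```

In this sample the misaligned units decode to unrelated code points rather than failing outright; with other input the shifted bytes could just as well form unpaired surrogates.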
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/241
Git commit d97f3f832f31a89f5ca4ee058043003bc1474223 by Stefan Brüns.
Committed on 14/07/2025 at 12:13.
Pushed by bruns into branch 'master'.

[TermGenerator] Check input text validity

In case the supplied text contains invalid surrogates (i.e. single low
surrogates or without preceding high surrogate), the text is not valid
unicode. This can also cause QString::toUtf8() to return an empty
QByteArray.

Related: bug 506570

M  +43   -0    autotests/unit/engine/termgeneratortest.cpp
M  +12   -3    src/engine/termgenerator.cpp
A  +50   -0    src/engine/termgenerator_p.h  [License: LGPL(v2.1+)]

https://invent.kde.org/frameworks/baloo/-/commit/d97f3f832f31a89f5ca4ee058043003bc1474223
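The kind of validity check the commit message describes can be sketched as follows. This is a Python illustration over raw UTF-16 code units, not the code added in termgenerator_p.h; the function name is hypothetical and the actual implementation may differ in detail.

```python
def has_valid_surrogates(units: list[int]) -> bool:
    """Return True if every surrogate in a UTF-16 code unit sequence is paired.

    A high surrogate (0xD800-0xDBFF) must be followed directly by a low
    surrogate (0xDC00-0xDFFF); a low surrogate must never appear on its own.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next unit has to be a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate without a preceding high surrogate.
            return False
        else:
            i += 1
    return True
```

Text that fails such a check would be skipped or sanitised before terms are generated, so the "!term.isEmpty()" assert is never reached with an empty encoded term.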