Bug 506570 - Binary data with UTF BOM misdetected as plain text
Summary: Binary data with UTF BOM misdetected as plain text
Status: RESOLVED FIXED
Alias: None
Product: frameworks-kfilemetadata
Classification: Frameworks and Libraries
Component: general (other bugs)
Version First Reported In: 6.15.0
Platform: Neon Linux
Importance: NOR crash
Target Milestone: ---
Assignee: Pinak Ahuja
URL:
Keywords: drkonqi
Depends on:
Blocks:
 
Reported: 2025-07-04 08:14 UTC by Malte S. Stretz
Modified: 2025-07-14 20:15 UTC (History)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments
Test case (10.05 KB, application/octet-stream)
2025-07-04 18:36 UTC, Malte S. Stretz
Smaller test case (19 bytes, application/octet-stream)
2025-07-07 08:35 UTC, Malte S. Stretz

Description Malte S. Stretz 2025-07-04 08:14:33 UTC
Application: baloo_file_extractor (6.15.0)

ApplicationNotResponding [ANR]: false
Qt Version: 6.9.0
Frameworks Version: 6.15.0
Operating System: Linux 6.11.0-29-generic x86_64
Windowing System: Wayland
Distribution: KDE neon User Edition
DrKonqi: 6.4.1 [CoredumpBackend]

-- Information about the crash:
I was using the tool available at https://www.volvocars.com/de/support/downloads/maps/iam21/europa/ to download ca. 22 GB of data to ~/Downloads.

At some point baloo started to do one of its crash-loop-dances.

I ran it via `flatpak run org.winehq.Wine Sensus_Update_RTI_Europe.exe`. Probably not relevant but the Wine Flatpak used was `app/org.winehq.Wine/x86_64/wow64-24.08`.

The crash can be reproduced every time.

-- Backtrace:
Application: Baloo File Extractor (baloo_file_extractor), signal: Aborted
Content of s_kcrashErrorMessage: std::unique_ptr<char []> = {get() = <optimized out>}
[New LWP 2214]
[New LWP 2217]
[New LWP 2215]
[New LWP 2216]
Downloading separate debug info for /lib/x86_64-linux-gnu/libglib-2.0.so.0...

warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libglib-2.0.so.0
Downloading separate debug info for /lib/x86_64-linux-gnu/libglib-2.0.so.0...

warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libcap.so.2
Downloading separate debug info for /usr/lib/x86_64-linux-gnu/qt6/plugins/kf6/kfilemetadata/kfilemetadata_exiv2extractor.so...
Downloading separate debug info for /lib/x86_64-linux-gnu/libexiv2.so.27...
Downloading separate debug info for system-supplied DSO at 0x7a8e9490a000...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/x86_64-linux-gnu/libexec/kf6/baloo_file_extractor'.
Program terminated with signal SIGABRT, Aborted.
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44

warning: 44	./nptl/pthread_kill.c: No such file or directory
[Current thread is 1 (Thread 0x7a8e91761a40 (LWP 2214))]

Cannot QML trace cores :(
Downloading source file /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/file/extractor/main.cpp...
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qflags.h...
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qeventdispatcher_glib.cpp...
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qtimerinfo_unix.cpp...
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qcoreapplication.cpp...
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qobject.cpp...
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qsingleshottimer.cpp...
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/obj-x86_64-linux-gnu/src/corelib/Core_autogen/include/moc_qsingleshottimer_p.cpp...
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qobjectdefs_impl.h...
Downloading source file /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/file/extractor/app.cpp...
Downloading source file /usr/src/kf6-kfilemetadata-6.15.0-0zneon+24.04+noble+release+build23/src/extractors/plaintextextractor.cpp...
Downloading source file /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/termgenerator.cpp...
Downloading source file /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/document.cpp...
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qassert.cpp...
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qlogging.cpp...
Download failed: Invalid argument.  Continuing without source file ./stdlib/./stdlib/abort.c.
Download failed: Invalid argument.  Continuing without source file ./signal/../sysdeps/posix/raise.c.
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
Download failed: Invalid argument.  Continuing without source file ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S.
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_create.c.
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/thread/qthread_unix.cpp...
Downloading source file /usr/src/qt6-wayland-6.9.0-0zneon+24.04+noble+release+build29/src/client/qwaylanddisplay.cpp...
Download failed: Invalid argument.  Continuing without source file ./io/../sysdeps/unix/sysv/linux/poll.c.
Downloading source file /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/dbus/qdbusconnectionmanager.cpp...
[Current thread is 1 (Thread 0x7a8e91761a40 (LWP 2214))]

Thread 4 (Thread 0x7a8e8ebfe6c0 (LWP 2216)):
#0  0x00007a8e92f1b4cd in __GI___poll (fds=fds@entry=0x7a8e8ebfdb00, nfds=nfds@entry=2, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007a8e8f8bd2cf in poll (__timeout=-1, __nfds=2, __fds=0x7a8e8ebfdb00) at /usr/include/x86_64-linux-gnu/bits/poll2.h:39
#2  QtWaylandClient::EventThread::run (this=0x55d9986a3860) at /usr/src/qt6-wayland-6.9.0-0zneon+24.04+noble+release+build29/src/client/qwaylanddisplay.cpp:186
#3  0x00007a8e9398fa39 in operator() (__closure=<optimized out>) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/thread/qthread_unix.cpp:433
#4  (anonymous namespace)::terminate_on_exception<QThreadPrivate::start(void*)::<lambda()> > (t=...) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/thread/qthread_unix.cpp:365
#5  QThreadPrivate::start (arg=0x55d9986a3860) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/thread/qthread_unix.cpp:393
#6  0x00007a8e92e9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007a8e92f29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 3 (Thread 0x7a8e8f3ff6c0 (LWP 2215)):
#0  0x00007a8e92f1b4cd in __GI___poll (fds=0x55d9986a5120, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007a8e92c3668e in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007a8e92bd6a63 in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007a8e93860b3f in QEventDispatcherGlib::processEvents (this=0x7a8e88000b70, flags=...) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qeventdispatcher_glib.cpp:399
#4  0x00007a8e93abb4bb in QEventLoop::exec (this=0x7a8e8f3feac0, flags=...) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qflags.h:77
#5  0x00007a8e939c9627 in QThread::exec (this=this@entry=0x7a8e931ff540 <QGlobalStatic<QtGlobalStatic::Holder<(anonymous namespace)::Q_QGS__q_manager> >::instance()::holder>) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qflags.h:77
#6  0x00007a8e931e1b3d in QDBusConnectionManager::run (this=0x7a8e931ff540 <QGlobalStatic<QtGlobalStatic::Holder<(anonymous namespace)::Q_QGS__q_manager> >::instance()::holder>) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/dbus/qdbusconnectionmanager.cpp:144
#7  0x00007a8e9398fa39 in operator() (__closure=<optimized out>) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/thread/qthread_unix.cpp:433
#8  (anonymous namespace)::terminate_on_exception<QThreadPrivate::start(void*)::<lambda()> > (t=...) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/thread/qthread_unix.cpp:365
#9  QThreadPrivate::start (arg=0x7a8e931ff540 <QGlobalStatic<QtGlobalStatic::Holder<(anonymous namespace)::Q_QGS__q_manager> >::instance()::holder>) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/thread/qthread_unix.cpp:393
#10 0x00007a8e92e9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#11 0x00007a8e92f29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 2 (Thread 0x7a8e8e3fd6c0 (LWP 2217)):
#0  0x00007a8e92f1b4cd in __GI___poll (fds=fds@entry=0x7a8e8e3fcb00, nfds=nfds@entry=2, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007a8e8f8bd2cf in poll (__timeout=-1, __nfds=2, __fds=0x7a8e8e3fcb00) at /usr/include/x86_64-linux-gnu/bits/poll2.h:39
#2  QtWaylandClient::EventThread::run (this=0x55d9986bb070) at /usr/src/qt6-wayland-6.9.0-0zneon+24.04+noble+release+build29/src/client/qwaylanddisplay.cpp:186
#3  0x00007a8e9398fa39 in operator() (__closure=<optimized out>) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/thread/qthread_unix.cpp:433
#4  (anonymous namespace)::terminate_on_exception<QThreadPrivate::start(void*)::<lambda()> > (t=...) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/thread/qthread_unix.cpp:365
#5  QThreadPrivate::start (arg=0x55d9986bb070) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/thread/qthread_unix.cpp:393
#6  0x00007a8e92e9caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007a8e92f29c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 1 (Thread 0x7a8e91761a40 (LWP 2214)):
[KCrash Handler]
#6  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#7  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#8  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#9  0x00007a8e92e4527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x00007a8e92e288ff in __GI_abort () at ./stdlib/abort.c:79
#11 0x00007a8e93b247d9 in qAbort () at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qassert.cpp:46
#12 qt_message_fatal<QString&> (message=..., context=...) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qlogging.cpp:2149
#13 qt_message(QtMsgType, const QMessageLogContext &, const char *, typedef __va_list_tag __va_list_tag *) (msgType=msgType@entry=QtFatalMsg, context=..., msg=msg@entry=0x7a8e93849e88 "ASSERT: \"%s\" in file %s, line %d", ap=ap@entry=0x7ffc29740f98) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qlogging.cpp:381
#14 0x00007a8e93b25ac7 in QMessageLogger::fatal (this=<optimized out>, msg=0x7a8e93849e88 "ASSERT: \"%s\" in file %s, line %d") at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qlogging.cpp:883
#15 0x00007a8e93b100d1 in qt_assert (assertion=assertion@entry=0x7a8e9471b263 "!term.isEmpty()", file=file@entry=0x7a8e9471b231 "./src/engine/document.cpp", line=line@entry=23) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qassert.cpp:105
#16 0x00007a8e946f15e2 in Baloo::Document::addPositionTerm (this=<optimized out>, term=..., position=<optimized out>) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/document.cpp:21
#17 Baloo::Document::addPositionTerm (this=<optimized out>, term=..., position=<optimized out>) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/document.cpp:21
#18 0x00007a8e947075e6 in Baloo::TermGenerator::indexText (this=0x7ffc29741968, text=..., prefix=...) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/termgenerator.cpp:110
#19 0x00007a8e947076d4 in Baloo::TermGenerator::indexText (this=<optimized out>, text=...) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/termgenerator.cpp:51
#20 0x00007a8e8f7971f9 in KFileMetaData::PlainTextExtractor::extract (this=<optimized out>, result=0x7ffc297418e0) at /usr/src/kf6-kfilemetadata-6.15.0-0zneon+24.04+noble+release+build23/src/extractors/plaintextextractor.cpp:119
#21 0x000055d987cb546f in Baloo::App::index (this=this@entry=0x7ffc29742180, tr=0x55d99bed72e0, url=..., id=id@entry=54499472192832156) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/file/extractor/app.cpp:185
#22 0x000055d987cb6e6d in Baloo::App::processNextFile (this=0x7ffc29742180) at /usr/include/c++/13/bits/unique_ptr.h:199
#23 0x00007a8e93a77a99 in QtPrivate::QSlotObjectBase::call (a=<optimized out>, r=<optimized out>, this=<optimized out>, this=<optimized out>, r=<optimized out>, a=<optimized out>) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qobjectdefs_impl.h:461
#24 doActivate<false> (sender=0x55d998786ea0, signal_index=3, argv=0x7ffc29741c68) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qobject.cpp:4138
#25 0x00007a8e93a17d4d in QSingleShotTimer::timeout (this=0x55d998786ea0) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/obj-x86_64-linux-gnu/src/corelib/Core_autogen/include/moc_qsingleshottimer_p.cpp:116
#26 QSingleShotTimer::timerEvent (this=0x55d998786ea0) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qsingleshottimer.cpp:71
#27 0x00007a8e93a00ae6 in QObject::event (this=0x55d998786ea0, e=0x7ffc29741e10) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qobject.cpp:1406
#28 0x00007a8e93ab0dd0 in QCoreApplication::notifyInternal2 (receiver=0x55d998786ea0, event=0x7ffc29741e10) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qcoreapplication.cpp:1106
#29 0x00007a8e9398f087 in QTimerInfoList::activateTimers (this=0x55d9986b40f0) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qtimerinfo_unix.cpp:426
#30 0x00007a8e93861dd1 in timerSourceDispatch (source=<optimized out>) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qeventdispatcher_glib.cpp:152
#31 idleTimerSourceDispatch (source=<optimized out>) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qeventdispatcher_glib.cpp:199
#32 0x00007a8e92bd75c5 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#33 0x00007a8e92c36737 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#34 0x00007a8e92bd6a63 in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#35 0x00007a8e93860b3f in QEventDispatcherGlib::processEvents (this=0x55d9986948d0, flags=...) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/kernel/qeventdispatcher_glib.cpp:399
#36 0x00007a8e93abb4bb in QEventLoop::exec (this=0x7ffc29742080, flags=...) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qflags.h:77
#37 0x00007a8e93ab405f in QCoreApplication::exec () at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qflags.h:77
#38 0x00007a8e93edd49d in QGuiApplication::exec () at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/gui/kernel/qguiapplication.cpp:1993
#39 0x000055d987cabc94 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/file/extractor/main.cpp:43

Reported using DrKonqi
Comment 1 Malte S. Stretz 2025-07-04 08:17:57 UTC
According to the journal the crash is caused by this assertion:

> Jul 04 10:10:56 localhost baloo_file_extractor[10785]: ASSERT: "!term.isEmpty()" in file ./src/engine/document.cpp, line 23
Comment 2 tagwerk19 2025-07-04 10:22:31 UTC
(In reply to Malte S. Stretz from comment #0)
> #15 0x00007a8e93b100d1 in qt_assert (assertion=assertion@entry=0x7a8e9471b263 "!term.isEmpty()", file=file@entry=0x7a8e9471b231 "./src/engine/document.cpp", line=line@entry=23) at /usr/src/qt6-base-6.9.0-0zneon+24.04+noble+release+build112/src/corelib/global/qassert.cpp:105
> #16 0x00007a8e946f15e2 in Baloo::Document::addPositionTerm (this=<optimized out>, term=..., position=<optimized out>) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/document.cpp:21
> #17 Baloo::Document::addPositionTerm (this=<optimized out>, term=..., position=<optimized out>) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/document.cpp:21
> #18 0x00007a8e947075e6 in Baloo::TermGenerator::indexText (this=0x7ffc29741968, text=..., prefix=...) at /usr/src/kf6-baloo-6.15.0-0zneon+24.04+noble+release+build39/src/engine/termgenerator.cpp:110
Possibly an instance of Bug 506516 / Bug 506187, it looks suspiciously similar....

I'll cut and paste from https://bugs.kde.org/show_bug.cgi?id=506516#c1, it would be interesting if you've got the root cause:

... need to find out which file is causing trouble. In 506187 a UTF-16 file was the cause (a UTF-16 file that contained Chinese/Japanese scripts but that may not be a "necessary condition").

You should be able to follow what's being indexed by running "balooctl6 monitor" or by enabling logging and following the journal.

You can set up logging by creating a "~/.config/QtProject/qtlogging.ini" file containing:

    [rules]
    kf.baloo=true
    kf.baloo.*=true
    kf.kfilemetadata=true

and then restart Baloo, you might need:

    $ pkill baloo_file
    $ systemctl start --user kde-baloo

If you are seeing the same issue, the good news is that it's not there in Neon Testing or Unstable.
Comment 3 Malte S. Stretz 2025-07-04 14:09:01 UTC
Thanks for the pointers. I can reproduce the issue via

```
find "$HOME/Downloads/Sensus Update SPA (Europe)" -type f -print0 | xargs -0 balooctl6 clear
find "$HOME/Downloads/Sensus Update SPA (Europe)" -type f -print0 | xargs -0 balooctl6 check
```

Unfortunately, `balooctl6 monitor` is not very helpful in telling me which file had the issue, but since it happens quite quickly I think I will be able to identify the file.
Comment 4 tagwerk19 2025-07-04 16:26:01 UTC
(In reply to Malte S. Stretz from comment #3)
> Unfortunately, `balooctl6 monitor` is not very helpful in telling me which
> file had the issue, but since it happens quite quickly I think I will be able
> to identify the file.
You might get an "Indexing: /home/whoever/whereever/whatever" followed immediately by an "Idle". If the indexing succeeds you'd get an "Ok" when it is done.

You can try "balooctl6 status" and "balooctl6 failed" to see whether Baloo has recognised the crash and flagged the failure in its index.
Comment 5 Malte S. Stretz 2025-07-04 18:34:34 UTC
(In reply to tagwerk19 from comment #4)
> (In reply to Malte S. Stretz from comment #3)
> > Unfortunately, `balooctl6 monitor` is not very helpful in telling me which
> > file had the issue, but since it happens quite quickly I think I will be able
> > to identify the file.
> You might get an "Indexing: /home/whoever/whereever/whatever" followed
> immediately by an "Idle". If the indexing succeeds you'd get an "Ok" when it
> is done.
> 
> You can try "balooctl6 status" and "balooctl6 failed" to see whether Baloo
> has recognised the crash and flagged the failure in its index.

OK, the trick is to look for lines in the monitor output that do not end with an `Ok`. At first I thought they were all Ok while baloo was crashing left and right.

I will attach one of the files. `file` (and I guess thus baloo, too) thinks they are UTF-32 text files, but they are binary files which just happen to start with a BOM (some speech synthesis data; this is a quite outdated update for a Volvo navigation system).

```
# file ASYNTH/SPEECHVF/ES_SPA/FEMALE/COMPS/DEPES_MO.DAT
ASYNTH/SPEECHVF/ES_SPA/FEMALE/COMPS/DEPES_MO.DAT: Unicode text, UTF-32, little-endian
# hexdump -C ASYNTH/SPEECHVF/ES_SPA/FEMALE/COMPS/DEPES_MO.DAT | head
00000000  ff fe 00 00 53 43 41 4e  53 4f 46 54 64 65 70 65  |....SCANSOFTdepe|
00000010  73 00 00 00 5c 00 00 00  31 2e 30 30 00 00 00 00  |s...\...1.00....|
00000020  73 70 65 00 6d 6f 6e 69  63 61 00 00 00 00 00 00  |spe.monica......|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 00 00 01 00 00 00  66 65 5f 64 65 70 65 73  |........fe_depes|
00000050  00 00 00 00 5c 00 00 00  d8 27 00 00 0e 00 00 00  |....\....'......|
00000060  67 6c 6f 62 61 6c 00 00  02 00 00 00 04 09 00 00  |global..........|
00000070  05 00 00 00 09 00 00 00  14 00 00 00 06 00 00 00  |................|
00000080  21 00 00 00 05 00 00 00  a0 00 00 00 7c 0b 00 00  |!...........|...|
00000090  00 00 02 02 02 02 02 02  02 00 00 00 00 00 00 00  |................|
```
Comment 6 Malte S. Stretz 2025-07-04 18:36:34 UTC
Created attachment 182960 [details]
Test case
Comment 7 tagwerk19 2025-07-04 19:07:27 UTC
(In reply to Malte S. Stretz from comment #6)
> Created attachment 182960 [details]
> Test case
I can confirm, this crashes on a new Neon User.

For some reason, I don't get a crash on Neon Testing but do on Neon Unstable.
Comment 8 tagwerk19 2025-07-04 20:26:37 UTC
(In reply to Malte S. Stretz from comment #5)
> I will attach one of the files. `file` (and I guess thus baloo, too) thinks
> they are UTF-32 text files but they are binary files which just happen to
> start with a BOM (some speech synthesis stuff, this is some quite outdated
> update for a Volvo navigation system).
iconv doesn't like it:

    $ file DEPES_MO.DAT
    DEPES_MO.DAT: Unicode text, UTF-32, little-endian
    $ iconv -f UTF-32 -t UTF-8 DEPES_MO.DAT
    iconv: illegal input sequence at position 4

The question remains how this triggers an assert...
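
For reference, the position-4 failure can be checked by hand. The four bytes after the BOM in the hexdump from comment 5 are `53 43 41 4e` ("SCAN"). Read as one little-endian UTF-32 code unit they give 0x4E414353, which is far above the highest Unicode code point, U+10FFFF. A minimal C++ sketch of that arithmetic follows; the helper names are made up for illustration and are not part of iconv, Baloo or KFileMetaData:

```cpp
#include <array>
#include <cstdint>

// Assemble one UTF-32 code unit from four little-endian bytes.
// Hypothetical helper, for illustration only.
inline std::uint32_t decodeUtf32Le(const std::array<std::uint8_t, 4> &b)
{
    return static_cast<std::uint32_t>(b[0])
         | (static_cast<std::uint32_t>(b[1]) << 8)
         | (static_cast<std::uint32_t>(b[2]) << 16)
         | (static_cast<std::uint32_t>(b[3]) << 24);
}

// A Unicode scalar value is at most U+10FFFF and is not a surrogate.
inline bool isUnicodeScalar(std::uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

With these, `decodeUtf32Le({0xff, 0xfe, 0x00, 0x00})` yields U+FEFF (the BOM), while the next four bytes decode to a value that `isUnicodeScalar` rejects. This only explains the iconv error; it does not by itself explain the assert.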
Comment 9 tagwerk19 2025-07-04 21:26:06 UTC
(In reply to tagwerk19 from comment #7)
> For some reason, I don't get a crash on Neon Testing but do on Neon Unstable.
Correction... I do get a crash on Neon Testing
Comment 10 tagwerk19 2025-07-06 12:50:00 UTC
It seems likely that this issue is the same as that for UTF-16 files, the root cause of which has been identified and described here:
    https://bugs.kde.org/show_bug.cgi?id=506187#c9

I'll close this as a duplicate, if you discover anything new, please reopen!

*** This bug has been marked as a duplicate of bug 506187 ***
Comment 11 Stefan Brüns 2025-07-06 14:51:21 UTC
Requires a more specific content detection.
Comment 12 tagwerk19 2025-07-07 08:32:50 UTC
(In reply to tagwerk19 from comment #10)
> It seems likely that this issue is the same as that for UTF-16 files...
Wrong on that one, seemingly a separate issue.

There's a fix in for the UTF-16 problem which should get to Neon User soon (it's there on Neon Unstable) but that fix doesn't help here.
Comment 13 Malte S. Stretz 2025-07-07 08:35:53 UTC
Created attachment 183031 [details]
Smaller test case

Here is a more minimal test case that triggers the crash.
Comment 14 Malte S. Stretz 2025-07-07 08:40:18 UTC
This probably won't completely fix this issue, but I think a small sanity check could be added at https://invent.kde.org/frameworks/kfilemetadata/-/blob/master/src/extractors/plaintextextractor.cpp?ref_type=heads#L72:

* The size of a file encoded in UTF-16 has to be divisible by 2
* The size of a file encoded in UTF-32 has to be divisible by 4
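
A rough sketch of what such a check might look like, using only the C++ standard library. This is not the code in plaintextextractor.cpp; `BomKind`, `detectBom` and `sizeMatchesBom` are hypothetical names chosen for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class BomKind { None, Utf8, Utf16, Utf32 };

// Classify the first bytes of a file by their byte order mark.
// UTF-32 is tested first because the UTF-32 LE BOM (FF FE 00 00)
// begins with the UTF-16 LE BOM (FF FE).
inline BomKind detectBom(const std::vector<std::uint8_t> &head)
{
    const std::size_t n = head.size();
    if (n >= 4) {
        if (head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return BomKind::Utf32; // little-endian
        if (head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)
            return BomKind::Utf32; // big-endian
    }
    if (n >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        return BomKind::Utf8;
    if (n >= 2) {
        if ((head[0] == 0xFF && head[1] == 0xFE) || (head[0] == 0xFE && head[1] == 0xFF))
            return BomKind::Utf16;
    }
    return BomKind::None;
}

// The proposed sanity check: the file size must be a whole number of code units.
inline bool sizeMatchesBom(BomKind kind, std::uintmax_t fileSize)
{
    switch (kind) {
    case BomKind::Utf16:
        return fileSize % 2 == 0;
    case BomKind::Utf32:
        return fileSize % 4 == 0;
    default:
        return true; // no constraint for UTF-8 or files without a BOM
    }
}
```

Assuming the 19-byte test case starts with the same UTF-32 BOM, it would be rejected because 19 is not divisible by 4. Binary files whose size happens to be a multiple of 4 would still pass, and a UTF-16 LE text that begins with U+0000 right after its BOM would be classified as UTF-32 here, so this can only be a first filter.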
Comment 15 tagwerk19 2025-07-07 08:45:12 UTC
(In reply to Malte S. Stretz from comment #14)
> * The size of a file encoded in UTF-16 has to be divisible by 2
> * The size of a file encoded in UTF-32 has to be divisible by 4
Neat :-)
Comment 16 Stefan Brüns 2025-07-07 15:06:48 UTC
Git commit 0d63429c72a3839c36ec2134205977a44711b433 by Stefan Brüns.
Committed on 06/07/2025 at 18:16.
Pushed by bruns into branch 'master'.

[PlaintextExtractor] Verify decoded text contains printable characters

In case the decoded text mostly contains control characters and similar,
but hardly any letters, number or punctuation, it is very likely the
file actually contains fairly arbitrary binary data.

This mostly happens when a file starts with a BOM, as it will be
detected as text/plain by the mime database.

M  +33   -0    src/extractors/plaintextextractor.cpp

https://invent.kde.org/frameworks/kfilemetadata/-/commit/0d63429c72a3839c36ec2134205977a44711b433
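
To illustrate the idea in the commit message, here is a minimal sketch of a "mostly printable" heuristic in plain C++. It is not the committed code: the classification in `looksPrintable`, the 0.9 threshold and both function names are assumptions made for this example, and the real change may classify characters differently.

```cpp
#include <cstddef>
#include <string>

// Crude stand-in for "letter, number or punctuation": everything except
// control characters, surrogates and values outside the Unicode range.
inline bool looksPrintable(char32_t cp)
{
    if (cp == U'\t' || cp == U'\n' || cp == U'\r')
        return true;                          // common whitespace
    if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0))
        return false;                         // C0/C1 controls and DEL
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;                         // surrogate code points
    if (cp > 0x10FFFF)
        return false;                         // beyond the Unicode range
    return true;
}

// True if at least `minRatio` of the decoded code points look printable.
// Empty input is treated as "not text" since there is nothing to index.
inline bool mostlyPrintable(const std::u32string &text, double minRatio = 0.9)
{
    if (text.empty())
        return false;
    std::size_t printable = 0;
    for (char32_t cp : text) {
        if (looksPrintable(cp))
            ++printable;
    }
    return static_cast<double>(printable) >= minRatio * static_cast<double>(text.size());
}
```

An extractor could skip indexing whenever such a check fails on the decoded text, which would cover BOM-prefixed binary files regardless of their size.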
Comment 17 Bug Janitor Service 2025-07-13 12:25:08 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/241
Comment 18 Stefan Brüns 2025-07-14 20:15:57 UTC
Git commit d97f3f832f31a89f5ca4ee058043003bc1474223 by Stefan Brüns.
Committed on 14/07/2025 at 12:13.
Pushed by bruns into branch 'master'.

[TermGenerator] Check input text validity

In case the supplied text contains invalid surrogates (i.e. single
low surrogates or without preceding high surrogate), the text is not
valid unicode. This can also cause QString::toUtf8() to return an
empty QByteArray.
Related: bug 506187

M  +43   -0    autotests/unit/engine/termgeneratortest.cpp
M  +12   -3    src/engine/termgenerator.cpp
A  +50   -0    src/engine/termgenerator_p.h     [License: LGPL(v2.1+)]

https://invent.kde.org/frameworks/baloo/-/commit/d97f3f832f31a89f5ca4ee058043003bc1474223
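
The validity condition described in this commit message can be sketched as follows. This is an illustration in plain C++ on UTF-16 code units, not the contents of termgenerator_p.h; the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// True if every surrogate code unit in `text` belongs to a well-formed
// pair, i.e. a high surrogate immediately followed by a low surrogate.
inline bool hasValidSurrogates(const std::u16string &text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (isHighSurrogate(text[i])) {
            if (i + 1 >= text.size() || !isLowSurrogate(text[i + 1]))
                return false; // high surrogate without a following low one
            ++i;              // skip the low half of the pair
        } else if (isLowSurrogate(text[i])) {
            return false;     // low surrogate without a preceding high one
        }
    }
    return true;
}
```

Text that fails such a check is not valid Unicode, so a term generator can reject it up front instead of relying on the UTF-8 conversion result being non-empty.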