Application: baloo_file_extractor (5.104.0)

Qt Version: 5.15.8
Frameworks Version: 5.104.0
Operating System: Linux 5.14.21-150400.24.46-default x86_64
Windowing System: X11
Distribution: "openSUSE Leap 15.4"
DrKonqi: 5.27.3 [KCrashBackend]

-- Information about the crash:
Baloo seems to crash when it hits large PDF files. Since this document is related to my work it cannot be shared, but I am seeing this happen frequently with large PDF files, often 5MB or more in size.

The reporter is unsure if this crash is reproducible.

-- Backtrace:
Application: Baloo File Extractor (baloo_file_extractor), signal: Aborted
Content of s_kcrashErrorMessage: std::unique_ptr<char []> = {get() = 0x0}
[KCrash Handler]
#6  __GI_raise (sig=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#7  0x00007f493b919355 in __GI_abort () at abort.c:79
#8  0x00007f493b95dae7 in __libc_message (action=do_abort, fmt=0x7f493ba857d8 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#9  0x00007f493b965b6a in malloc_printerr (str=0x7f493ba835c5 "realloc(): invalid pointer") at malloc.c:5347
#10 0x00007f493b96a1b4 in realloc_check (oldmem=0x7f08fe15f010, bytes=1312768, caller=<optimized out>) at hooks.c:291
#11 0x00007f493ae0f3f1 in mdb_midl_need (idp=idp@entry=0x5567999faff8, num=164096, num@entry=1) at midl.c:148
#12 0x00007f493ae071be in mdb_page_touch (mc=mc@entry=0x7ffd07b1ee70) at mdb.c:2370
#13 0x00007f493ae08cf4 in mdb_cursor_touch (mc=mc@entry=0x7ffd07b1ee70) at mdb.c:6308
#14 0x00007f493ae0be8e in mdb_cursor_put (mc=0x7ffd07b1ee70, key=0x7ffd07b1f250, data=0x7ffd07b1f260, flags=<optimized out>) at mdb.c:6442
#15 0x00007f493ae0eb1b in mdb_put (txn=0x5567999fafd0, dbi=2, key=key@entry=0x7ffd07b1f250, data=data@entry=0x7ffd07b1f260, flags=flags@entry=0) at mdb.c:8800
#16 0x00007f493cea6efc in Baloo::PostingDB::put (this=this@entry=0x7ffd07b1f350, term=..., list=...) at /usr/src/debug/baloo5-5.104.0-lp154.300.1.x86_64/src/engine/postingdb.cpp:67
#17 0x00007f493ceb84d4 in Baloo::WriteTransaction::commit (this=0x55679ccb42e0) at /usr/src/debug/baloo5-5.104.0-lp154.300.1.x86_64/src/engine/writetransaction.cpp:312
#18 0x00007f493ceaee0f in Baloo::Transaction::commit (this=0x5567b1481640) at /usr/src/debug/baloo5-5.104.0-lp154.300.1.x86_64/src/engine/transaction.cpp:272
#19 0x0000556797c311bc in Baloo::App::processNextFile (this=0x7ffd07b1f8d0) at /usr/src/debug/baloo5-5.104.0-lp154.300.1.x86_64/src/file/extractor/app.cpp:109
#20 0x00007f493c21e634 in QtPrivate::QSlotObjectBase::call (a=0x7ffd07b1f4a0, r=<optimized out>, this=<optimized out>) at ../../include/QtCore/../../src/corelib/kernel/qobjectdefs_impl.h:398
#21 QSingleShotTimer::timerEvent (this=0x5567aeeb1e60) at kernel/qtimer.cpp:320
#22 0x00007f493c210443 in QObject::event (this=0x5567aeeb1e60, e=0x7ffd07b1f5d0) at kernel/qobject.cpp:1369
#23 0x00007f493c1dc043 in QCoreApplication::notifyInternal2 (receiver=0x5567aeeb1e60, event=0x7ffd07b1f5d0) at kernel/qcoreapplication.cpp:1064
#24 0x00007f493c23de19 in QTimerInfoList::activateTimers (this=0x5567999339f0) at kernel/qtimerinfo_unix.cpp:643
#25 0x00007f493c23e619 in timerSourceDispatch (source=<optimized out>) at kernel/qeventdispatcher_glib.cpp:183
#26 idleTimerSourceDispatch (source=<optimized out>) at kernel/qeventdispatcher_glib.cpp:230
#27 0x00007f49387e282b in g_main_dispatch (context=0x556799762e00) at ../glib/gmain.c:3381
#28 g_main_context_dispatch (context=context@entry=0x556799762e00) at ../glib/gmain.c:4099
#29 0x00007f49387e2bd0 in g_main_context_iterate (context=context@entry=0x556799762e00, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4175
#30 0x00007f49387e2c5c in g_main_context_iteration (context=0x556799762e00, may_block=may_block@entry=1) at ../glib/gmain.c:4240
#31 0x00007f493c23e98c in QEventDispatcherGlib::processEvents (this=0x556799934190, flags=...) at kernel/qeventdispatcher_glib.cpp:423
#32 0x00007f493c1da8aa in QEventLoop::exec (this=this@entry=0x7ffd07b1f820, flags=..., flags@entry=...) at kernel/qeventloop.cpp:235
#33 0x00007f493c1e40e7 in QCoreApplication::exec () at kernel/qcoreapplication.cpp:1375
#34 0x00007f493c62fa7c in QGuiApplication::exec () at kernel/qguiapplication.cpp:1870
#35 0x0000556797c2ded1 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/baloo5-5.104.0-lp154.300.1.x86_64/src/file/extractor/main.cpp:43
[Inferior 1 (process 3272508) detached]

Reported using DrKonqi
This is an off chance... There was another issue with extracting text from PDFs (in that case a PDF that contained a scientific plot, something generated by R) in Bug 380456. It didn't fail, as far as I remember, but took days to index. Check that pdftotext is able to extract the text...

    https://bugs.kde.org/show_bug.cgi?id=380456#c21

Be aware that you may also be running into issues with BTRFS and be reindexing your folders as they appear with different device IDs. You might have a wildly big (and possibly corrupt) index file. Have a look at

    https://bugs.kde.org/show_bug.cgi?id=400704#c31
I am not running BTRFS. I've had too many bad experiences with it and lost too many BTRFS filesystems in the past to trust it, plus it has this habit of suddenly going out to lunch to do some processing. I have many TiB of data, hence I use XFS.

pdftotext seems to work. Many of the files that are listed as failed are schematic files which are mostly lines and graphics, but others are long, mostly-text documents that run to thousands of pages. Another common element I see in the failed files is a space in the filename, but that applies to only some of them.

In this case, I am getting the above trap in Baloo where it's failing in realloc, likely due to heap corruption.
I might add that I do not think it is memory constrained. The system this is running on has 128GiB of RAM and significant swap. I do not see baloo consuming excessive memory, though the index file is a bit over 60 GiB. Given that it's going through around 1.5M files and many TiB of data I am not surprised.
While I had been seeing these periodic failures earlier, something definitely got goofed up.

balooctl status
kf.i18n: KLocalizedString: Using an empty domain, fix the code. msgid: "Unknown" msgid_plural: "" msgctxt: ""
kf.i18n: KLocalizedString: Using an empty domain, fix the code. msgid: "Indexing file content" msgid_plural: "" msgctxt: ""
Baloo File Indexer is running
Indexer state: Indexing file content
Total files indexed: 854,351
Files waiting for content indexing: 460,114
Files failed to index: 35
Current size of index is 3.98 PiB

8.2G -rw-r--r-- 1 XXXXXX users 4.0P Mar 28 00:04 index

I don't think the index file is supposed to grow to 4.0 PiB, even if it is sparse! Note that this growth happened after I reported this problem.
balooctl indexSize
File Size: 8.18 GiB
Used:      3.94 GiB

           PostingDB:       1.12 GiB    28.433 %
          PositionDB:       1.91 GiB    48.445 %
            DocTerms:     641.15 MiB    15.887 %
    DocFilenameTerms:      65.43 MiB     1.621 %
       DocXattrTerms:       4.00 KiB     0.000 %
              IdTree:      17.61 MiB     0.436 %
          IdFileName:      73.20 MiB     1.814 %
             DocTime:      44.43 MiB     1.101 %
             DocData:      63.86 MiB     1.582 %
   ContentIndexingDB:      13.13 MiB     0.325 %
         FailedIdsDB:       4.00 KiB     0.000 %
             MTimeDB:      14.30 MiB     0.354 %
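The two figures in the ls listing (8.2G allocated against 4.0P apparent size) can also be read straight from stat. A minimal Python sketch, assuming Linux, where st_blocks is counted in 512-byte units; the function name sparse_report is made up for illustration and is not part of Baloo or balooctl:

```python
import os
import sys


def sparse_report(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    apparent  = logical length, what "ls -l" prints
    allocated = blocks actually backed by storage, what "ls -s" / "du" print
    Assumption: st_blocks is in 512-byte units (true on Linux).
    """
    st = os.stat(path)
    apparent = st.st_size
    allocated = st.st_blocks * 512
    return apparent, allocated


if __name__ == "__main__":
    # Usage: python3 sparse_report.py ~/.local/share/baloo/index
    for p in sys.argv[1:]:
        apparent, allocated = sparse_report(p)
        print(f"{p}: apparent={apparent} bytes, allocated={allocated} bytes")
```

A large gap between the two values only says the file is sparse; it does not by itself say whether the index is corrupt.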
(In reply to Aaron Williams from comment #2)
> ... hence I use XFS.

It might still be worth the "sanity check" of running, for a file you know has been indexed:

    baloosearch -i filename:the-file-name

and see whether you get just the one hit. The number is the "docID", a combination of the device and inode number. If you see several hits with different docIDs, you are on shifting sands.

I'm afraid I don't know how XFS behaves (and whether you have any extra layers) but it's worth keeping an eye on the results from

    stat the-file-name

and see if the device and inode numbers for the file change (for example over reboots).

Igor Poboiko has a python "baloo-checkdb.py" script that checks the index for consistency

    https://invent.kde.org/frameworks/baloo/uploads/bdc9f5f17fc96490b7bd4a22ac664843/baloo-checkdb.py

See

    https://invent.kde.org/frameworks/baloo/-/merge_requests/87#note_535270

for context.

I would worry about trying it on such a large index, but who knows...
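One way to keep an eye on those numbers is to save them once and compare after a reboot. A rough Python sketch; it only reports what stat returns and makes no claim about how Baloo packs the two values into a docID. The helper names and the snapshot file are hypothetical:

```python
import json
import os


def dev_inode(path):
    """Return the device and inode numbers that stat reports for path."""
    st = os.stat(path)
    return {"device": st.st_dev, "inode": st.st_ino}


def changed_since(path, snapshot_file):
    """Compare the current device/inode pair with a saved snapshot.

    On the first run the snapshot is written and False is returned.
    On later runs, True means the pair differs from the saved one.
    """
    current = dev_inode(path)
    if not os.path.exists(snapshot_file):
        with open(snapshot_file, "w") as fh:
            json.dump(current, fh)
        return False
    with open(snapshot_file) as fh:
        previous = json.load(fh)
    return previous != current
```

If changed_since starts returning True for files that were not moved or recreated, the device number is probably not stable on that setup.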
./baloo-checkdb.py
Loading DB from /home/aaronw/.local/share/baloo/index...
Traceback (most recent call last):
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 265, in <module>
    db.check()
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 235, in check
    self.load_all()
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 52, in load_all
    self._load_posting()
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 87, in _load_posting
    for (key, value) in txn.cursor():
lmdb.PageNotFoundError: mdb_cursor_get: MDB_PAGE_NOTFOUND: Requested page not found

Tracking it down it looks like LMDB is returning MDB_PAGE_NOTFOUND, indicating the database is corrupt.

Note that I have seen the Baloo database suddenly grow to 4PiB in size several times.
(In reply to Aaron Williams from comment #7)
> Tracking it down it looks like LMDB is returning MDB_PAGE_NOTFOUND,
> indicating the database is corrupt.

That seems conclusive :-(

I've noticed times where "having to index" a *very* large number of new files pushed baloo over the edge. At the moment it makes its internal list of files to index (in memory) and commits only when it's done. That can be demanding on memory. I'm suspicious about whether this works when extending into swap, but could not reproduce the issue... Probably not a problem for you with the amount of RAM you've got available.

Maybe time to dig up a backup and find an earlier .local/share/baloo
I reset the index. It looks like the database became corrupted. Even after recreating the database, I am getting this error again when it hits certain files like some of the large PDF documents I have. One PDF file it is crashing on is 11MiB in size. Again, it's the same realloc bug listed in the stack trace. I cannot provide the files that cause it to crash since they are proprietary documents.
I also have not backed up the index file. It typically quickly grows fairly large so I specifically do not back it up with my backup system.
(In reply to Aaron Williams from comment #9)
> ... I am getting this error again when it hits certain files like some of the large PDF documents I have...

I have seen trouble with a PDF generated by:

    adobe psl 1.3e for canon

and managed to find a test case on the internet that also caused the crash:

    https://usermanual.wiki/m/638471663caae5d9a0e8cb8fbcdb7aef415557811467c8211b3956f6a5333e80.pdf

As per:

    https://invent.kde.org/frameworks/baloo/-/merge_requests/87

the PDF included non-printable index terms that led to index corruption (see also Bug 464226).

The good news here is that the fix should arrive soon, in Frameworks 5.105. Perhaps this is the issue...
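To illustrate what "non-printable index terms" means in practice, here is a small Python sketch that drops control and other non-printable characters from a term before it would be stored. This is not the code from the merge request, only an illustration of the idea; the function name is invented and the choice of Unicode categories is an assumption:

```python
import unicodedata


def strip_nonprintable(term):
    """Remove characters whose Unicode general category starts with "C".

    That covers control (Cc), format (Cf), surrogate (Cs), private use (Co)
    and unassigned (Cn) code points. Letters, digits, punctuation and
    spaces are kept.
    """
    return "".join(
        ch for ch in term
        if not unicodedata.category(ch).startswith("C")
    )
```

For example, a term extracted as "abc\x00def" would come out as "abcdef", while ordinary accented text passes through unchanged.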
I'm not entirely sure it's PDFs. Unfortunately when I get a stack trace it doesn't report the file that it died on. In the monitor I see it failing even on some html files.