Bug 467865

Summary: Realloc failure on certain files
Product: [Frameworks and Libraries] frameworks-baloo
Reporter: Aaron Williams <aaronw>
Component: Baloo File Daemon
Assignee: baloo-bugs-null
Status: REPORTED ---
Severity: crash
CC: gerrit.huebbers, tagwerk19
Priority: NOR
Keywords: drkonqi
Version First Reported In: 5.104.0
Target Milestone: ---
Platform: openSUSE
OS: Linux
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description Aaron Williams 2023-03-27 22:23:26 UTC
Application: baloo_file_extractor (5.104.0)

Qt Version: 5.15.8
Frameworks Version: 5.104.0
Operating System: Linux 5.14.21-150400.24.46-default x86_64
Windowing System: X11
Distribution: "openSUSE Leap 15.4"
DrKonqi: 5.27.3 [KCrashBackend]

-- Information about the crash:
Baloo seems to crash when it hits large PDF files. Since this document is related to my work it cannot be shared, but I am seeing this happen frequently with large PDF files, often 5MB or more in size.

The reporter is unsure if this crash is reproducible.

-- Backtrace:
Application: Baloo File Extractor (baloo_file_extractor), signal: Aborted
Content of s_kcrashErrorMessage: std::unique_ptr<char []> = {get() = 0x0}
[KCrash Handler]
#6  __GI_raise (sig=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#7  0x00007f493b919355 in __GI_abort () at abort.c:79
#8  0x00007f493b95dae7 in __libc_message (action=do_abort, fmt=0x7f493ba857d8 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#9  0x00007f493b965b6a in malloc_printerr (str=0x7f493ba835c5 "realloc(): invalid pointer") at malloc.c:5347
#10 0x00007f493b96a1b4 in realloc_check (oldmem=0x7f08fe15f010, bytes=1312768, caller=<optimized out>) at hooks.c:291
#11 0x00007f493ae0f3f1 in mdb_midl_need (idp=idp@entry=0x5567999faff8, num=164096, num@entry=1) at midl.c:148
#12 0x00007f493ae071be in mdb_page_touch (mc=mc@entry=0x7ffd07b1ee70) at mdb.c:2370
#13 0x00007f493ae08cf4 in mdb_cursor_touch (mc=mc@entry=0x7ffd07b1ee70) at mdb.c:6308
#14 0x00007f493ae0be8e in mdb_cursor_put (mc=0x7ffd07b1ee70, key=0x7ffd07b1f250, data=0x7ffd07b1f260, flags=<optimized out>) at mdb.c:6442
#15 0x00007f493ae0eb1b in mdb_put (txn=0x5567999fafd0, dbi=2, key=key@entry=0x7ffd07b1f250, data=data@entry=0x7ffd07b1f260, flags=flags@entry=0) at mdb.c:8800
#16 0x00007f493cea6efc in Baloo::PostingDB::put (this=this@entry=0x7ffd07b1f350, term=..., list=...) at /usr/src/debug/baloo5-5.104.0-lp154.300.1.x86_64/src/engine/postingdb.cpp:67
#17 0x00007f493ceb84d4 in Baloo::WriteTransaction::commit (this=0x55679ccb42e0) at /usr/src/debug/baloo5-5.104.0-lp154.300.1.x86_64/src/engine/writetransaction.cpp:312
#18 0x00007f493ceaee0f in Baloo::Transaction::commit (this=0x5567b1481640) at /usr/src/debug/baloo5-5.104.0-lp154.300.1.x86_64/src/engine/transaction.cpp:272
#19 0x0000556797c311bc in Baloo::App::processNextFile (this=0x7ffd07b1f8d0) at /usr/src/debug/baloo5-5.104.0-lp154.300.1.x86_64/src/file/extractor/app.cpp:109
#20 0x00007f493c21e634 in QtPrivate::QSlotObjectBase::call (a=0x7ffd07b1f4a0, r=<optimized out>, this=<optimized out>) at ../../include/QtCore/../../src/corelib/kernel/qobjectdefs_impl.h:398
#21 QSingleShotTimer::timerEvent (this=0x5567aeeb1e60) at kernel/qtimer.cpp:320
#22 0x00007f493c210443 in QObject::event (this=0x5567aeeb1e60, e=0x7ffd07b1f5d0) at kernel/qobject.cpp:1369
#23 0x00007f493c1dc043 in QCoreApplication::notifyInternal2 (receiver=0x5567aeeb1e60, event=0x7ffd07b1f5d0) at kernel/qcoreapplication.cpp:1064
#24 0x00007f493c23de19 in QTimerInfoList::activateTimers (this=0x5567999339f0) at kernel/qtimerinfo_unix.cpp:643
#25 0x00007f493c23e619 in timerSourceDispatch (source=<optimized out>) at kernel/qeventdispatcher_glib.cpp:183
#26 idleTimerSourceDispatch (source=<optimized out>) at kernel/qeventdispatcher_glib.cpp:230
#27 0x00007f49387e282b in g_main_dispatch (context=0x556799762e00) at ../glib/gmain.c:3381
#28 g_main_context_dispatch (context=context@entry=0x556799762e00) at ../glib/gmain.c:4099
#29 0x00007f49387e2bd0 in g_main_context_iterate (context=context@entry=0x556799762e00, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4175
#30 0x00007f49387e2c5c in g_main_context_iteration (context=0x556799762e00, may_block=may_block@entry=1) at ../glib/gmain.c:4240
#31 0x00007f493c23e98c in QEventDispatcherGlib::processEvents (this=0x556799934190, flags=...) at kernel/qeventdispatcher_glib.cpp:423
#32 0x00007f493c1da8aa in QEventLoop::exec (this=this@entry=0x7ffd07b1f820, flags=..., flags@entry=...) at kernel/qeventloop.cpp:235
#33 0x00007f493c1e40e7 in QCoreApplication::exec () at kernel/qcoreapplication.cpp:1375
#34 0x00007f493c62fa7c in QGuiApplication::exec () at kernel/qguiapplication.cpp:1870
#35 0x0000556797c2ded1 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/baloo5-5.104.0-lp154.300.1.x86_64/src/file/extractor/main.cpp:43
[Inferior 1 (process 3272508) detached]

Reported using DrKonqi
Comment 1 tagwerk19 2023-03-28 06:23:21 UTC
This is an off chance...

There was another issue with extracting text from PDFs here (in this case a PDF that contained a scientific plot, something generated by R) in Bug 380456. It didn't fail, as far as I remember, but took days to index.
Check that pdftotext is able to extract the text... https://bugs.kde.org/show_bug.cgi?id=380456#c21

Be aware that you may also be running into issues with BTRFS, where folders get reindexed because they appear with different device IDs. You might have a wildly big (and possibly corrupt) index file.
Have a look at https://bugs.kde.org/show_bug.cgi?id=400704#c31
Comment 2 Aaron Williams 2023-03-28 07:01:00 UTC
I am not running BTRFS. I've had too many bad experiences with it and lost too many BTRFS filesystems in the past to trust it, plus it has this habit of suddenly going out to lunch to do some processing. I have many TiB of data, hence I use XFS. pdftotext seems to work. Many of the files that are listed as failed are schematic files which are mostly lines and graphics, but others are long documents that are thousands of pages in length which are mostly text. Another common element I see in the failed files is that they have a space in the filename, but this is only some of them.
In this case, I am getting the above trap in Baloo where it's failing in realloc, likely due to heap corruption.
Comment 3 Aaron Williams 2023-03-28 07:03:05 UTC
I might add that I do not think it is memory constrained. The system this is running on has 128GiB of RAM and significant swap. I do not see baloo consuming excessive memory, though the index file is a bit over 60 GiB. Given that it's going through around 1.5M files and many TiB of data I am not surprised.
Comment 4 Aaron Williams 2023-03-28 07:06:07 UTC
While I had been seeing these periodic failures earlier, something definitely got goofed up.


balooctl status
kf.i18n: KLocalizedString: Using an empty domain, fix the code. msgid: "Unknown" msgid_plural: "" msgctxt: ""
kf.i18n: KLocalizedString: Using an empty domain, fix the code. msgid: "Indexing file content" msgid_plural: "" msgctxt: ""
Baloo File Indexer is running
Indexer state: Indexing file content
Total files indexed: 854,351
Files waiting for content indexing: 460,114
Files failed to index: 35
Current size of index is 3.98 PiB

8.2G -rw-r--r--   1 XXXXXX users 4.0P Mar 28 00:04 index
I don't think the index file is supposed to grow to 4.0 PiB, even if it is sparse!
Note that this growth happened after I reported this problem.
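
The listing above shows two different sizes for the same file: 8.2G actually allocated on disk and an apparent size of 4.0P. A minimal sketch for reading both values from Python, assuming a Unix system where st_blocks is counted in 512-byte units; the function name sparse_report is made up for illustration:

```python
import os


def sparse_report(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    Assumption: a Unix system where st_blocks counts 512-byte units.
    """
    st = os.stat(path)
    apparent = st.st_size            # the size `ls -l` shows (4.0P above)
    allocated = st.st_blocks * 512   # space actually on disk (8.2G above)
    return apparent, allocated


if __name__ == "__main__":
    import sys

    # Pass the index path on the command line.
    for path in sys.argv[1:]:
        apparent, allocated = sparse_report(path)
        print(f"{path}: apparent={apparent} allocated={allocated}")
```

A large gap between the two numbers only says the file is sparse; it does not by itself say whether the index is corrupt.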
Comment 5 Aaron Williams 2023-03-28 07:07:13 UTC
balooctl indexSize
File Size: 8.18 GiB
Used:      3.94 GiB

           PostingDB:       1.12 GiB    28.433 %
          PositionDB:       1.91 GiB    48.445 %
            DocTerms:     641.15 MiB    15.887 %
    DocFilenameTerms:      65.43 MiB     1.621 %
       DocXattrTerms:       4.00 KiB     0.000 %
              IdTree:      17.61 MiB     0.436 %
          IdFileName:      73.20 MiB     1.814 %
             DocTime:      44.43 MiB     1.101 %
             DocData:      63.86 MiB     1.582 %
   ContentIndexingDB:      13.13 MiB     0.325 %
         FailedIdsDB:       4.00 KiB     0.000 %
             MTimeDB:      14.30 MiB     0.354 %
Comment 6 tagwerk19 2023-03-28 22:10:38 UTC
(In reply to Aaron Williams from comment #2)
> ... hence I use XFS. 
It might still be worth the "sanity check" of running, for a file you know has been indexed:

    baloosearch -i filename:the-file-name

and see whether you get just the one hit. The number is the "docID", a combination of the device and inode number. If you see several hits with different docID's, you are on shifting sands. I'm afraid I don't know how XFS behaves (and whether you have any extra layers) but it's worth keeping an eye on the results from

    stat the-file-name

and see if the device and inode numbers for the file change (for example over reboots).

Igor Poboiko has a python "baloo-checkdb.py" script that checks the index for consistency

        https://invent.kde.org/frameworks/baloo/uploads/bdc9f5f17fc96490b7bd4a22ac664843/baloo-checkdb.py

See https://invent.kde.org/frameworks/baloo/-/merge_requests/87#note_535270 for context

I would worry about trying it on such a large index, but who knows...
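
The stat check above can be scripted so the values are easy to compare across reboots. A minimal sketch, assuming only Python's standard os.stat; the helper name device_and_inode is hypothetical, and it prints the raw device and inode numbers rather than trying to reproduce how Baloo combines them into a docID:

```python
import os
import sys


def device_and_inode(path):
    """Return (st_dev, st_ino) for path, the two values `stat` reports."""
    st = os.stat(path)
    return st.st_dev, st.st_ino


if __name__ == "__main__":
    # Run before and after a reboot (or remount) and compare the output.
    for path in sys.argv[1:]:
        dev, ino = device_and_inode(path)
        print(f"{path}\tdevice={dev}\tinode={ino}")
```

If the device number changes between runs while the file itself is untouched, that would match the "shifting sands" case described above.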
Comment 7 Aaron Williams 2023-03-29 04:19:46 UTC
./baloo-checkdb.py
Loading DB from /home/aaronw/.local/share/baloo/index...
Traceback (most recent call last):
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 265, in <module>
    db.check()
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 235, in check
    self.load_all()
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 52, in load_all
    self._load_posting()
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 87, in _load_posting
    for (key, value) in txn.cursor():
lmdb.PageNotFoundError: mdb_cursor_get: MDB_PAGE_NOTFOUND: Requested page not found

Tracking it down it looks like LMDB is returning MDB_PAGE_NOTFOUND, indicating the database is corrupt.
Note that I have seen the Baloo database suddenly grow to 4PiB in size several times.
Comment 8 tagwerk19 2023-03-29 06:34:49 UTC
(In reply to Aaron Williams from comment #7)
> Tracking it down it looks like LMDB is returning MDB_PAGE_NOTFOUND,
> indicating the database is corrupt.
That seems conclusive :-(

I've noticed times where "having to index" a *very* large number of new files pushed baloo over the edge. At the moment it builds its internal list of files to index (in memory) and commits only when it's done. That can be demanding on memory. I'm suspicious about whether this works when it extends into swap, but I could not reproduce the issue...

Probably not a problem for you with the amount of RAM you've got available.

Maybe time to dig up a backup and find an earlier
    .local/share/baloo
Comment 9 Aaron Williams 2023-03-29 15:37:50 UTC
I reset the index. It looks like the database became corrupted.  Even after recreating the database, I am getting this error again when it hits certain files like some of the large PDF documents I have. One PDF file it is crashing on is 11MiB in size. Again, it's the same realloc bug listed in the stack trace. I cannot provide the files that cause it to crash since they are proprietary documents.
Comment 10 Aaron Williams 2023-03-29 15:45:19 UTC
I also have not backed up the index file. It typically quickly grows fairly large so I specifically do not back it up with my backup system.
Comment 11 tagwerk19 2023-03-29 17:28:21 UTC
(In reply to Aaron Williams from comment #9)
> ... I am getting this error again when it hits certain files like some of the large PDF documents I have...
I have seen trouble with a PDF generated by:

    adobe psl 1.3e for canon

and managed to find a test case on the internet that also caused the crash:

    https://usermanual.wiki/m/638471663caae5d9a0e8cb8fbcdb7aef415557811467c8211b3956f6a5333e80.pdf

As per:

    https://invent.kde.org/frameworks/baloo/-/merge_requests/87

the PDF included non-printable index terms that led to index corruption (see also Bug 464226). The good news here is that the fix should arrive soon, in Frameworks 5.105. Perhaps this is the issue...
Comment 12 Aaron Williams 2023-03-29 17:56:22 UTC
I'm not entirely sure it's PDFs. Unfortunately when I get a stack trace it doesn't report the file that it died on. In the monitor I see it failing even on some html files.