Summary: | Realloc failure on certain files | ||
---|---|---|---|
Product: | [Frameworks and Libraries] frameworks-baloo | Reporter: | Aaron Williams <aaronw> |
Component: | Baloo File Daemon | Assignee: | baloo-bugs-null |
Status: | REPORTED --- | ||
Severity: | crash | CC: | gerrit.huebbers, tagwerk19 |
Priority: | NOR | Keywords: | drkonqi |
Version First Reported In: | 5.104.0 | ||
Target Milestone: | --- | ||
Platform: | openSUSE | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: |
Description
Aaron Williams
2023-03-27 22:23:26 UTC
This is an off chance... There was another issue with extracting text from PDFs (in this case a PDF that contained a scientific plot, something generated by R) in Bug 380456. It didn't fail, as far as I remember, but took days to index. Check that pdftotext is able to extract the text...

    https://bugs.kde.org/show_bug.cgi?id=380456#c21

Be aware that you may also be running into issues with BTRFS and be reindexing your folders as they appear with different Device IDs. You might have a wildly big (and possibly corrupt) index file. Have a look at

    https://bugs.kde.org/show_bug.cgi?id=400704#c31

I am not running BTRFS. I've had too many bad experiences with it and lost too many BTRFS filesystems in the past to trust it, plus it has this habit of suddenly going out to lunch to do some processing. I have many TiB of data, hence I use XFS.

pdftotext seems to work. Many of the files that are listed as failed are schematic files which are mostly lines and graphics, but others are long documents, thousands of pages in length, which are mostly text. Another common element I see in the failed files is that they have a space in the filename, but this applies to only some of them.

In this case, I am getting the above trap in Baloo where it's failing in realloc, likely due to heap corruption.

I might add that I do not think it is memory constrained. The system this is running on has 128 GiB of RAM and significant swap. I do not see baloo consuming excessive memory, though the index file is a bit over 60 GiB. Given that it's going through around 1.5M files and many TiB of data, I am not surprised.

While I had been seeing these periodic failures earlier, something definitely got goofed up.

balooctl status
kf.i18n: KLocalizedString: Using an empty domain, fix the code.
msgid: "Unknown" msgid_plural: "" msgctxt: ""
kf.i18n: KLocalizedString: Using an empty domain, fix the code.
msgid: "Indexing file content" msgid_plural: "" msgctxt: ""
Baloo File Indexer is running
Indexer state: Indexing file content
Total files indexed: 854,351
Files waiting for content indexing: 460,114
Files failed to index: 35
Current size of index is 3.98 PiB

8.2G -rw-r--r-- 1 XXXXXX users 4.0P Mar 28 00:04 index

I don't think the index file is supposed to grow to 4.0 PiB, even if it is sparse! Note that this growth happened after I reported this problem.

balooctl indexSize
File Size: 8.18 GiB
Used: 3.94 GiB

PostingDB: 1.12 GiB 28.433 %
PositionDB: 1.91 GiB 48.445 %
DocTerms: 641.15 MiB 15.887 %
DocFilenameTerms: 65.43 MiB 1.621 %
DocXattrTerms: 4.00 KiB 0.000 %
IdTree: 17.61 MiB 0.436 %
IdFileName: 73.20 MiB 1.814 %
DocTime: 44.43 MiB 1.101 %
DocData: 63.86 MiB 1.582 %
ContentIndexingDB: 13.13 MiB 0.325 %
FailedIdsDB: 4.00 KiB 0.000 %
MTimeDB: 14.30 MiB 0.354 %

(In reply to Aaron Williams from comment #2)
> ... hence I use XFS.

It might still be worth the "sanity check" of running, for a file you know has been indexed:

    baloosearch -i filename:the-file-name

and see whether you get just the one hit. The number is the "docID", a combination of the device and inode number. If you see several hits with different docIDs, you are on shifting sands.

I'm afraid I don't know how XFS behaves (and whether you have any extra layers) but it's worth keeping an eye on the results from

    stat the-file-name

and seeing if the device and inode numbers for the file change (for example over reboots).

Igor Poboiko has a Python "baloo-checkdb.py" script that checks the index for consistency:

    https://invent.kde.org/frameworks/baloo/uploads/bdc9f5f17fc96490b7bd4a22ac664843/baloo-checkdb.py

See

    https://invent.kde.org/frameworks/baloo/-/merge_requests/87#note_535270

for context. I would worry about trying it on such a large index, but who knows...

./baloo-checkdb.py
Loading DB from /home/aaronw/.local/share/baloo/index...
Traceback (most recent call last):
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 265, in <module>
    db.check()
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 235, in check
    self.load_all()
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 52, in load_all
    self._load_posting()
  File "/fast2/aaronw/programming/kde/./baloo-checkdb.py", line 87, in _load_posting
    for (key, value) in txn.cursor():
lmdb.PageNotFoundError: mdb_cursor_get: MDB_PAGE_NOTFOUND: Requested page not found

Tracking it down, it looks like LMDB is returning MDB_PAGE_NOTFOUND, indicating the database is corrupt.

Note that I have seen the Baloo database suddenly grow to 4 PiB in size several times.

(In reply to Aaron Williams from comment #7)
> Tracking it down it looks like LMDB is returning MDB_PAGE_NOTFOUND,
> indicating the database is corrupt.

That seems conclusive :-(

I've noticed times where "having to index" a *very* large number of new files pushed baloo over the edge. At the moment it makes its internal list of files to index (in memory) and commits only when it's done. That can be demanding on memory. I'm suspicious about whether this works if extending into swap, but could not reproduce the issue though... Probably not a problem for you with the amount of RAM you've got available.

Maybe time to dig up a backup and find an earlier .local/share/baloo

I reset the index. It looks like the database became corrupted. Even after recreating the database, I am getting this error again when it hits certain files, like some of the large PDF documents I have. One PDF file it is crashing on is 11 MiB in size. Again, it's the same realloc bug listed in the stack trace.

I cannot provide the files that cause it to crash since they are proprietary documents. I also have not backed up the index file. It typically grows fairly large quite quickly, so I specifically do not back it up with my backup system.

(In reply to Aaron Williams from comment #9)
> ...
> I am getting this error again when it hits certain files like some of the large PDF documents I have...

I have seen trouble with a PDF generated by:

    adobe psl 1.3e for canon

and managed to find a test case on the internet that also caused the crash:

    https://usermanual.wiki/m/638471663caae5d9a0e8cb8fbcdb7aef415557811467c8211b3956f6a5333e80.pdf

As per:

    https://invent.kde.org/frameworks/baloo/-/merge_requests/87

the PDF included non-printable index terms that led to index corruption (see also Bug 464226). The good news here is that the fix should arrive soon, in Frameworks 5.105.

Perhaps this is the issue...

I'm not entirely sure it's PDFs. Unfortunately when I get a stack trace it doesn't report the file that it died on. In the monitor I see it failing even on some html files.
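
One way to test the "non-printable index terms" idea on a suspect file, without having to share the file, might be to scan its extracted text for control and other non-printable characters. The following is only a rough Python sketch: the function names are hypothetical, the choice of Unicode "C" categories as the definition of "non-printable" is an assumption, and it is not how Baloo or the fix in merge request 87 decides which terms to reject.

```python
import unicodedata
from collections import Counter

# Whitespace that normally appears in extracted text (pdftotext emits
# form feeds between pages), so it is not reported as suspicious.
ALLOWED = {"\n", "\r", "\t", "\f"}

def suspicious_chars(text):
    """Count characters whose Unicode category starts with 'C'
    (control, format, private use, unassigned, surrogate)."""
    counts = Counter()
    for ch in text:
        if ch in ALLOWED:
            continue
        if unicodedata.category(ch).startswith("C"):
            counts["U+%04X" % ord(ch)] += 1
    return counts

def scan_file(path):
    """Scan a text file that was produced beforehand, for example
    by running pdftotext on the suspect PDF."""
    with open(path, "rb") as fh:
        # Undecodable bytes become U+FFFD, which is not counted here.
        text = fh.read().decode("utf-8", errors="replace")
    return suspicious_chars(text)
```

For example, after `pdftotext suspect.pdf suspect.txt`, calling `scan_file("suspect.txt").most_common()` lists the code points found and how often each occurs. An empty result would not rule out other causes, and the same check could be run on the text of the HTML files that fail.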