Bug 444520

Summary:	Baloo content indexing resurrects itself after being killed and disabled
Product:	[Applications] systemsettings	Reporter:	Adam Fontenot <adam.m.fontenot+kde>
Component:	kcm_baloo	Assignee:	baloo-bugs-null
Status:	REPORTED ---
Severity:	normal	CC:	nate, plasma-bugs-null, tagwerk19
Priority:	NOR
Version First Reported In:	5.23.2
Target Milestone:	---
Platform:	Arch Linux
OS:	Linux
See Also:	https://bugs.kde.org/show_bug.cgi?id=472954
Attachments:	log of requested debugging commands

Description Adam Fontenot 2021-10-28 07:07:13 UTC

SUMMARY
Just in time for Halloween, Baloo returns from the dead to haunt the living.

Suppose a user encounters one of the many common performance issues caused by Baloo, like the one I describe here: https://bugs.kde.org/show_bug.cgi?id=380456#c14

They are likely to try to pause Baloo using system settings, but this (frequently? always?) doesn't work, as I describe in this bug: https://bugs.kde.org/show_bug.cgi?id=443693

So they're likely to try to disable indexing file content (while leaving "Enable File Search" checked). They may even `kill -9 baloo_file_extractor` and reboot just to be sure it's dead.

Sooner or later, whenever Baloo kicks back in, it may also restart indexing file content, despite being disabled. On the computer this happened on, I caught baloo_file_extractor hanging (again) with 100% CPU use and several GB of memory eaten on one particular PDF file, the same issue that triggered my comment here: https://bugs.kde.org/show_bug.cgi?id=380456#c14

To state the obvious, Baloo should *never* resurrect an already-killed file extraction if "index file content" is disabled.

STEPS TO REPRODUCE
1. Enable File Search and content indexing in system settings.
2. Add a file to a directory indexed by Baloo which will cause baloo_file_extractor to hang for a while and consume system resources. (There are lots of examples of such files; because Baloo uses external libraries to extract file contents, e.g. poppler, it's dependent on them to be well behaved with the files they open.)
3. While baloo_file_extractor is trying to extract text from the file, kill it manually and disable file content indexing in the settings.
4. Wait a while. (In my case, at least one reboot had gone by since disabling content indexing.)

OBSERVED RESULT
Baloo appears to resume the partially completed indexing process that the user previously killed, including indexing files - in particular the file or files that were causing problems for the indexer.

EXPECTED RESULT
Baloo should respect the changed settings and not try to index the content of any files.

SOFTWARE/OS VERSIONS
Linux: Arch Linux x86_64 (kernel 5.14.14)
KDE Plasma Version: 5.23.2
KDE Frameworks Version: 5.87.0
Qt Version: 5.15.2

Comment 1 Nate Graham 2021-10-28 14:31:45 UTC

kreadconfig5 --file baloofilerc --group "Basic Settings" --key Indexing-Enabled

systemctl --user status plasma-plasmashell.service

systemctl --user status kde-baloo.service

systemctl --user status plasma-baloorunner.service

Comment 2 Bug Janitor Service 2021-11-12 04:39:21 UTC

Dear Bug Submitter,

This bug has been in NEEDSINFO status with no change for at least
15 days. Please provide the requested information as soon as
possible and set the bug status as REPORTED. Due to regular bug
tracker maintenance, if the bug is still in NEEDSINFO status with
no change in 30 days the bug will be closed as RESOLVED > WORKSFORME
due to lack of needed information.

For more information about our bug triaging procedures please read the
wiki located here:
https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging

If you have already provided the requested information, please
mark the bug as REPORTED so that the KDE team knows that the bug is
ready to be confirmed.

Thank you for helping us make KDE software even better for everyone!

Comment 3 Adam Fontenot 2021-11-16 00:02:50 UTC

Created attachment 143606 [details]
log of requested debugging commands

(In reply to Nate Graham from comment #1)
> kreadconfig5 --file baloofilerc --group "Basic Settings" --key
> Indexing-Enabled
> 
> systemctl --user status plasma-plasmashell.service
> 
> systemctl --user status kde-baloo.service
> 
> systemctl --user status plasma-baloorunner.service

I am attaching the output as a text file, since the output from systemctl is extremely long.

Note that I reported this bug after the issue occurred, and when that happened I once again killed Baloo and (if I recall correctly) deleted all of its cached files. The issue has not reoccurred since then. So I'm not sure whether this information will be useful.

That said, I have not changed any settings since the issue originally appeared. File content indexing continues to be disabled in the settings window, while basic indexing is enabled. The first command you gave prints no output (even when I correct the file name of the baloofilerc file to be the one on my system). However, examining the file, I see that "only basic indexing" is set to true.
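
For anyone checking the same thing, here is a minimal sketch of reading a key straight out of the config file without kreadconfig5. The helper name read_ini_key is hypothetical, and the "General" group name and the ~/.config/baloofilerc path used in the usage comment are assumptions about a typical setup, so adjust them to whatever the file on your system actually contains.

```shell
# read_ini_key FILE GROUP KEY
# Prints the value of KEY inside [GROUP] of an INI-style file, or nothing
# if the group or key is absent. Hypothetical helper, not part of Baloo.
read_ini_key() {
    awk -v group="$2" -v key="$3" '
        # A line of the form [Name] starts a new group.
        /^\[.*\]$/ { in_group = ($0 == ("[" group "]")); next }
        in_group {
            eq = index($0, "=")
            # Compare the whole text before "=" so keys with spaces work.
            if (eq > 0 && substr($0, 1, eq - 1) == key) {
                print substr($0, eq + 1)
                exit
            }
        }
    ' "$1"
}

# Usage (group name and path are assumptions):
#   read_ini_key "$HOME/.config/baloofilerc" "General" "only basic indexing"
```

An empty result only means the key is not written in that group of the file; it does not say which default Baloo applies in that case.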

Note: my best guess for the cause of this issue is that if Baloo has a file content indexing operation in progress, this operation is terminated, and then file content indexing is disabled, in at least certain cases when normal file indexing is resumed thereafter, it will also resume trying to index the content of the file / files that were in progress previously. If those files were causing the file indexer to hang, Baloo will also hang once again. For that reason, disabling file content indexing *and* deleting Baloo's cache (which had grown to an enormous size on this small SSD) prevented the problem from appearing again, because any leftovers from the in-progress indexing operation were deleted.

Comment 4 tagwerk19 2021-11-16 09:34:28 UTC

(In reply to Adam Fontenot from comment #0)
> ... Just in time for Halloween... try to pause Baloo using system settings, but this
> (frequently? always?) doesn't work, as I describe in this bug: Bug 443693
I've also noticed this: baloo_file seems not to respond to events while waiting for baloo_file_extractor to complete. This is complicated by the fact that baloo_file_extractor indexes batches of files (40 files, and then the next 40, and then...)
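
As a rough mental model only (this is not Baloo's code, and the function name and the pause flag file are hypothetical), batch processing where a pause request is checked only between batches can be sketched like this. It shows why a pause might appear to be ignored until the current batch finishes.

```shell
# run_in_batches BATCH_SIZE PAUSE_FLAG < file_list
# Reads one path per line, groups the lines into batches, and looks for
# the pause flag file only after a full batch. Prints one line per batch.
run_in_batches() {
    batch_size=$1
    pause_flag=$2
    count=0
    while IFS= read -r path; do
        count=$((count + 1))
        if [ "$count" -eq "$batch_size" ]; then
            echo "batch: $count files"
            count=0
            # A pause request is only noticed here, after a whole batch.
            if [ -e "$pause_flag" ]; then
                echo "paused"
                return 0
            fi
        fi
    done
    # Report the final, possibly smaller, batch.
    if [ "$count" -gt 0 ]; then
        echo "batch: $count files"
    fi
    return 0
}
```

With 100 input lines and a batch size of 40, this reports batches of 40, 40 and 20. If the flag file exists, it stops after the first batch rather than immediately.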

> Sooner or later, whenever Baloo kicks back in, it may also restart indexing
> file content, despite being disabled.
Baloo should really not restart on its own. If disabled it should stay disabled - although if the baloo_file process died or was killed it would be restarted (at least) at the next logon. 

> To state the obvious, Baloo should *never* resurrect an already-killed file
> extraction if "index file content" is disabled.
Agreed. But that's tricky.

The indexing has to recognise whether it's interrupted, say, from a log out or closedown (in which case it should quietly continue from where it was when you've logged on again) or you have disabled indexing / forcibly killed the process (in which case, baloo's not going to know why).

It should be that you can get a list of "failed" indexings with "balooctl failed" but I've not had a lot of luck with that - and there should probably be a manual way of flagging a file as "avoid/failed"

> OBSERVED RESULT
> Baloo appears to resume the partially completed indexing process that the
> user previously killed, including indexing files - in particular the file or
> files that were causing problems for the indexer.
Would need someone who knows the code here: whether baloo_file flags the files as "to be indexed" before it passes them to baloo_file_extractor. If that's the case it could be that baloo wants to complete "that" job...

> ... On the computer this happened on, I
> caught baloo_file_extractor hanging (again) with 100% CPU use and several GB
> of memory eaten on one particular PDF file, the same issue that triggered my
> comment here: https://bugs.kde.org/show_bug.cgi?id=380456#c14

(In reply to Adam Fontenot from comment #3)
> Note: my best guess for the cause of this issue is that if Baloo has a file
> content indexing operation in progress, this operation is terminated, and
> then file content indexing is disabled, in at least certain cases when
> normal file indexing is resumed thereafter, it will also resume trying to
> index the content of the file / files that were in progress previously. If
> those files were causing the file indexer to hang, Baloo will also hang once
> again. For that reason, disabling file content indexing *and* deleting
> Baloo's cache (which had grown to an enormous size on this small SSD)
> prevented the problem from appearing again, because any leftovers from the
> in-progress indexing operation were deleted.
I think that's true.

I'd also suspect the "very large" PDF being the reason for the large index (baloo will write a reverse index entry for each of the "random words"), however there are other things that can also trigger the index to balloon in size.

It's not always clear whether the delay in content indexing comes from the extraction of the index terms from the original files or that there's a large transaction being prepared (and once the index file has "got big" this can be very memory intensive). I've found iotop useful to follow the reads/writes.

Comment 5 Adam Fontenot 2021-11-16 23:41:11 UTC

(In reply to tagwerk19 from comment #4)
Just a couple of clarifications:
> > Sooner or later, whenever Baloo kicks back in, it may also restart indexing
> > file content, despite being disabled.
> Baloo should really not restart on its own. If disabled it should stay
> disabled - although if the baloo_file process died or was killed it would be
> restarted (at least) at the next logon. 
To be clear, the issue here is that *content* indexing was disabled (the setting "also index file content") in the file search settings, and baloo_file_extractor was killed. Normal indexing (just the file name / file search option) was *not* disabled. And so the specific issue I'm seeing here is that when the normal indexing resumes (possibly after a reboot, possibly not), baloo_file_extractor starts trying to index file content again despite that feature being disabled in the settings.

> I'd also suspect the "very large" PDF being the reason for the large index
> (baloo will write a reverse index entry for each of the "random words"),
> however there are other things that can also trigger the index to balloon in
> size.
The PDF in question is only 20 MB. It's "large" only in the sense that it has a ton of indexable text (according to the poppler devs). This suggests another pretty obvious heuristic in addition to those I mentioned in the bug report on Baloo memory use:

If the index for a file grows to be larger than the original file, kill the extraction process, add the file to a list of failed files, delete the index for it, and don't try indexing the content of the file again. 
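
A minimal sketch of that heuristic, assuming the per-file index size were available at all (as far as I know Baloo does not expose it, so both byte counts are hypothetical inputs here). The function names and the plain-text failed list are made up for illustration.

```shell
# index_exceeds_source SOURCE_BYTES INDEX_BYTES
# Succeeds (exit 0) when the index data attributed to a file is larger
# than the file itself, i.e. when the proposed heuristic says "give up".
index_exceeds_source() {
    [ "$2" -gt "$1" ]
}

# record_failed FILE LIST
# Appends FILE to a plain-text failed list unless it is already listed.
record_failed() {
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Usage (all variables are hypothetical):
#   if index_exceeds_source "$src_bytes" "$idx_bytes"; then
#       record_failed "$file" "$failed_list"
#   fi
```

The sketch leaves out killing the extractor and dropping the partial index, since how those would be done depends on Baloo internals.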

I realize the reality might be a bit more complicated than "just" doing that, but at the end of the day Baloo desperately needs some better heuristics given the large number of resource consumption issues users report with it.

Comment 6 tagwerk19 2021-11-17 09:13:09 UTC

(In reply to Adam Fontenot from comment #5)
> To be clear ... the specific issue I'm seeing here
> is that when the normal indexing resumes (possibly after a reboot, possibly
> not), baloo_file_extractor starts trying to index file content again despite
> that feature being disabled in the settings.
Thanks!

The implication, as I see it, is that baloo_file flags (in its index) that it's queued a batch of files for content indexing. That feels, well, strange :-/

> The PDF in question is only 20 MB. It's "large" only in the sense that it
> has a ton of indexable text (according to the poppler devs). This suggests
> another pretty obvious heuristic in addition to those I mentioned in the bug
> report on Baloo memory use:
I know there's a rough 10 MB limit for .txt and .html, see https://bugs.kde.org/show_bug.cgi?id=410680#c7. Larger files are not content indexed. There is an existing rationale for such a limit.

I'm happy to check behaviour if you can generate a test PDF/SVG and upload/attach it

> If the index for a file grows to be larger than the original file, kill the
> extraction process, add the file to a list of failed files, delete the index
> for it, and don't try indexing the content of the file again. 
I don't think there's an easy relation between the size of the source and the size of the index. The index contains "lookups", you type a search term and a list of hits gets pulled off disc. The design decision was for speed; you get a refined list of hits in Dolphin as you type more letters into the search box or view your files in folders based on the tags you've given them.

Comment 7 Adam Fontenot 2021-11-18 01:34:08 UTC

(In reply to tagwerk19 from comment #6)
> I'm happy to check behaviour if you can generate a test PDF/SVG and
> upload/attach it
Here's the original file that caused the problem: https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC

Note that it's likely to cause problems if you download it on a system with Baloo's content indexing enabled. So be cautious. And to be clear, I don't have any direct evidence to suggest that this file was responsible for Baloo's balooning index.

> > If the index for a file grows to be larger than the original file, kill the
> > extraction process, add the file to a list of failed files, delete the index
> > for it, and don't try indexing the content of the file again. 
> I don't think there's an easy relation between the size of the source and
> the size of the index. The index contains "lookups", you type a search term
> and a list of hits gets pulled off disc. The design decision was for speed;
> you get a refined list of hits in Dolphin as you type more letters into the
> search box or view your files in folders based on the tags you've given them.
That's a fair point. Let me put it a different way. 

The laptop in question has a 128 GB SSD. That's not an uncommon size for inexpensive laptops that come with an SSD. A user might reserve 50 GB or so for their home partition, and have let's say 35 GB of files on it. My point is just that it's *understandable* for such a user to be upset about an automatically enabled system component randomly deciding to use 5+ GB of the remaining free space. SSDs mean that storage is now often at a premium again, and many users will not be willing to trade a large percent of free space for slightly faster / better file searches.

So while I can't speak to the internal architecture or tradeoffs of Baloo, I can say from a user perspective that an index of files using more than 10% of the total size of those files feels really bad. If there's a good reason that you can't guarantee that the index for a file isn't larger than the file, let me suggest an alternative. Is the algorithm Baloo uses to decide whether to create a content index for a file tunable at all? Perhaps an option to limit the size of the Baloo cache could be provided: either X GB or X% of free space. Given the available space, Baloo could manage its storage to not index files that are less usefully indexed. E.g. if there's one file that is 20 MB but using 2 GB of index space, it's going to be the first to go.
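
To make the suggestion concrete, here is a sketch of budget-based eviction under stated assumptions: it pretends a table of "source bytes, index bytes, path" per file exists (Baloo may not track this), and it drops the files with the worst index-to-source ratio first until the total fits the budget. The function name and input format are hypothetical, and paths are assumed not to contain spaces.

```shell
# pick_evictions BUDGET_BYTES < table
# Each input line: SOURCE_BYTES INDEX_BYTES PATH
# Prints the paths whose content index would be dropped, worst
# index-to-source ratio first, until the remaining total fits the budget.
pick_evictions() {
    # Prefix each row with its ratio; skip zero-byte sources.
    LC_ALL=C awk '$1 > 0 { printf "%.6f %s %s\n", $2 / $1, $2, $3 }' |
        LC_ALL=C sort -k1,1nr |
        LC_ALL=C awk -v budget="$1" '
            { size[NR] = $2; path[NR] = $3; total += $2 }
            END {
                for (i = 1; i <= NR && total > budget; i++) {
                    print path[i]
                    total -= size[i]
                }
            }'
}
```

With a 1 GB budget and one 20 MB file holding 2 GB of index, that file is the only one printed, which matches the "first to go" case above.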

At any rate, my reasoning for limiting the size of the index on original files was that it's a pretty good heuristic for filtering out files that don't contain indexable content. For example, biologists frequently use plain text "SAM" files, which contain long strings of meaningful but not indexable text, representing bits of DNA and metadata. E.g. "ATAGCACTCAAGCAATCAAATCAAATAGCCAACTCCTTATCTCAACTCTCC". These files might be under 10 MB, and they might have a .sam, .txt, or no extension at all. Obviously such files should not be indexed, but it's difficult for a user to ensure they're eliminating them all. This goes back to the "just works" principle: insofar as possible, content indexing should quietly make searches better without ever significantly impacting system resources. This implies the need for heuristics to prevent indexing files like this.
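
One cheap heuristic of this kind, sketched under assumptions: measure what share of whitespace-separated tokens in a text file are implausibly long for natural language. The function name is hypothetical and the default cut-off of 40 characters is an arbitrary guess, not anything Baloo uses.

```shell
# long_token_percent FILE [MAX_LEN]
# Prints the integer percentage of whitespace-separated tokens in FILE
# that are longer than MAX_LEN characters (default 40, an arbitrary guess).
long_token_percent() {
    awk -v max="${2:-40}" '
        {
            for (i = 1; i <= NF; i++) {
                ntokens++
                if (length($i) > max) nlong++
            }
        }
        END {
            if (ntokens == 0) print 0
            else printf "%d\n", (100 * nlong) / ntokens
        }
    ' "$1"
}
```

An indexer could skip content indexing for files where this percentage is high, although where to draw that line would need testing on real data.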

Comment 8 tagwerk19 2021-11-18 11:29:05 UTC

(In reply to Adam Fontenot from comment #7)
> Here's the original file that caused the problem:
> https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC
You are right about the warning... also best not to open the link with a browser that wants to render PDFs itself 8-]

Yes, 20MB but the plot content is compressed, as plaintext it could be *very* much larger. It is titled "R Graphics Output" so maybe there's a possibility to recognise such files - even though I'm sure "R" allows you to set a title yourself.

> That's a fair point. Let me put it a different way. 
Good arguments...

> ... Perhaps an
> option to limit the size of the Baloo cache could be provided: either X GB
> or X% of free space. Given the available space, Baloo could manage its
> storage to not index files that are less usefully indexed. E.g. if there's
> one file that is 20 MB but using 2 GB of index space, it's going to be the
> first to go ...
I don't know "the internals" well enough to say. I do know that the underlying library (LMDB) is designed to withstand normal desktop misuse (killing processes, turning things off in the middle of an update). You can get times when the index grows because a transaction is being appended while another process is reading the index... Another design decision.

For the 20MB PDFs, it may be that indexing the first file generates a 2 GB index but the second one only adds a few additional MB. There's no guessing with edge cases...

> ... For example, biologists frequently use
> plain text "SAM" files, which contain long strings of meaningful but not
> indexable text, representing bits of DNA and metadata. E.g.
> "ATAGCACTCAAGCAATCAAATCAAATAGCCAACTCCTTATCTCAACTCTCC". These files might be
> under 10 MB, and they might have a .sam, .txt, or no extension at all.
In this case, I'd hope that SAM files have their own Mimetype (although it looks like they don't... perhaps it's possible to build a rule if the files follow the "Recommended Practice").

I know the SAM files were just an example but if you _did_ want to index them, you'd hit baloo's "25 character limit" (Bug 412421) :-/  See this with:

    $ echo "abcdefghijklmnopqrstuvwxyz" > testfile.txt

    $ balooshow -x testfile.txt
    13fc000000fc01 64513 1309696 testfile.txt [/home/user/Documents/testfile.txt]
            Mtime: 1637231394 2021-11-18T11:29:54
            Ctime: 1637231394 2021-11-18T11:29:54
            Cached properties:
                    Line Count: 1

    Internal Info
    Terms: Mplain Mtext T5 T8 X20-1 abcdefghijklmnopqrstuvwxy
    File Name Terms: Ftestfile Ftxt
    XAttr Terms:
    lineCount: 1

    $ baloosearch abc
    /home/user/Documents/testfile.txt
    Elapsed: 0.31964 msecs

    $ baloosearch abcdefghijklmnopqrstuvwxyz
    Elapsed: 0.215223 msecs

So there are compromises here as well... In a way it's a question of what you mean by "just works"...
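
The effect seen in the balooshow output above can be mimicked in a few lines. This is only an illustration of cutting a term to 25 characters. It is not Baloo's actual code, and the function name is hypothetical.

```shell
# truncate_term TERM [LIMIT]
# Prints TERM cut to at most LIMIT characters (default 25), mimicking the
# truncated term shown by balooshow. Illustration only.
truncate_term() {
    printf '%s\n' "$1" | cut -c "1-${2:-25}"
}

# Usage:
#   truncate_term abcdefghijklmnopqrstuvwxyz
```

For the 26-letter test string this yields the same 25-letter term that appears under "Terms:", so a search for the full string has nothing to match exactly, presumably because the query is not truncated the same way.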