Bug 394750

Summary: baloo_file fills RAM and disk for hours with no visible progress
Product: [Frameworks and Libraries] frameworks-baloo
Reporter: Thaddee Tyl <thaddee.tyl>
Component: Baloo File Daemon
Assignee: baloo-bugs-null
Status: RESOLVED WORKSFORME
Severity: normal
CC: adabreug94, dag, gwarser, hugojmaia, igor.poboiko, joh82875, mike.d.lui, nate, peter.mueller_1955, tagwerk19, themichaeleden, viniciusbrbio
Priority: NOR
Version: 5.96.0
Target Milestone: ---
Platform: Neon
OS: Linux
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description Thaddee Tyl 2018-05-27 13:30:48 UTC
The baloo_file process has been running for five hours and uses about 4±2 GiB of RAM, causing swapping, and not a single file has been indexed yet:

$ balooctl -v
baloo 5.46.0
$ balooctl status
Baloo File Indexer is running
Indexer state: Initial Indexing
Indexed 0 / 0 files
Current size of index is 21.26 GiB
$ ps -C baloo_file -o comm,etime,%cpu,%mem,vsz,rss
COMMAND             ELAPSED %CPU %MEM    VSZ   RSS
baloo_file         05:09:33 43.6 32.5 274650148 3965904
$ ls -lh .local/share/baloo/index
-rw-rw-r-- 1 tyl tyl 22G May 27 14:04 .local/share/baloo/index

This link suggested I file this bug: https://community.kde.org/Baloo/Debugging.

I really like the idea of Baloo, so I wish for it to work a bit better.

I don't know how often Baloo works flawlessly. My setup is barely unusual: I have some directories with a million small files (records of Go games obtained from this command: https://github.com/espadrine/badukjs/blob/master/Makefile#L13), and some files which are quite big, like a few Linux .iso. In total, I have about 150 GiB in /home — including the 22 GiB of Baloo index, which is now a significant amount of "0 files indexed".

If that large folder and the iso are the files that baloo_file chokes on, could we make Baloo give up if it spends more than 10 seconds on a single file or folder? (An `ls` on the Go games folder takes 11 minutes.)

But really, I only care about indexing the contents of my PDFs and LibreOffice documents, and maybe my images. All told, a few thousand files.

Philosophically, it makes more sense to whitelist files by type than to index files that are unlikely to be properly read. Looking through the configuration parameters, it looks like files are blacklisted by type. It would make more sense to whitelist them: there are more file types that are unreadable than there are supported ones. Most users only care about indexing of .pdf, .docx and .jpg files, maybe a handful of others. I don't see a use-case for indexing an .iso file. Yet it is neither in excludeFilters nor in excludeMimetypes by default.
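
To make the whitelist idea concrete, here is a minimal sketch of filtering candidate files by extension before they reach a content indexer. It is not Baloo code: the `ALLOWED_EXTENSIONS` set and the `should_index` helper are hypothetical names chosen for illustration, and Baloo itself is configured through exclude lists such as excludeFilters and excludeMimetypes.

```python
from pathlib import PurePath

# Hypothetical whitelist; the extensions are examples, not Baloo defaults.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".odt", ".jpg"}


def should_index(path: str, allowed: set = ALLOWED_EXTENSIONS) -> bool:
    """Return True only if the file extension is on the whitelist."""
    return PurePath(path).suffix.lower() in allowed


candidates = ["thesis.pdf", "neon.iso", "game.sgf", "Photo.JPG"]
selected = [p for p in candidates if should_index(p)]
print(selected)
```

With a whitelist, unknown types such as .iso or .sgf are skipped by default instead of needing an explicit exclusion.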

Aside. Is Baloo indexing file paths themselves? It would be both pretty inefficient and a duplication of effort, since mlocate does it stellarly and yet unnoticeably. /var/lib/mlocate is 98 MiB and `locate *.pdf` takes about a second to run.

Could we make Baloo stream its processing? For each file extension in the whitelist we discussed, it would regularly use locate(1) to get them, feed them to the content indexer if they were updated, and that's it.

Finally, when Baloo does pointless busywork, it would be welcome to have more debugging tools.
balooctl could have a command to debug what baloo_file is currently indexing.
Comment 1 Mike Lui 2018-05-27 14:20:11 UTC
I can confirm this issue as well on ArchLinux with the same version.

Originally my disk drive filled up, apparently causing the index db to get corrupted. 

baloo_file_extr was eating up cpu and my journal was filled with:
May 27 09:38:17 mymachine kdeinit5[2549]: ()
May 27 09:38:18 mymachine kdeinit5[2549]: ()
May 27 09:38:18 mymachine kdeinit5[2549]: ()
May 27 09:38:18 mymachine kdeinit5[2549]: ()
May 27 09:38:20 mymachine kdeinit5[2549]: ()
May 27 09:38:20 mymachine kdeinit5[2549]: ()
May 27 09:38:20 mymachine kdeinit5[2549]: ()

So I nuked baloo with `balooctl disable`.
After re-enabling it with `balooctl enable`, I'm seeing the behavior described in the original description.
Comment 2 Mike Lui 2018-05-27 14:51:48 UTC
Possibly related, balooctl stop/suspend/disable has no effect on the baloo_file process. I had to kill it manually.
Comment 3 Michael Eden 2018-06-25 03:33:24 UTC
I am also seeing this issue: baloo_file_extractor uses all of RAM and half of swap (20GB total), slowing down the entire desktop. I'll try to investigate further and see what's using up so much memory.
Comment 4 Igor Poboiko 2018-10-12 09:30:14 UTC
(In reply to Thaddee Tyl from comment #0)
> If that large folder and the iso are the files that baloo_file chokes on,
> could we make Baloo give up if it spends more than 10 seconds on a single
> file or folder? (An `ls` on the Go games folder takes 11 minutes.)
I guess in that case Baloo can indeed choke on this directory. And, I guess, this setup is indeed somewhat unusual.
I suggest adding this folder to the "exclude" list (it's available inside "systemsettings", in the "Workspace -> Search -> File Search" category).

Do these Go game files have a special mime-type?
We have a list of blacklisted mimetypes inside Baloo (which currently includes mostly source-code files), so we can blacklist it by default (as it hardly contains useful information for indexing, right?)

> Philosophically, it makes more sense to whitelist files by type than to
> index files that are unlikely to be properly read. Looking through the
> configuration parameters, it looks like files are blacklisted by type. It
> would make more sense to whitelist them: there are more file types that are
> unreadable than there are supported ones. Most users only care about
> indexing of .pdf, .docx and .jpg files, maybe a handful of others. I don't
> see a use-case for indexing an .iso file. Yet it is neither in
> excludeFilters nor in excludeMimetypes by default.

Baloo relies on KFileMetaData framework to index files: if we can extract data from file, we do it. But it only supports use-cases relevant for users (documents, pictures, audio files, etc.). I'm not really sure iso-files are being indexed at all, as KFileMetaData does not support them (well, because there is not much to be extracted that is relevant for user...).

> Aside. Is Baloo indexing file paths themselves? It would be both pretty
> inefficient and a duplication of effort, since mlocate does it stellarly and
> yet unnoticeably. /var/lib/mlocate is 98 MiB and `locate *.pdf` takes about
> a second to run.
It does. Also, Baloo is supposed to be self-sufficient, it's not supposed to be used together with mlocate, it's a separate indexing system.

> Finally, when Baloo does pointless busywork, it would be welcome to have
> more debugging tools.
> balooctl could have a command to debug what baloo_file is currently indexing.
"balooctl monitor" does that.
(well, unfortunately it was not possible to print _current_ file being indexed, but this will be fixed in 5.52 release, see https://cgit.kde.org/baloo.git/commit/?id=a9696978322c08d19ece0a67f430aee391e3918d)
Comment 5 Nate Graham 2018-11-26 21:34:21 UTC
Michael, we'd be happy to add the go files to the blacklist if you can 1) confirm that their content is never worth indexing and 2) provide their file extension. Thanks!
Comment 6 Michael Eden 2018-11-28 15:21:29 UTC
Nate, I never found out what baloo was hanging on; Igor Poboiko found the Go game files issue. Is there a way to see what baloo is doing? (`balooctl status` isn't working. I guess I can strace?)
Comment 7 Nate Graham 2018-11-28 20:48:38 UTC
FWIW `balooctl status` should work much better in the upcoming KDE Frameworks 5.53, if you can update.
Comment 8 viniciusbr 2019-01-29 18:43:59 UTC
I found a similar problem with baloo 5.54.0 on KDE neon 5.14 in a fresh installation. Baloo was consuming 100% of the cpu. I disabled it and the cpu went back to normal.
Comment 9 Antoscha 2019-05-14 09:22:42 UTC
*** Bug 403866 has been marked as a duplicate of this bug. ***
Comment 10 p 2019-11-05 17:08:40 UTC
Same issue here on a fresh install of opensuse 15.1. It kills my system completely. Makes baloo unusable. Sad. Considering that the report was opened one and a half years ago, I assume we can't hope for a fix. Sad.
Comment 11 Igor Poboiko 2019-11-08 11:20:56 UTC
(In reply to p from comment #10)
> Same issue here on a fresh install of opensuse 15.1. It kills my system
> completely. Makes baloo unusable. Sad. Considering that the report was opened
> one and a half years ago, I assume we can't hope for a fix. Sad.

So, which files does it choke on? 
What does `balooctl status` / `balooctl monitor` report?
Comment 12 Nate Graham 2020-10-26 19:28:38 UTC
*** Bug 427819 has been marked as a duplicate of this bug. ***
Comment 13 tagwerk19 2021-08-01 10:37:44 UTC
(In reply to Thaddee Tyl from comment #0)
> ... My setup is barely unusual ...
May I read that as "My setup is fairly unusual"? 8-)

> ... I have some directories with a million small files (records of
> Go games obtained from this command:
> https://github.com/espadrine/badukjs/blob/master/Makefile#L13)

Wow...

Maybe some time has gone by and the number of recorded games has crept up but I've just downloaded and unpacked nearly 2 million .sgf files (that end up in a single, flat, directory).

That's going to be a torture test!

First off. Yes, I see the described behaviour:

    baloo_file fills RAM and disk for hours with no visible progress

This is with the current Neon Unstable...

    Plasma: 5.22.80
    Frameworks: 5.85.0
    Qt: 5.15.3
    Filesystem: Ext4 

This hadn't been marked "Confirmed" but, yes, reproducible...

Digging down into the "torture test"; extracting the files from the tar archives overwhelms iNotify. Baloo reports

    Inotify - too many event - Overflowed

Baloo attempts to index the files for which it gets a notification, but it will only discover "the remainder" on a "balooctl check" or on the next logon.

I see "baloo_file" running at 100% and with steadily growing memory use. It's listing all the files it will need to index (it's not got as far as indexing content). However I see the same behaviour with content indexing disabled, so it is an issue with baloo_file and not baloo_file_extractor.

It seems that baloo_file wants to build the list of unindexed files as a single transaction. "balooctl check" does not show anything happening; the information is being collected but not appearing on disc.

Testing on a VM with 16 GB RAM, I could index 1.4 million files (it took almost an hour, without content indexing) and it was possible to see the memory use creeping up during the process and the results committed to disc right at the end.

With the full 2 million files, it filled RAM and swap in 90 minutes and baloo_file hung with what looked like a corrupt/truncated index written to disc (the file size of the index was the size of RAM. Interesting, but maybe a coincidence).

It was possible to index the full 2 million files if they were copied "in batches" into an indexed directory and baloo_file allowed to catch up after each copy.

I think there is something to be fixed here...

    When baloo is indexing content it does it with batches of files
    (40 files, then the next 40 and so on) and commits the results after
    each. It would make sense to batch the initial indexing, something
    like a commit every 15 seconds perhaps. That would also allow people
    to see that something was happening with "balooctl status"
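
The batching idea above can be sketched as follows. This is an illustration only, not Baloo's implementation: `batched`, `index_in_batches` and the `commit` callback are hypothetical names, and the sketch commits after a fixed number of files rather than after a time interval so that its behaviour is deterministic.

```python
from typing import Callable, Iterable, Iterator, List


def batched(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size paths."""
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


def index_in_batches(
    paths: Iterable[str],
    commit: Callable[[List[str]], None],
    batch_size: int = 40,
) -> int:
    """Commit each batch as soon as it is full; return the number of files seen."""
    total = 0
    for batch in batched(paths, batch_size):
        commit(batch)  # stand-in for writing one transaction to the index
        total += len(batch)
    return total


committed: List[List[str]] = []
total = index_in_batches(
    (f"game{i}.sgf" for i in range(100)), committed.append, batch_size=40
)
print(total, [len(b) for b in committed])
```

Because each batch is committed and then dropped, memory use is bounded by the batch size rather than by the total number of files, and progress becomes visible after every commit.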

More speculatively...

    The "40 file" batch size for content indexing is very, very low for
    the small .sgf files; the full text index would take days (weeks?) to
    complete. This limit can shrink, maybe it should be allowed to grow
    as well.

I'd place the baloo_file and baloo_file_extractor issues into different pigeon holes here.
Comment 14 hugomaia 2022-07-14 12:18:15 UTC
I can confirm this is still a problem. RAM usage isn't really that much of an issue; it caps out at 4 GB out of the 16 GB my system has.
The problem is both disk usage and one CPU thread pegged at 100%.

I never had this issue until recently, but I can pin down exactly what caused Baloo to malfunction.

What I did:
A couple of weeks ago I took up modding Grim Dawn.
To modify anything in that game you're required to fully extract a database file which then becomes about 60k loose text files.
I then began using the search function in Kate to give me the ability to do mass edits on those files.
Since it was rapid file creation and deletion (Kate creates swap files every time you make a change to something and then deletes those when you save) and overall changes to a ton of files, that's when Baloo started going haywire on my system.

I hope this helps in re-enacting the issue with Baloo.
Comment 15 tagwerk19 2022-07-14 19:41:47 UTC
(In reply to hugomaia from comment #14)
> ... rapid file creation and deletion (Kate creates swap files every
> time you make a change to something and then deletes those when you save)
> and overall changes to a ton of files, that's when Baloo started going
> haywire on my system.
If I edit a "testfile.txt" file, I see Kate creating a ".testfile.txt.kate-swp" (checking on Neon Testing)

The question is whether these are a problem...

    If I have content indexing and "Index hidden files and folders" enabled and run
    "balooctl monitor", I see baloo indexing the ".kate-swp" file periodically as the source
    file is edited.

    Checking my ~/.config/baloofilerc file, I see that "*.swp" files are excluded but there's
    no mention of "*.kate-swp"

You might check to see whether you are indexing hidden files and disable this if not needed. It would probably make sense to edit your ~/.config/baloofilerc and add an exclusion for "*.kate-swp".
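
As a rough sketch of that edit, the snippet below appends a pattern to a comma-separated filter list in a baloofilerc-style text fragment. Several things here are assumptions rather than confirmed details: that the list lives under a `[General]` group in a key named `exclude filters`, that the sample fragment resembles a real file, and the `add_exclude_filter` helper itself, which is hypothetical. A real baloofilerc may contain constructs that Python's `configparser` does not accept, so check your own file (and back it up) before editing it.

```python
import configparser
import io


def add_exclude_filter(rc_text: str, pattern: str) -> str:
    """Return rc_text with pattern appended to the exclude filter list."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(rc_text)
    if not parser.has_section("General"):
        parser.add_section("General")
    # "exclude filters" is an assumed key name, see the note above.
    current = parser.get("General", "exclude filters", fallback="")
    filters = [f for f in current.split(",") if f]
    if pattern not in filters:  # do not add the same pattern twice
        filters.append(pattern)
    parser.set("General", "exclude filters", ",".join(filters))
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Made-up sample fragment, not copied from a real configuration.
sample = "[General]\nexclude filters=*~,*.swp\n"
updated = add_exclude_filter(sample, "*.kate-swp")
print(updated)
```

The helper is idempotent: running it again with the same pattern leaves the list unchanged.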

I suppose that if you are modifying "a ton" of files, you might be giving baloo a lot to do. It may also be that you are hitting Bug 442453, where baloo is having to delete "large numbers" of files.
Comment 16 tagwerk19 2022-07-16 14:21:08 UTC
(In reply to tagwerk19 from comment #15)
>     Checking my ~/.config/baloofilerc file, I see that "*.swp" files are
>     excluded but there's no mention of "*.kate-swp"
See also:
    https://bugs.kde.org/show_bug.cgi?id=269518#c9
Comment 17 tagwerk19 2024-07-04 07:05:11 UTC
(In reply to tagwerk19 from comment #13)
> I think there is something to be fixed here...
> 
>     When baloo is indexing content it does it with batches of files
>     (40 files, then the next 40 and so on) and commits the results after
>     each. It would make sense to batch the initial indexing, something
>     like a commit every 15 seconds perhaps. That would also allow people
>     to see that something was happening with "balooctl status"

That's been done. There's been a patch to limit memory use with systemd/cgroups:
     https://invent.kde.org/frameworks/baloo/-/merge_requests/121
together with the fix speculated about above, to commit regularly during the initial indexing:
     https://invent.kde.org/frameworks/baloo/-/merge_requests/148

I'll leave this "Waiting for Info" but I think it can probably be closed...
Comment 18 Bug Janitor Service 2024-07-19 03:46:35 UTC
Dear Bug Submitter,

This bug has been in NEEDSINFO status with no change for at least
15 days. Please provide the requested information as soon as
possible and set the bug status as REPORTED. Due to regular bug
tracker maintenance, if the bug is still in NEEDSINFO status with
no change in 30 days the bug will be closed as RESOLVED > WORKSFORME
due to lack of needed information.

For more information about our bug triaging procedures please read the
wiki located here:
https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging

If you have already provided the requested information, please
mark the bug as REPORTED so that the KDE team knows that the bug is
ready to be confirmed.

Thank you for helping us make KDE software even better for everyone!
Comment 19 Bug Janitor Service 2024-08-03 03:46:29 UTC
🐛🧹 This bug has been in NEEDSINFO status with no change for at least 30 days. Closing as RESOLVED WORKSFORME.