Bug 492808 - Baloo keeps writing to disk with 50MB/s for hours non-stop, causing massive system lags
Summary: Baloo keeps writing to disk with 50MB/s for hours non-stop, causing massive system lags
Status: RESOLVED WORKSFORME
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: general
Version: 6.5.0
Platform: Other Linux
Priority: NOR
Severity: grave
Target Milestone: ---
Assignee: baloo-bugs-null
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-09-08 12:28 UTC by Ellie
Modified: 2024-10-10 03:47 UTC
CC List: 1 user

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:


Attachments

Description Ellie 2024-09-08 12:28:22 UTC
SUMMARY

Baloo keeps writing to disk at 50MB/s for hours non-stop, causing massive system lags. Simply launching a new terminal would sometimes hang for 5 seconds. I am suggesting severity "grave" because if I hadn't caught this after some hours (which an inexperienced user might not have), I assume that after some days it could have damaged the SSD permanently. My apologies if I'm simply missing something here, but I'm pretty sure that was a real possibility.

STEPS TO REPRODUCE

1. Find the entire system lagging to the point where most things accessing the disk cause 5-second hangs
2. Install iotop
3. Find baloo_file_extractor with constant 50MB/s disk writes
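
(For reference, a command along these lines should show the per-process write rate; the exact flags may vary between iotop builds:

    $ sudo iotop -o -P

where -o lists only processes currently doing I/O and -P shows processes rather than individual threads.)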

OBSERVED RESULT

baloo seems to completely overwhelm the disk with writes for hours(!) without a break, at an extremely high throughput

EXPECTED RESULT

baloo behaves properly and doesn't pose a seemingly high risk of damaging the storage, let alone make the system a pain to use with constant 5-10 second freezes

SOFTWARE/OS VERSIONS

Linux/KDE Plasma: postmarketOS Edge
KDE Plasma Version: 6.1.4
KDE Frameworks Version: 6.5.0
Qt Version: 6.7.2

ADDITIONAL INFORMATION
Comment 1 Ellie 2024-09-08 12:53:31 UTC
From my calculations, it may have written around 20TB to this disk by the time I tracked down what the problem was, after around 5 hours:

>>> (50000000 * 60 * 60 * 24 * 5) / (1024 * 1024 * 1024)
20116.567611694336

Even assuming it had some minor pauses and my estimate is a factor of 10 too high, which I have no particular reason to assume, 2TB written is still not something a buggy program should just do to an SSD on a random day, uninstructed.

While this might easily not have been a problem if the writes had gone to a RAM disk, iotop and iostat suggested they went to the actual disk, so that doesn't seem to have been the case here.
Comment 2 Ellie 2024-09-08 12:55:47 UTC
Small update: after some digging around I found that the disk self-reports around 8TB written, so given there was some previous usage, the actual amount may have been around 2-3TB instead. Nevertheless, that still doesn't seem like a great situation.
Comment 3 tagwerk19 2024-09-08 14:32:55 UTC
(In reply to Ellie from comment #0)
> 3. Find baloo_file_extractor with constant 50MB/s disk writes
The mention of "baloo_file_extractor" means you are in the middle of indexing content... The alternative would be baloo_file catching up with moves and deletes.

See whether "balooctl monitor" tells you what file is being indexed, and whether it is a single file being indexed repeatedly or you are in the middle of reindexing everything.
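
For example (note that on Frameworks 6 the command-line tools may be installed with a "6" suffix, i.e. balooctl6):

    $ balooctl monitor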

If it is a single file, might it be continuously changing (a log file)? Watch out if you are indexing hidden files and folders; there may be files that are hidden for a reason. Ditto if you have just installed wine or steam; they can install large numbers of strange files...

> 1. Find entire system lagging to the point where most things accessing the disk cause 5 second hangs
The times I've noticed that are when the extractor crashes and the system is repeatedly creating crash dumps; that can really drag performance down. Secondly, if the reindexing forces the system to swap, you do *not* want that to happen...

Have a look and see what:
    systemctl --user status kde-baloo
tells you....
Comment 4 Ellie 2024-09-08 14:53:43 UTC
The process listed in iotop was "baloo_file_extractor" specifically. The write amount just looked way too large for the index file, which I later found to be less than 2GB in total, since that's what KDE System Settings claimed when I fully deleted and wiped it. Is there any reasonable scenario where that leads to a constant 50MB/s over hours, so badly that the entire system freezes up all the time? I don't really think there is.

I don't think postmarketOS Edge collects crash dumps, and at least as of now it doesn't have systemd support. It uses OpenRC.
Comment 5 tagwerk19 2024-09-08 16:17:25 UTC
(In reply to Ellie from comment #4)
> ... postmarketOS Edge ... It uses OpenRC ...
My apologies, I missed the line in your summary.

I'm not sure whether that gives you the same functionality (of limiting the use of resources by setting up cgroups). On systemd-based systems, Baloo is limited in the amount of memory it can "grab" through a unit file; you can say it should not use more than 25% of RAM and never swap. That's the "backstop" protection: the OS prevents the process taking more...
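
For illustration only (the values below are examples, not necessarily what kde-baloo.service ships with), a user-level override on a systemd system would look something like:

    $ systemctl --user edit kde-baloo
    # then add, for example:
    [Service]
    MemoryHigh=25%
    MemorySwapMax=0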

It would still be worth running "balooctl monitor"; that ought to give you the stream of filenames as Baloo indexes them. It's possible that you'll see a pattern: things like a log file being indexed over and over again... or Baloo gradually working its way through a massive Maildir folder, or stuck indexing a gigabyte .mbox file.
Comment 6 Ellie 2024-09-08 17:00:20 UTC
The overall system load was rather low; only the write I/O seems to have been maxed out, to the point of all I/O freezing up terribly. Given how long it was like that, I assume it must have continuously rewritten something in the index or written way too much into a log, since the resulting index file was around 1000x smaller than what I'm guessing the write load would roughly have produced.

I don't think openrc has any cgroups limit functionality, although I hope baloo wouldn't be written to entirely rely on that. Nevertheless, unless that limits write throughput, it likely wouldn't have mattered here.

Since I deleted the index and baloo, however, sort of in a panic I have to admit, I'm not sure I can easily get it back to the previous state.
Comment 7 tagwerk19 2024-09-08 18:31:49 UTC
(In reply to Ellie from comment #6)
> ... The overall system load was rather low, only the write I/O seems to have been
> maxed out to the point of all I/O terribly freezing up ...
I think the "watch points" are: Baloo taking all the RAM and squeezing out the rest of the system (the cgroups "cap" on memory usage prevents that), Baloo being forced to swap (it's being asked to do too much given the memory it has), Baloo repeatedly indexing a log file, or Baloo really having been given a big collection of files to index (wine or steam being cases in point).

If you've killed Baloo, purged the index and are reindexing, you might see the behaviour resurface.

One additional question: you are not using BTRFS? There were previously issues with it and Baloo, mainly affecting openSUSE and Fedora. They were fixed a year or so back, but all the same it is prudent to ask....

> ... Given for how long it
> was like that, I assume it must have continuously rewritten something in the
> index or written way too much into a log, since the resulting index file was
> around 1000x smaller than what I'm guessing the write load would have roughly
> produced ...
A bit stuck without knowing what was being indexed: what "balooctl monitor" or debugging would have shown.

> ... I don't think openrc has any cgroups limit functionality, although I hope baloo
> wouldn't be written to entirely rely on that. Nevertheless, unless that limits
> write throughput, it likely wouldn't have mattered here ...
It depends. It would not have made a difference to the total writes, but it might have affected when and how much swapping was happening. It probably would have avoided the slowdown of the system you were seeing...

From what I've been told, it's not particularly easy for a process to see what impact it is having on the system - whether it is using too much RAM, forcing the system to swap, or waiting too long for a write to complete. Baloo is exceptionally careful with CPU and backs off when you are doing anything on the system - you would see it indexing 40 files, writing the results to disc, waiting a second or so and repeating...

> ... Since I deleted the index and baloo however, sort of in a panic I have to
> admit, I'm not sure I can easily get it back to the previous state ...
That's fully understood :-)
Comment 8 Ellie 2024-09-08 19:14:45 UTC
I'm using BTRFS for most of my partitions, actually. A smaller part of the storage uses XFS, although I plan to migrate it at some point. The current kernel version is 6.10.8, if you're curious whether it possibly contains any relevant fixes. I checked the system load including RAM and it was definitely very low; it was just completely saturating disk I/O with writes for hours, in what seemed like a manner quite disproportionate to the actual index size.
Comment 9 tagwerk19 2024-09-08 20:18:11 UTC
(In reply to Ellie from comment #8)
> ... I'm using BTRFS for most of my partitions, actually ...
It's possible that was the root cause - although it depends a bit on how it was configured.

If you pick one of your files on an indexed BTRFS partition and run "stat":
    $ stat one-of-your-files.txt
and note down the device number and inode. Then, after a reboot, do the same and compare; maybe check after each reboot for a while. You are watching to see whether you get new/different device numbers on each reboot.
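
A compact way to record just those two numbers (assuming GNU coreutils stat) would be:

    $ stat -c 'dev=%d inode=%i' one-of-your-files.txt

run before and after a reboot; if the dev number differs between boots, you are hitting the behaviour described below.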
    
If you mount a BTRFS partition, you don't (necessarily) get the same device number. Baloo previously depended on a combination of the device number and inode as an internal "ID" for the file. If the device number changed, Baloo thought it had a set of new, unindexed files and indexed them again. It now digs deeper to get to the filesystem ID, which is invariant.
    
This means you might have had "some history" in your index and a load of reindexing after a reboot. The total writes were bad because of the reindexing; my guess is you will be OK if you reindex now. Indexing will be faster, the index size smaller and the impact on the system less.
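
Once it has finished, something like the following (the subcommand may carry a "6" suffix depending on the version) shows how big the new index ended up:

    $ balooctl indexSize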
Comment 10 Ellie 2024-09-08 22:16:18 UTC
The system was running for 5 hours with 20MB/s of writes almost constantly, which, if you calculate it, comes out in the terabyte range, not the less-than-2GB index file that it had produced by the time I shut it down. So I'm not sure this sounds like a changing device ID issue.
Comment 11 tagwerk19 2024-09-09 06:30:19 UTC
(In reply to Ellie from comment #10)
> ... So I'm
> not sure this sounds like a changing device id issue ...
Yes, unfortunately. It's quite likely to be.
    
It's the way Baloo works. It is lightning fast when you do searches because, if it has a word to search for, it looks up where on disk (in the index file) the list of results is and then pulls up that list as a page. Very little thinking, very little overhead. However, that means when it is indexing and wants to update the index for that word, it pulls up the list of results, inserts the new filename into the list and writes it back. For each word. You can imagine what happens for common words: it can be a very big list to read, update and rewrite.

Of course Baloo knows that this is slightly crazy and does not read and rewrite the lists for each word and file it indexes - it batches up the indexing into groups of 40 files, creates a transaction in memory for the 40, and commits. On top of that, the index is a memory-mapped file, which also cuts down on overhead (so the "reads" are mostly "find the page in memory" and the "writes" are "flag this page as dirty so it will be written back with the commit"). However, you can see here why Baloo is dependent on having enough RAM.
    
With the BTRFS bug you have a large index (all the results of all the files with their old IDs), and you are indexing the files afresh, pulling up the old lists, inserting the new IDs and writing the lists back. At some point the index will get too large for memory and performance will fall over the edge. The easy way of seeing whether you are heading for trouble is a baloosearch:
    $ baloosearch -i one-of-your-files.txt
the "-i" asks for the ID. You should only get a single hit but with the BTRFS bug you typically get several, the same files (as per the filepath) but different IDs. You won't see this now as you've deleted the index.

One more "sanity check" would be sensible for your PostmarketOS / BTRFS system; you would need the
    .local/share/baloo
folder that holds the index *not* to be copy on write.
Comment 12 tagwerk19 2024-09-10 05:22:24 UTC
(In reply to tagwerk19 from comment #11)
> ... you would need the
>     .local/share/baloo
> folder that holds the index *not* to be copy on write ...
Check by:

    $ lsattr .local/share/baloo/
    ---------------C------ .local/share/baloo/index
    ---------------C------ .local/share/baloo/index-lock

It would be nice to know if the reindexing works...
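
If the folder turns out to be copy on write (no "C" in the lsattr output), a rough sequence to recreate it with the attribute set, before letting Baloo rebuild the index, would be something like:

    $ balooctl disable
    $ rm -rf ~/.local/share/baloo
    $ mkdir -p ~/.local/share/baloo
    $ chattr +C ~/.local/share/baloo
    $ balooctl enable

(chattr +C only takes effect for files created after the attribute is set, hence setting it on the empty folder first.)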
Comment 13 Bug Janitor Service 2024-09-25 03:47:10 UTC
🐛🧹 ⚠️ This bug has been in NEEDSINFO status with no change for at least 15 days. Please provide the requested information, then set the bug status to REPORTED. If there is no change for at least 30 days, it will be automatically closed as RESOLVED WORKSFORME.

For more information about our bug triaging procedures, please read https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging.

Thank you for helping us make KDE software even better for everyone!
Comment 14 Bug Janitor Service 2024-10-10 03:47:43 UTC
🐛🧹 This bug has been in NEEDSINFO status with no change for at least 30 days. Closing as RESOLVED WORKSFORME.