Bug 373021

Summary: Baloo DB rewritten for every write, introducing delays and SSD lifespan issues when DB size grows large
Product: [Frameworks and Libraries] frameworks-baloo Reporter: marvin24
Component: general Assignee: Pinak Ahuja <pinak.ahuja>
Status: RESOLVED WORKSFORME    
Severity: major CC: jimtahu, karl, nate, tagwerk19, vini.ipsmaker
Priority: VHI    
Version: 5.28.0   
Target Milestone: ---   
Platform: OpenMandriva   
OS: Linux   

Description marvin24 2016-11-28 11:48:23 UTC
I have a rather large data partition to be indexed by baloo (~100k PDF files). I noticed that the indexing speed goes down the larger the db becomes. At the same time, IO goes up.

I recreated the db in a single partition and started indexing. About half of these files are indexed now (db size is 3.5 GB). The disk stats say 195916328 sectors written, which is about 1 TB (and yes, this is an old 128 MB ssd!).

It looks like every time baloo updates the db, the whole thing is written again, making it very slow... and dangerous regarding the wear leveling of flash disks.

I'm about to stop indexing because it behaves like an SSD killer. Even for old-style hard disks, the IO gets so heavy after some point that you have to stop it.
Comment 1 marvin24 2016-11-28 12:07:45 UTC
# cat /proc/diskstats | grep sdb5
   8      21 sdb5 17858242 8555586 366804000 348924376 418649 162409 198120256 53710360 0 4603528 402638072

sorry, that was 100 GB (not 1 TB). Still too much for my taste.
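
For reference, the conversion from the diskstats counter to bytes can be sketched as below. This is only an illustration: it assumes the standard /proc/diskstats field order (the seventh I/O counter after the device name is "sectors written") and that sectors are counted in 512-byte units. The helper name `sectors_written` is made up for this example, and the counter covers everything written to the partition, not just baloo's writes.

```python
# /proc/diskstats counts sectors in 512-byte units, independent of the
# device's physical sector size.
SECTOR_BYTES = 512

def sectors_written(diskstats_line: str) -> int:
    """Return the 'sectors written' counter from one /proc/diskstats line."""
    fields = diskstats_line.split()
    # fields[0..2] are major, minor and device name; the I/O counters follow.
    # "Sectors written" is the 7th counter, so it sits at index 2 + 7 = 9.
    return int(fields[9])

# The line quoted in this comment.
line = ("8      21 sdb5 17858242 8555586 366804000 348924376 "
        "418649 162409 198120256 53710360 0 4603528 402638072")

written_bytes = sectors_written(line) * SECTOR_BYTES
print(f"{written_bytes / 1e9:.1f} GB written")               # 101.4 GB written
print(f"{written_bytes / 3.5e9:.0f}x the 3.5 GB database")   # 29x the 3.5 GB database
```

So the partition has seen roughly 29 times the current database size in writes.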
Comment 2 Karmaqtrp 2021-05-08 16:51:31 UTC
Discpart | format
Comment 3 Karmaqtrp 2021-05-08 16:54:05 UTC
|Format|
Comment 4 Nate Graham 2021-05-08 17:53:30 UTC
Please stop spamming bugzilla tickets with this stuff.
Comment 5 tagwerk19 2021-08-02 13:12:00 UTC
(In reply to marvin24 from comment #0)

... Some time has passed

> I have a rather large data partition to be indexed by baloo (~100k PDF
> files). I noticed that the indexing speed goes down the larger the db becomes.
> At the same time, IO goes up.
To me, that doesn't seem so surprising...

I also notice that when "reindexing" files (files that have been edited or just "touched"), the indexing speed drops.

> I recreated the db in a single partition and started indexing. About half of
> these files are indexed now (db size is 3.5 GB). The disk stats say 195916328
> sectors written, which is about 1 TB (and yes, this is an old 128 MB ssd!).
> ...
> sorry, that was 100 GB (not 1 TB). Still too much for my taste.
If you are indexing content, then there'll be an update for each word, and a small change to a "page" means that the whole page is written back to the database. That's going to add up...
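
A rough back-of-the-envelope sketch of why page-granular writes add up. The numbers are assumptions for illustration only (a 4096-byte page and 8 bytes per added entry), not measured Baloo values, and `amplification` is a made-up helper:

```python
PAGE_SIZE = 4096   # assumed page size in bytes
ENTRY_SIZE = 8     # assumed size of one added entry (e.g. a document id)

def amplification(entries_added: int, pages_touched: int) -> float:
    """Ratio of bytes written (whole pages) to bytes logically added."""
    logical_bytes = entries_added * ENTRY_SIZE
    physical_bytes = pages_touched * PAGE_SIZE
    return physical_bytes / logical_bytes

# Worst case: 1000 new entries, each landing on a different page.
print(amplification(1000, 1000))  # 512.0
# Best case: the same 1000 entries packed into just 2 pages.
print(amplification(1000, 2))     # 1.024
```

Real indexing sits somewhere between the two cases, depending on how many different pages the words of a file are spread over.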

> It looks like every time baloo updates the db, the whole thing is written
> again, making it very slow... and dangerous regarding the wear leveling of
> flash disks.
There was a change (2019/09) to avoid syncing every write to the database:

    https://bugs.kde.org/show_bug.cgi?id=404057#c12

So maybe there has been an improvement.

I think there is also awareness of the problem. Baloo batches up its content indexing; it reads and indexes 40 files in one go. This is done as one transaction, so the changes for the 40 files are sorted in memory and then committed.

Pages for "common terms" will be repeatedly updated/rewritten and I think you might easily expect more to be written to the disc than the static size of the database.

However, watching the writes with iotop (which can show accumulated writes for a process), there can be a frightening amount written. I'm guessing increasing the batch size would help, using more RAM but reducing the number of commits. It seems to be something of a balance.
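
To illustrate the batch-size trade-off, here is a toy model, not Baloo's actual code. It assumes each term lives on one fixed page, that a page dirtied within a batch is written once at commit, and that low term ids stand in for "common terms". All constants and function names are invented for the example.

```python
import random

TERMS_PER_PAGE = 50    # assumed: how many terms share one database page
VOCABULARY = 20_000    # assumed vocabulary size
WORDS_PER_FILE = 200   # assumed number of words indexed per file

def make_files(n_files, seed=0):
    """Each 'file' is a list of term ids; low ids stand in for common terms."""
    rng = random.Random(seed)
    files = []
    for _ in range(n_files):
        # Squaring a uniform sample skews the draw towards low term ids.
        terms = [int(rng.random() ** 2 * VOCABULARY) for _ in range(WORDS_PER_FILE)]
        files.append(terms)
    return files

def pages_written(files, batch_size):
    """Total pages written when each batch of files is one transaction."""
    total = 0
    for start in range(0, len(files), batch_size):
        dirty = set()
        for terms in files[start:start + batch_size]:
            dirty.update(term // TERMS_PER_PAGE for term in terms)
        total += len(dirty)  # each dirty page is written once per commit
    return total

files = make_files(2000)
for batch_size in (1, 40, 400):
    print(batch_size, pages_written(files, batch_size))
```

In this model a larger batch never writes more pages than a smaller one that divides it evenly, because pages shared between files in the same batch are only written once. The cost is that the dirty set held in memory grows with the batch.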
Comment 6 tagwerk19 2024-07-04 07:40:38 UTC
Is this still an issue for you?

We can keep the call open if you feel that "batching" writes so that they are committed every 40 files is not sufficient. SSDs are more robust now and also larger, which helps spread the wear.

I'll change the status to "Waiting for Info" now; feel free to put it back to "Confirmed".
Comment 7 Bug Janitor Service 2024-07-19 03:46:33 UTC
Dear Bug Submitter,

This bug has been in NEEDSINFO status with no change for at least
15 days. Please provide the requested information as soon as
possible and set the bug status as REPORTED. Due to regular bug
tracker maintenance, if the bug is still in NEEDSINFO status with
no change in 30 days the bug will be closed as RESOLVED > WORKSFORME
due to lack of needed information.

For more information about our bug triaging procedures please read the
wiki located here:
https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging

If you have already provided the requested information, please
mark the bug as REPORTED so that the KDE team knows that the bug is
ready to be confirmed.

Thank you for helping us make KDE software even better for everyone!
Comment 8 Bug Janitor Service 2024-08-03 03:46:27 UTC
๐Ÿ›๐Ÿงน This bug has been in NEEDSINFO status with no change for at least 30 days. Closing as RESOLVED WORKSFORME.