Bug 373021 - Baloo DB rewritten for every write, introducing delays and SSD lifespan issues when DB size grows large
Status: CONFIRMED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: general
Version: 5.28.0
Platform: OpenMandriva Linux
Importance: VHI major
Target Milestone: ---
Assignee: Pinak Ahuja
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-28 11:48 UTC by marvin24
Modified: 2022-11-30 17:15 UTC

See Also:
Latest Commit:
Version Fixed In:


Description marvin24 2016-11-28 11:48:23 UTC
I have a rather large data partition to be indexed by baloo (~100k PDF files). I noticed that the indexing speed goes down the larger the db becomes. At the same time, IO goes up.

I recreated the db in a single partition and started indexing. About half of these files are indexed now (db size is 3.5 GB). The disk stats show 195916328 sectors written, which is about 1 TB (and yes, this is an old 128 MB ssd!).

It looks like every time baloo updates the db, the whole thing is written again, making it very slow... and dangerous regarding the wear leveling of flash disks.

I'm about to stop indexing because it behaves like an SSD killer. Even on old-style hard disks, the IO gets so heavy after some point that you have to stop it.
Comment 1 marvin24 2016-11-28 12:07:45 UTC
# cat /proc/diskstats | grep sdb5
   8      21 sdb5 17858242 8555586 366804000 348924376 418649 162409 198120256 53710360 0 4603528 402638072

sorry, that was 100 GB (not 1 TB). Still too much for my taste.
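
For reference, the conversion from the diskstats line above to a byte count can be sketched as follows. This assumes the field order documented for /proc/diskstats (sectors written is the seventh statistic after the device name) and that those sectors are counted in 512-byte units; the function name is only illustrative.

```python
SECTOR_BYTES = 512  # assumption: /proc/diskstats counts 512-byte sectors


def sectors_written(diskstats_line: str) -> int:
    """Return the 'sectors written' counter from one /proc/diskstats line."""
    fields = diskstats_line.split()
    # fields[0..2] are major, minor and device name;
    # fields[9] is the seventh statistic, i.e. sectors written.
    return int(fields[9])


# The line quoted in the comment above.
line = (
    "8 21 sdb5 17858242 8555586 366804000 348924376 "
    "418649 162409 198120256 53710360 0 4603528 402638072"
)

written_bytes = sectors_written(line) * SECTOR_BYTES
print(f"{written_bytes / 1e9:.1f} GB written")  # prints "101.4 GB written"
```

That matches the corrected figure of roughly 100 GB for a 3.5 GB database.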
Comment 2 Karmaqtrp 2021-05-08 16:51:31 UTC
Discpart | format
Comment 3 Karmaqtrp 2021-05-08 16:54:05 UTC
|Format|
Comment 4 Nate Graham 2021-05-08 17:53:30 UTC
Please stop spamming bugzilla tickets with this stuff.
Comment 5 tagwerk19 2021-08-02 13:12:00 UTC
(In reply to marvin24 from comment #0)

... Some time has passed

> I have a rather large data partition to be indexed by baloo (~100k pdf
> files). I noticed, that the index speed goes down the larger the db becomes.
> At the same time, IO goes up.
To me, that doesn't seem so surprising...

I also notice that when "reindexing" files that have been edited or just "touched", the indexing speed drops.

> I recreated the db in a single partition and started indexing. About half of
> these files are index now (db size is 3.5 GB). The disk stats says 195916328
> sectors written, which is about 1 TB (and yes, this is an old 128 MB ssd!).
> ...
> sorry, that was 100 GB (not 1 TB). Still too much for my taste.
If you are indexing content, there'll be an update for each word, and a small change to a "page" means that the whole page is written back to the database. That's going to add up...

> It looks like, everytime baloo updates the db, the whole thing is written
> again - making it very slow... and dangerous regarding the wear leveling of
> flash disks.
There was a change (2019/09) to avoid syncing every write to the database:

    https://bugs.kde.org/show_bug.cgi?id=404057#c12

So maybe there has been an improvement.

I think there is also awareness of the problem. Baloo batches up its content indexing; it reads and indexes 40 files in one go. This is done as one transaction, so the changes for the 40 files are sorted in memory and then committed.

Pages for "common terms" will be repeatedly updated/rewritten, so I think you can easily expect more to be written to the disk than the static size of the database.
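
To illustrate why batching matters, here is a toy model, not Baloo's actual code or on-disk format. It assumes each file touches a set of database pages and that a commit writes every page dirtied in that batch exactly once. The workload (400 files, 1000 pages, low-numbered pages standing in for "common terms") is entirely made up.

```python
import random


def pages_written(files, batch_size):
    """Count page writes if each commit writes every dirty page once."""
    total = 0
    for start in range(0, len(files), batch_size):
        dirty = set()
        for pages in files[start:start + batch_size]:
            dirty |= pages  # a page touched twice in a batch is written once
        total += len(dirty)
    return total


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical workload: each file touches up to 50 pages, skewed towards
# low page numbers so that some pages are shared by most files.
files = [
    {min(int(rng.expovariate(1 / 100)), 999) for _ in range(50)}
    for _ in range(400)
]

for batch_size in (1, 40, 400):
    print(batch_size, pages_written(files, batch_size))
```

In this model a larger batch can never write more pages than a smaller one that subdivides it, because pages shared between files in the same batch are only written once per commit. The price is holding more pending changes in memory.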

However, watching the writes with iotop (which can show accumulated writes for a process), there can be a frightening amount written. I'm guessing that increasing the batch size would help, using more RAM but reducing the number of commits. It seems to be something of a balance.
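
If iotop is not at hand, the accumulated writes of a process can also be read from /proc/<pid>/io on Linux. A minimal sketch, assuming the usual "name: value" layout of that file; the sample text and its numbers are illustrative only and are not measurements from Baloo.

```python
def parse_proc_io(text: str) -> dict:
    """Parse the 'name: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            counters[key.strip()] = int(value)
    return counters


# Illustrative sample only; on a live system, read the real file instead, e.g.
#   text = open(f"/proc/{pid}/io").read()   # pid of the indexer process
sample = """\
rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 0
"""

counters = parse_proc_io(sample)
print(f"{counters['write_bytes'] / 1e6:.1f} MB sent to the storage layer")
```

Sampling write_bytes before and after a batch of files is indexed would give the amount written for that batch, which can then be compared with the growth of the database file.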