I have a rather large data partition to be indexed by baloo (~100k pdf files). I noticed that the indexing speed goes down the larger the db becomes. At the same time, IO goes up. I recreated the db on a single partition and started indexing. About half of these files are indexed now (db size is 3.5 GB). The disk stats say 195916328 sectors written, which is about 1 TB (and yes, this is an old 128 GB ssd!). It looks like every time baloo updates the db, the whole thing is written again - making it very slow... and dangerous regarding the wear leveling of flash disks. I'm about to stop indexing because it behaves like an ssd killer. Even for old-style hard disks, the IO becomes so heavy after some point that you have to stop it.
# cat /proc/diskstats | grep sdb5
8 21 sdb5 17858242 8555586 366804000 348924376 418649 162409 198120256 53710360 0 4603528 402638072

sorry, that was 100 GB (not 1 TB). Still too much for my taste.
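For reference, the sector counts in /proc/diskstats are always in fixed 512-byte units, so the "sectors written" column (field 10) converts to bytes directly. A small sketch, using the sdb5 line above as sample input:

```shell
# /proc/diskstats counts sectors in fixed 512-byte units, independent of the
# device's physical sector size. Field 10 is "sectors written".
# Sample input: the sdb5 line quoted above.
line="8 21 sdb5 17858242 8555586 366804000 348924376 418649 162409 198120256 53710360 0 4603528 402638072"
echo "$line" | awk '{ printf "%s: %.1f GB written\n", $3, $10 * 512 / 1e9 }'
# On a live system, read the same field with:
#   awk '$3 == "sdb5" { printf "%.1f GB written\n", $10 * 512 / 1e9 }' /proc/diskstats
```

which comes out at roughly the 100 GB figure.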
(In reply to marvin24 from comment #0)

Some time has passed...

> I have a rather large data partition to be indexed by baloo (~100k pdf
> files). I noticed, that the index speed goes down the larger the db becomes.
> At the same time, IO goes up.

To me, that doesn't seem so surprising... I also notice that the indexing speed drops when "reindexing" files that have been edited or just "touched".

> I recreated the db in a single partition and started indexing. About half of
> these files are index now (db size is 3.5 GB). The disk stats says 195916328
> sectors written, which is about 1 TB (and yes, this is an old 128 MB ssd!).
> ...
> sorry, that was 100 GB (not 1 TB). Still too much for my taste.

If you are indexing content, then there is an update for each word, and a small change to a "page" means that the whole page is written back to the database. That is going to add up...

> It looks like, everytime baloo updates the db, the whole thing is written
> again - making it very slow... and dangerous regarding the wear leveling of
> flash disks.

There was a change (2019/09) to avoid syncing every write to the database:
https://bugs.kde.org/show_bug.cgi?id=404057#c12
So maybe there has been an improvement. I think there is also awareness of the problem.

Baloo batches up its content indexing; it reads and indexes 40 files in one go. This is done as one transaction, so the changes for the 40 files are sorted in memory and then committed. Pages for "common terms" will be repeatedly updated/rewritten, so I think you might easily expect more to be written to the disk than the static size of the database. However, watching the writes with iotop (which can show accumulated writes for a process), there can be a frightening amount written.

I'm guessing that increasing the batch size would help: using more RAM and reducing the number of commits. It seems to be something of a balance.
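Besides iotop, the kernel exposes cumulative per-process I/O counters, so the indexer's accumulated writes can be checked directly. A minimal sketch, assuming the content indexer is running as a process named baloo_file_extractor:

```shell
# Each process reports cumulative I/O counters in /proc/<pid>/io.
# write_bytes is the number of bytes the process caused to be sent to the
# block layer; cancelled_write_bytes counts writes that were later discarded.
pid=$(pidof -s baloo_file_extractor) || exit 0   # indexer not currently running
grep -E '^(write_bytes|cancelled_write_bytes)' "/proc/$pid/io"
```

Sampling this a few times during a content-indexing run would show how much the counters grow per committed batch, which is the write amplification being discussed here.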