I have a rather large data partition to be indexed by baloo (~100k pdf files). I noticed that the indexing speed goes down the larger the db becomes. At the same time, IO goes up. I recreated the db on a single partition and started indexing. About half of these files are indexed now (db size is 3.5 GB). The disk stats say 195916328 sectors written, which is about 1 TB (and yes, this is an old 128 GB ssd!). It looks like every time baloo updates the db, the whole thing is written again, making it very slow... and dangerous regarding the wear leveling of flash disks. I'm about to stop indexing because it behaves like an ssd killer. Even for old-style hard disks, the IO gets so heavy after some point that you have to stop it.
# cat /proc/diskstats | grep sdb5
   8      21 sdb5 17858242 8555586 366804000 348924376 418649 162409 198120256 53710360 0 4603528 402638072

Sorry, that was 100 GB (not 1 TB). Still too much for my taste.
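For reference, the sectors-written counter in /proc/diskstats (the 7th field after the device name) is always in 512-byte units regardless of the physical sector size, so the conversion is:

  198120256 sectors * 512 bytes/sector = 101,437,571,072 bytes ≈ 101 GB

which matches the ~100 GB figure above.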
(In reply to marvin24 from comment #0)
... Some time has passed

> I have a rather large data partition to be indexed by baloo (~100k pdf
> files). I noticed that the indexing speed goes down the larger the db
> becomes. At the same time, IO goes up.

To me, that doesn't seem so surprising... I also notice that the indexing speed drops when "reindexing" files that have been edited or just "touched".

> I recreated the db on a single partition and started indexing. About half of
> these files are indexed now (db size is 3.5 GB). The disk stats say 195916328
> sectors written, which is about 1 TB (and yes, this is an old 128 GB ssd!).
> ...
> Sorry, that was 100 GB (not 1 TB). Still too much for my taste.

If you are indexing content, then there'll be an update for each word, and a small change to a "page" means that the whole page is written back to the database. That's going to add up...

> It looks like every time baloo updates the db, the whole thing is written
> again, making it very slow... and dangerous regarding the wear leveling of
> flash disks.

There was a change (2019/09) to avoid syncing every write to the database:
https://bugs.kde.org/show_bug.cgi?id=404057#c12
So maybe there has been an improvement. I think there is also awareness of the problem.

Baloo batches up its content indexing; it reads and indexes 40 files in one go. This is done as one transaction, so the changes for the 40 files are sorted in memory and then committed. Pages holding "common terms" will be repeatedly updated/rewritten, so you can easily expect more to be written to the disk than the static size of the database. However, watching the writes with iotop (which can show accumulated writes per process), there can be a frightening amount written.

I'm guessing that increasing the batch size would help: using more RAM and reducing the number of commits. It seems to be something of a balance.
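Just to illustrate the batching/commit trade-off described above: Baloo keeps its index in LMDB, and grouping many puts into one write transaction means the dirty B+tree pages are written out once per commit rather than once per file. This is only a made-up sketch against the plain LMDB C API (the key/value layout and paths here are invented, not Baloo's actual schema, and error checks are omitted for brevity):

#include <lmdb.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    const int BATCH_SIZE = 40;   /* files per transaction, as described above */

    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 1UL << 30);             /* 1 GiB map */
    /* MDB_NOSYNC skips the fsync on every commit (cf. the 2019/09 change
     * linked above); the OS flushes dirty pages lazily instead. */
    mdb_env_open(env, "./testdb", MDB_NOSYNC, 0664); /* directory must exist */

    for (int batch = 0; batch < 3; batch++) {
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        for (int i = 0; i < BATCH_SIZE; i++) {
            char key[32], val[64];
            int klen = snprintf(key, sizeof key, "doc:%d:%d", batch, i);
            int vlen = snprintf(val, sizeof val, "extracted terms for file %d", i);
            MDB_val k = { (size_t)klen, key };
            MDB_val v = { (size_t)vlen, val };
            mdb_put(txn, dbi, &k, &v, 0);  /* stays in dirty pages, no I/O yet */
        }
        mdb_txn_commit(txn);  /* one write of all dirty pages per 40 "files" */
    }
    mdb_env_close(env);
    return 0;
}

Since every commit also rewrites the path from each dirty leaf up to the root, pages holding common terms get rewritten on nearly every batch; fewer, larger transactions trade RAM for fewer page rewrites, which is the balance mentioned above.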
Is this still an issue for you? We can keep the report open if you feel that batching writes (committing once every 40 files) is not sufficient. SSDs are more robust now and also larger, which helps spread the wear.

I'll change the status to "Waiting for Info" now; feel free to set it back to "Confirmed".
Dear Bug Submitter,

This bug has been in NEEDSINFO status with no change for at least 15 days. Please provide the requested information as soon as possible and set the bug status as REPORTED. Due to regular bug tracker maintenance, if the bug is still in NEEDSINFO status with no change in 30 days, the bug will be closed as RESOLVED > WORKSFORME due to lack of needed information.

For more information about our bug triaging procedures, please read the wiki located here:
https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging

If you have already provided the requested information, please mark the bug as REPORTED so that the KDE team knows that the bug is ready to be confirmed.

Thank you for helping us make KDE software even better for everyone!
🐛🧹 This bug has been in NEEDSINFO status with no change for at least 30 days. Closing as RESOLVED WORKSFORME.