As suggested in https://bugs.kde.org/show_bug.cgi?id=333655#c73 , let's open a new bug for Baloo 5: I'm running baloo 5.45.0 on openSUSE Leap 15, and notice that my complete desktop freezes regularly for 1-2 minutes(!). The CPU monitor reports 100% load on both cores during that time, but top does not show any process with a considerable CPU load. The problem seems to be the combination of baloo and akonadi, as iotop shows:

Total DISK READ :   10.37 M/s | Total DISK WRITE :  1060.53 K/s
Actual DISK READ:   10.37 M/s | Actual DISK WRITE:   197.36 K/s
  TID PRIO  USER   DISK READ   DISK WRITE  SWAPIN    IO>     COMMAND
 4497 idle  axel     9.73 M/s    0.00 B/s  0.00 %  99.52 %  baloo_file_extractor
 2847 idle  axel   651.54 K/s 1058.15 K/s  0.00 %  97.97 %  akonadi_indexing_agent --identifier akonadi_indexing_agent
   23 be/4  root     0.00 B/s    0.00 B/s  0.00 %   0.10 %  [kworker/1:1]
 2479 be/4  axel     0.00 B/s    0.00 B/s  0.00 %   0.08 %  plasmashell
  849 be/4  root     4.76 K/s    0.00 B/s  0.00 %   0.00 %  [xfsaild/sda2]

(interesting percentage calculation by iotop, by the way)

The system disk is an SSD, the data disk is a hybrid 1 TB disk with an 8 GB cache. I have configured the search to not index file content. That's why the heavy I/O surprises me even more.
Unfortunately even the two most fundamental databases in baloo, the Terms and the FileNameTerms DBs, show O(M^2) behaviour on updates. Every time a file matching e.g. the term "pdf" is changed, the associated value for the "pdf" term (i.e. the IDs of all matching documents) is updated.

An update may happen in two cases:
1. an existing file is appended to, tagged, renamed, ...
2. an existing file is replaced by an updated one (i.e. the application creates a temporary file on saving and atomically replaces the old one).

For (1), the update can be minimized, i.e. only the terms which have actually changed are updated. I have some experimental patches for this. For (2), the database schema has to be changed significantly.
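A minimal sketch of why a single change is expensive with this layout (purely illustrative, simplified from the real LMDB-backed databases; all names here are hypothetical):

#include <QByteArray>
#include <QList>
#include <QMap>
#include <QVector>
#include <algorithm>

// Simplified model of the Terms/PostingDB layout: one key per term,
// the value is the sorted list of *all* document IDs containing that term.
using PostingList = QVector<quint64>;
QMap<QByteArray, PostingList> termsDb;

// (Re)indexing one document touches every term it contains. Each touched
// value has to be read, modified and written back in full, so the work per
// term grows with the number of documents sharing that term.
void indexDocument(quint64 docId, const QList<QByteArray> &terms)
{
    for (const QByteArray &term : terms) {
        PostingList list = termsDb.value(term);          // read whole list, O(M)
        auto it = std::lower_bound(list.begin(), list.end(), docId);
        if (it == list.end() || *it != docId)
            list.insert(it, docId);
        termsDb.insert(term, list);                      // rewrite whole list, O(M)
    }
}

For a common term like "pdf", that list covers a large fraction of the index, which is roughly where the "single small change causes a huge DB update" behaviour described above comes from.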
Thanks for your explanation, Stefan. Although I don't know how I can influence the behaviour. If I start the computer the next day, I would not expect heavy re-indexing. Are - by default - the database stores for akonadi (~/.local/share/akonadi) excluded from baloo indexing?
I came in to report the same problem. The system frequently freezes, with the mouse not moving for a couple of seconds, or the screen not being refreshed. Regardless of what is causing the high I/O usage within baloo and akonadi, I consider them background tasks (most of the time), and I would like to see them prioritized as such. Could baloorunner be run with the equivalent of ionice -c 3 by default? (and maybe nice as well). My CPU is quite beefy, but I suffer from I/O contention:

Arch Linux
Ryzen 7 2700X
8 GiB DDR4 2666
4 TiB HDD system drive (WDC WD40EZRZ)

I will probably upgrade to an SSD at some point, but this is no excuse for a background task to consume all of the available disk I/O bandwidth ;)
(In reply to Mayeul Cantan from comment #3)
> Could baloorunner be run with the equivalent of ionice -c 3 by default? (and
> maybe nice as well). My CPU is quite beefy, but I suffer from I/O contention:

baloo_file/baloo_file_extractor, which are the indexing tasks (i.e. the ones causing the write accesses), are already running with the lowest priority. baloorunner is not relevant here.

Even with low priority, the kernel eventually has to flush the write buffers, causing the high I/O latency for other tasks.
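For reference, the in-process equivalent of `ionice -c 3` on Linux looks roughly like the sketch below. This is a generic illustration, not a copy of Baloo's actual code:

#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>

// Constants from the Linux ioprio interface (see linux/ioprio.h).
static const int IOPRIO_WHO_PROCESS = 1;
static const int IOPRIO_CLASS_IDLE  = 3;
static const int IOPRIO_CLASS_SHIFT = 13;

int main()
{
    // Put the calling process into the "idle" I/O scheduling class,
    // i.e. it only gets disk time when no other process wants it.
    const int ioprio = IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT;
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio) != 0)
        perror("ioprio_set");
    return 0;
}

Note that the idle class only takes effect with I/O schedulers that honour it, and it does nothing about the eventual writeback of dirty pages, which is the latency source described above.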
(In reply to Stefan Brüns from comment #4)
> Even with low priority, the kernel eventually has to flush the write
> buffers, causing the high I/O latency for other tasks.

Should the I/O traffic from higher-prioritized tasks not be processed first? I mean, if baloo does not get any CPU time, how can it create such high traffic? Looking at iotop, it is mostly a factor of 100 to 1000 higher than for other tasks...
(In reply to Axel Braun from comment #5)
> (In reply to Stefan Brüns from comment #4)
> 
> > Even with low priority, the kernel eventually has to flush the write
> > buffers, causing the high I/O latency for other tasks.
> 
> Should the I/O traffic from higher-prioritized tasks not be processed first?
> I mean, if baloo does not get any CPU time, how can it create such
> high traffic? Looking at iotop, it is mostly a factor of 100 to 1000 higher
> than for other tasks...

From this link, it seems to be the case (though a link to the kernel source would have been nicer):
https://unix.stackexchange.com/questions/153505/how-disk-io-priority-is-related-with-process-priority
> io_priority = (cpu_nice + 20) / 5

In my case, though, it was always baloorunner showing at 99.99 % I/O in iotop. baloo_file_extractor would also run sometimes, but with a lesser subjective impact on performance. Setting baloorunner to a lower priority using ionice seemed to improve things quite a bit, although I would have to confirm it.

I get the point about needing to flush the cache at some point. Unfortunately, I am at a loss as to why my mouse freezes because of it. I am on an 8-core (16 threads with SMT) CPU, and only a couple are used by the kernel. CPU <-> RAM bandwidth should not be the limiting factor, and other threads should be able to go through when CPU <-> SATA controller is being waited on. Maybe it has to do with interrupts coming in from the SATA controller?
Same problem with baloo 5.52.0 (on Artix Linux). The GUI is almost completely unresponsive. Switching to a text console and back updates the screen, but it mostly stays frozen. Sometimes clicking to switch between applications updates things when I click, but otherwise it stays frozen. iotop shows baloo_file_extractor and one [kworker...] job at 99.99% (sometimes alternating with a lower value still above 50%). Systemsettings/search does not have any setting to turn indexing off, although no plugin is checked. balooctl does seem to show everything disabled and stopped, so I have no idea why. For me, this seems to have started relatively recently, but it's on a laptop I don't use constantly, so I'm really not sure which update triggered it. Is there anything else I can check, or any other data I can provide? It makes the laptop essentially unusable. (I'm posting this from a different PC (Gentoo), although baloo here is 5.50.0 - I'll try updating.)
After several reboots, I finally had systemsettings5 show me File Search, and turning that off, plus another reboot, seems to have stopped the indexer from running. The odd thing was that despite earlier doing balooctl suspend, balooctl stop, and balooctl disable, and balooctl showing disabled, it was still running. Not really sure what finally stopped it. Hopefully it won't just start up again by itself.
*** Bug 400932 has been marked as a duplicate of this bug. ***
*** Bug 401279 has been marked as a duplicate of this bug. ***
*** Bug 384234 has been marked as a duplicate of this bug. ***
*** Bug 379011 has been marked as a duplicate of this bug. ***
*** Bug 376446 has been marked as a duplicate of this bug. ***
There's a proposed patch in Bug 356357 that sparked a serious discussion about the frequency with which the DB should be written to, but unfortunately it went nowhere.
*** Bug 359119 has been marked as a duplicate of this bug. ***
*** Bug 393465 has been marked as a duplicate of this bug. ***
Since I'm not using Plasma right now I'm unsubscribing from this bug, but feel free to re-subscribe me if you need any help from me.
(In reply to Nate Graham from comment #14)
> There's a proposed patch in Bug 356357 that sparked a serious discussion
> about the frequency with which the DB should be written to, but
> unfortunately it went nowhere.

I am still suffering from this problem. Yesterday Nextcloud decided to refresh my files and downloaded about 10 GB of files. Baloo started indexing and my desktop stalled. Chrome can't start and I can do no work!!!!

I do hope we can get a solution soon - this is a long-standing problem. Finding things with baloo saves me time... but not as much as I am losing whilst waiting for the indexer!!!!!

Please can we have a solution - I like the idea of throttling database updates - perhaps some sort of exponential back-off approach, but inverted, so that a high number of files indexed per minute changes the updates to batches of 80, 160, 320 ... up to some limit?
An exponential backoff would only help if baloo indexed the same files recurrently.

If you add new documents to your indexed folders, baloo will process these. It will not get better when you commit changesets of double the size; the stalls will be even longer.

This is *not* a trivial problem which can be solved by adjusting a single knob.

Baloo's data structures currently impose a changeset size which is approximately proportional to the size of the database. Adding/changing a single small document can cause a DB update of several hundred MBytes.
(In reply to Stefan Brüns from comment #19)
> An exponential backoff would only help if baloo indexed the same files
> recurrently.
> 
> If you add new documents to your indexed folders, baloo will process these.
> It will not get better when you commit changesets of double the size; the
> stalls will be even longer.
> 
> This is *not* a trivial problem which can be solved by adjusting a single
> knob.
> 
> Baloo's data structures currently impose a changeset size which is
> approximately proportional to the size of the database. Adding/changing a
> single small document can cause a DB update of several hundred MBytes.

Thanks for the prompt feedback. Currently I have to do a manual exponential backoff of switching off baloo and turning it on overnight to do its indexing!!!

Given that a "single small document can cause a DB update of several hundred MBytes", might there need to be a fresh look at the underlying data structure? That seems sub-optimal to me as a user who is struggling with the indexing process's unintended side-effects.
It would save a lot of developer time if not everyone added their "me too" comments.

Changes to the database are planned, but this is not trivial. One structure may work well for a number of cases and cause huge problems for others. These changes have to be evaluated, for performance and for correctness.

The baloo codebase has been enhanced with additional unit tests recently, increasing code coverage and reducing the chance of regressions. This is an ongoing effort, likely taking several more months until completed.

Baloo is currently developed mostly by volunteers doing it in their spare time. Development will not go faster by adding some more exclamation marks ...
(In reply to Stefan Brüns from comment #21)
> It would save a lot of developer time if not everyone added their "me
> too" comments.
> 
> Changes to the database are planned, but this is not trivial. One structure
> may work well for a number of cases and cause huge problems for others.
> These changes have to be evaluated, for performance and for correctness.
> 
> The baloo codebase has been enhanced with additional unit tests recently,
> increasing code coverage and reducing the chance of regressions. This is an
> ongoing effort, likely taking several more months until completed.
> 
> Baloo is currently developed mostly by volunteers doing it in their spare
> time. Development will not go faster by adding some more exclamation marks
> ...

Dear Developers,

I am supremely grateful for all the work and effort that has gone into the indexing services for KDE. If I had the skills I would join you. I just glanced at the Git repo and realised how unskilled I am to contribute; I couldn't even find the schema. Baloo has improved greatly.

However, I do wish to say: please don't discourage well-intentioned feedback. Without feedback from users about the actual problems they encounter, future priorities may not be as readily identified. As a long-term KDE user, enthusiast and advocate, feedback is one of my most important contributions.

This thread follows from https://bugs.kde.org/show_bug.cgi?id=333655#c73 which was started in 2014. I am only making my first comment now. The performance issues have been a problem for me all this time, and I went for a long season with baloo permanently off!

Do let me know if there is anything concrete I can contribute beyond what I offer in these comments.
Currently, after system start and sometimes during work, baloo grabs one CPU at 100% for quite a while, eats up to 13 GB of RAM and makes the system quite unresponsive (i7 with 4 cores + HT, 20 GB RAM, SSD only) while running. This looks more like a complete reindexing of everything on the system and not related to the amount of changed files. And I don't see how I could find out what exactly baloo is working on by means of balooctl. I use Fedora 29 with standard schedulers.

baloo taking 100% of a CPU does not seem reasonable to me. Also taking up to 13 GB of RAM does not seem reasonable to me. During the last long run, baloo status stated 130470/133866 files indexed and a current index size of 15,61 GiB. There was no change of more than 3000 large files since the last baloo 100% CPU run. Indexing a new 185 MB git clone should be done in a very few minutes at most.

Further, the numbers in indexSize look strange to me:

Actual Size: 15,61 GiB
Expected Size: 9,16 GiB

           PostingDB:       1,40 GiB   120.956 %
          PositionDB:     133,54 MiB    11.266 %
            DocTerms:     877,32 MiB    74.014 %
    DocFilenameTerms:      13,61 MiB     1.148 %
       DocXattrTerms:           0 B      0.000 %
              IdTree:       2,52 MiB     0.213 %
          IdFileName:      10,07 MiB     0.850 %
             DocTime:       5,64 MiB     0.476 %
             DocData:       6,92 MiB     0.584 %
   ContentIndexingDB:           0 B      0.000 %
         FailedIdsDB:           0 B      0.000 %
             MTimeDB:       2,18 MiB     0.184 %

Why is the expected size only 2/3 of the actual size?
And why don't the DB sizes sum up to the actual size?
And what does 120% really mean?
(In reply to richard from comment #23)
> Why is the expected size only 2/3 of the actual size?
> And why don't the DB sizes sum up to the actual size?
> And what does 120% really mean?

Re actual size:
https://cgit.kde.org/baloo.git/commit/?id=f8c51b23796523f9b2d9d1582c7fb874181fbf2f

Re 120%:
https://cgit.kde.org/baloo.git/commit/?id=7be886c93d13191c6ebdf72669f657cbbf45c2c7
==============================
Dear Users,

the issue described in this bug report is well understood. Solving the problem requires significant changes to the database schema. Before making these changes we have to be sure not to regress other use cases.

Screening bugs takes time, time better spent working on this problem and solving other issues at hand. Please refrain from adding additional comments here!

Kind regards,

Stefan
==============================
Hi,

the coming change from "actual" to "file size" is good, as is the change from "expected" to "used". The link concerning 120% is not really clear to me; some sum was changed, but the output still remains unexplained - 120% of what?

I understand that you identified the database layout as *the* problem. From my point of view I'd see the CPU (-> I/O) greed as a problem. In another thread there was the statement that it works better / blocks the system less with other schedulers. It really would be OK, at least for me, if the indexing were done silently in the background and not as fast as possible, blocking the system. That looks independent of DB schema changes to me.

At the moment I work with ML datasets -> archive files, but below GB size. It looks as if baloo_file_extractor is the process to blame. Currently:

  PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 **** user  39  19  259,7g  13,6g  10,3g R  97,0 70,5 355:49.30 baloo_file_extr

I let it run the whole night to get through its work - not ready yet. It is still freezing the system (even the mouse) quite often. As written, I'd be OK with these files not being indexed as fast as possible. And I can't really understand how it can take baloo_file_extractor 7 hours of fast CPU time to index archives of less than a GB. That also looks independent of DB schema changes to me.

You didn't answer the point about whether one can see what baloo is currently working on. That could help a) for debugging and b) for adjusting/excluding directories/files from indexing.

It would also help to have options to tune indexing, like:
- don't index when (allow combinations)
  -- the filetype matches
  -- the size is smaller/bigger than some threshold
  -- it has been at a specific disk location for more/less than some timespan
  -- it was created/modified more/less than some timespan ago
and a monitor command that allows one to see the freeze causes in real time, and a log that allows one to see the freeze causes (files) later (log start/stop of indexing per file; when indexing a file takes more than ___ minutes between start and stop, when indexing has taken more than ___ minutes since start, ...).

This looks independent of DB schema changes to me, and being able to tune baloo to simply not do some things it has problems with would help to optimize usability until the big rewrite is done. It's a question of how priorities are set.
Confirmed on openSUSE Tumbleweed with Frameworks 5.64. Everything freezes for about 30 seconds, works for 30 seconds, then freezes again. The only solution is to turn Baloo completely off. This is also a considerable problem when only files, not their contents, are being indexed.

Operating System: openSUSE Tumbleweed 20191124
KDE Plasma Version: 5.17.3
KDE Frameworks Version: 5.64.0
Qt Version: 5.13.1
Kernel Version: 5.3.12-1-default
OS Type: 64-bit
Processors: 4 × Intel® Core™ i5-4210U CPU @ 1.70GHz
Memory: 11,6 GiB
This happens both when Baloo is indexing file *contents* and when it's just indexing info on the files. I have to turn indexing completely off; if not, my computer becomes practically unusable upon every power-on. This is new, and I've never had trouble with Baloo on this laptop before. It's admittedly a 4-year-old computer, but the SSD is fast and the laptop otherwise handles most of my workflow completely fine.
Same problem here - Baloo eats up all I/O even when reporting "Idle". This has become a problem only in the last year or so. I'm using a pretty beefy i7 device with an NVMe drive, and while Baloo is enabled the computer is often slow and freezes from time to time.

Looking at CPU usage I see `baloo_file` takes 50%~80% CPU, and loadavg is around 3~4.5 (on a 4-core system).

Looking at I/O:

---8<---
$ pidstat -G balo[o] -dl 5 1; balooctl status
Linux 5.4.0-17-generic (vesho)  03/20/2020  _x86_64_  (8 CPU)

12:48:52 AM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
12:48:57 AM  1000     80482  26164.94  64858.96      0.00     170  /usr/bin/baloo_file

Average:      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
Average:     1000     80482  26164.94  64858.96      0.00     170  /usr/bin/baloo_file
Baloo File Indexer is running
Indexer state: Idle
Total files indexed: 297,288
Files waiting for content indexing: 0
Files failed to index: 0
Current size of index is 2.56 GiB
----8<----

So balooctl reports "Idle" while baloo_file pushes >60 MB/sec to the drive and does a not insignificant amount of reading.

In the .xsession-errors log I can see a lot of messages like this:

----8<----
org.kde.baloo.engine: DocumentDB::get 307907124573241397 MDB_NOTFOUND: No matching key/data pair found
----8<----
A very real and annoying issue. I've kept Baloo disabled for years now, due to it putting my hard drive in "disk sleep" and causing processes on the system to freeze while waiting for drive access. Nowadays I have a different HDD setup so I managed to enable it with some directories blacklisted. Still eats more RAM than it should... if it's not drive I/O it's gonna be the memory or CPU.
It could be that there are several different issues being "bundled together".

1... There are, for example, problems with openSUSE when it runs BTRFS with multiple subvols. Check by finding one of the indexed files and trying the following:
    stat testfile
    balooshow -x testfile
and
    baloosearch -i filename:testfile
The "stat" gives you the device and inode number of the file. You should see the same numbers listed in the "balooshow -x" results. See:
    https://bugs.kde.org/show_bug.cgi?id=402154#c12
If the device/inode numbers change for a file, baloo will think it is a different file and index it again. You can see this evidenced in the "baloosearch -i" results: you can get multiple results (different IDs, same file).

2... Repeated spike loads at logon. In cases where there are *very* *many* new files, even if content indexing is disabled, the initial scan by baloo_file takes too many resources. My reading of the behaviour is that baloo_file does not "batch up" updates to the index as it discovers new/changed/deleted files. There's therefore no hint (looking at "balooctl status") that there's any progress being made; it may be that the indexing is "Idle" as just an initial scan is being done (and not content indexing), and the RAM used by baloo_file can grow steadily (potentially extending into swap space). As per Bug 394750:
    https://bugs.kde.org/show_bug.cgi?id=394750#c13
If the updates from an "initial scan" are done as a single transaction, there are no checkpoints. Killing the process and starting again, rebooting, or logging out and back in again will start "from scratch". Bug 428416 is also interesting in terms of what baloo_file is doing when it deals with a large indexing run.

3... It seems likely that with baloo reindexing files as they reappear with different IDs (as per '1' above) the index size balloons, on disc and in terms of pages pulled into memory. This will compound issue '2'.

4... On a positive note, the impact (as seen by the user) of a sync of the dirty pages to disc could be manageable if the index is on an SSD. Comment 19 argues against increasing the batch size (the data will have to be written at some time). This would hammer HDD users but maybe has less impact on SSD users. With an SSD, there's the counter-argument that you want to avoid frequent rewrites to prolong the life of the disc. Gut feeling is that with a larger batch size, the data written to disc is less in total.

Wishlist/Proposals/Suggestions

I think baloo needs to "batch up" its transactions in its initial scan. If I were to suggest "how often", I'd pick a time interval, maybe every 15 or 30 seconds (see the sketch at the end of this comment).

It would be nice to have a "balooctl" option (or a setting within baloofilerc) to tune the batch size used for baloo_file_extractor. That would make it possible to do indexing comparisons "in the real world".

Consider this as a "Where are we?" summary; an attempt to collect together different threads and weave in new evidence.
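A minimal sketch of the time-based batching idea (the transaction helpers here are hypothetical stand-ins, not Baloo's actual API):

#include <QElapsedTimer>
#include <QString>
#include <QStringList>

// Hypothetical index transaction, standing in for whatever baloo_file uses.
struct IndexTransaction {
    void addFile(const QString &path) { (void)path; /* stage one file's metadata */ }
    void commit()                     { /* flush the staged changes to the DB */ }
};

// Commit pending changes whenever either limit is reached, so an interrupted
// initial scan keeps its progress and memory use stays bounded.
void initialScan(const QStringList &files)
{
    const int    maxBatchFiles = 1000;       // tunable batch size
    const qint64 maxBatchMs    = 30 * 1000;  // tunable batch interval (30 s)

    IndexTransaction txn;
    int pending = 0;
    QElapsedTimer timer;
    timer.start();

    for (const QString &path : files) {
        txn.addFile(path);
        ++pending;
        if (pending >= maxBatchFiles || timer.elapsed() >= maxBatchMs) {
            txn.commit();                    // checkpoint: progress survives a restart
            txn = IndexTransaction();
            pending = 0;
            timer.restart();
        }
    }
    txn.commit();                            // flush the final partial batch
}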
The issue seems to have gotten somewhat better these days, especially with the latest Plasma version 5.22. Though I've since moved to using an SSD / NVMe drive, which might be why disk sleep isn't as bad as it used to be during indexing.

Another issue now seems to be that the baloo processes are using more memory than I wish they did, given the number of files indexed. If anyone has a large HDD but not enough RAM, they'll need to blacklist every large directory.
(In reply to tagwerk19 from comment #31)
> Consider this as a "Where are we?" summary; an attempt to collect together
> different threads and weave in new evidence.

Weaving in a couple of extra references "for completeness":

5... Removing baloo records for deleted files seems to be slow (more I/O intensive than the original indexing). See Bug 442453.

6... Running a "balooctl status" while baloo is removing records for deleted files causes memory consumption and index size to balloon. See Bug 437754.
Hi,

one comment I have not seen in the long list since 2014: the slowdown appeared just as I had upgraded to 20.04 LTS, and I remember that I had the same problem 3 years ago after upgrading to 18.04. So I spent a day or two leaving the computer on so it would get over the indexing (during a weekend).

Wouldn't it be nice if the database were left as it is while upgrading?
(In reply to pierre from comment #34)
> The slowdown appeared just as I had upgraded to 20.04 LTS, and I remember
> that I had the same problem 3 years ago after upgrading to 18.04.

You would get a reindexing if the device number of your discs changed. You can see if that has happened if you run

    $ baloosearch -i filename:"one of your files"

and you get multiple results with different IDs. Check the file itself:

    $ stat "one of your files"

and compare the device details:

    Device: fc01h/64513d   Inode: 1053347   Links: 1

Beyond that, I'm not sure. I don't remember having run into the issue myself.
One more observation for the collection. It may be that "spike loads" in memory usage trigger OOM protection and baloo_file_extractor and baloo_file are killed. Tangentially observed in Fedora 35: https://bugs.kde.org/show_bug.cgi?id=443547#c2 but needs a closer look...
(In reply to tagwerk19 from comment #35)
> $ baloosearch -i filename:"one of your files"
> and you get multiple results with different IDs. Check the file itself
> $ stat "one of your files"

Hi,

Just 1 file, but chosen at random. Actually, this file might not have been "baloo-ed" before I killed baloo_file_extractor. There is no way to find out except by sample-polling files and testing them the way you suggest, is there? (way beyond my ability)
Another reference "for completeness": 8... baloo_file_extractor can get caught on files that require hours to index, the example case being a PDF containing a scientific plot. The plot itself is compressed data with little indexable content and unpacking it may require more RAM than you have available See https://bugs.kde.org/show_bug.cgi?id=380456#c21 It's possible that such indexing attempts trigger OoM protections and therefore never complete. It would make sense to have time/memory limits for such actions (and flag the file as "failed" if the extraction exceeds them).
(In reply to tagwerk19 from comment #38)

+1 on that idea. Dolphin actually has a file size limit for generating thumbnails; in Manjaro you need to manually remove it or most images won't get thumbnails at all. It would be more than logical to have something like this for Baloo, indicating a size limit past which a file's contents will not be indexed (only its name and location). Thanks for this suggestion.
Wow, kudos for the new website design for Bugzilla.

It could also be a limit up to which files would be indexed, i.e. go for the first 100 KiB and ignore the rest, instead of just indexing the file name in such cases. Not sure whether it is worth doing it this way.

IMHO it depends on the type of file. For a lot of file formats it would only make sense to index metadata for larger files, like video or sound or image files. I think and hope that Baloo is already doing this. Other large files are archives like tarballs or ZIP files.
(In reply to Martin Steigerwald from comment #40)
> ... go for the
> first 100 KiB and ignore the rest, instead of just indexing the file name in
> such cases...

At the moment there's a 10 MByte limit for text or html:
https://bugs.kde.org/show_bug.cgi?id=410680#c7

My personal preference would be that the first 10 MB is indexed and the rest ignored, but it seems that if the file is (more or less) larger than 10 MB, it's not indexed at all.
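A minimal sketch of the difference between the two behaviours (hypothetical helpers, not Baloo's actual code):

#include <QByteArray>
#include <QFile>
#include <QString>

static const qint64 kMaxIndexedBytes = 10 * 1024 * 1024;  // assumed 10 MiB limit

// Current behaviour (as I understand it): skip the content entirely if the file is too big.
QByteArray contentSkipIfTooBig(const QString &path)
{
    QFile f(path);
    if (!f.open(QIODevice::ReadOnly) || f.size() > kMaxIndexedBytes)
        return QByteArray();              // only name/location get indexed
    return f.readAll();
}

// Preferred behaviour: index the first 10 MiB and ignore the rest.
QByteArray contentTruncated(const QString &path)
{
    QFile f(path);
    if (!f.open(QIODevice::ReadOnly))
        return QByteArray();
    return f.read(kMaxIndexedBytes);      // at most the first 10 MiB
}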
(In reply to tagwerk19 from comment #41)

Yeah, 100 MB sounds like a good default limit for all files. I'd make it an option in the search settings, of course; users should be able to customize this based on the number of files they have and the power of their computer.
(In reply to tagwerk19 from comment #38)
> It would make sense to have time/memory limits for such actions (and
> flag the file as
> "failed" if the extraction exceeds them).

I was thinking about this and similar I/O problems, and decided to have a look at how Gnome's "tracker" is handling things these days. I'm going to document my findings here in the hope that they're useful as inspiration for how we might handle similar problems. I think it's an important point of comparison for Baloo.

I have mostly positive things to say, although Tracker also has some flaws (it didn't pick up my XDG Documents folder by default, it didn't index the contents of files with text/plain mimetypes that don't have file extensions, and it uses a large amount of CPU while searching in Nautilus).

* I enabled Tracker to index my home folder (with content indexing) and it uses 474 MB on my $HOME. I've completely disabled content indexing for Baloo, but it's somehow using 1.4 GB. Suffice it to say that Baloo is weirdly inefficient. (ContentIndexingDB is empty, so it's not old content indexes.) More research is needed here; any suggestions appreciated.

* Unlike Baloo, Tracker does not hang when given pathological files. (See the link in tagwerk19's comment for an example.) I get a very sensible "Crash/hang handling file" message in the log for this file and it's otherwise ignored. Among other checks, they appear to kill the process if the content indexer takes more than 30 seconds on a file, which seems quite reasonable:
https://gitlab.gnome.org/GNOME/tracker-miners/-/blob/master/src/tracker-extract/tracker-extract.c

* They have some cool features around full text search, including unaccenting and case folding, and use SPARQL for queries:
https://wiki.gnome.org/Projects/Tracker/Features
I haven't seen enough documentation from Baloo to know how we stack up there.

* Tracker and Baloo both blacklist source code files by default, among several other types. Baloo doesn't expose this to the user in the UI, which I think might surprise some users who expect more configurability from KDE.

* Tracker seems not to be very configurable. There's a bit of under-the-hood adjustment possible, but mostly the focus seems to be on having good heuristics out of the box.

I don't think we could trivially swap Tracker for Baloo and have everything we need work. We'll need to keep improving Baloo. :-)

This comment might be better off on the Wiki somewhere, but it seems pretty underutilized and I'm not sure where I'd put it or if anyone would even read it there.
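For comparison, a per-file timeout around an out-of-process extractor could look roughly like the sketch below. This is an illustration of the idea only; the extractor binary name is hypothetical and the 30-second budget is borrowed from the Tracker description above, not from Baloo's current behaviour:

#include <QDebug>
#include <QProcess>
#include <QString>

// Run a (hypothetical) standalone extractor on one file and give up after
// 30 seconds, flagging the file as failed instead of hanging the indexer.
bool extractWithTimeout(const QString &filePath)
{
    QProcess extractor;
    extractor.start(QStringLiteral("baloo_file_extractor_single"), { filePath });

    if (!extractor.waitForStarted(5000))
        return false;

    if (!extractor.waitForFinished(30 * 1000)) {       // per-file time budget
        extractor.kill();                              // pathological file: give up
        extractor.waitForFinished(1000);
        qWarning() << "Extraction timed out, marking as failed:" << filePath;
        return false;
    }
    return extractor.exitStatus() == QProcess::NormalExit
        && extractor.exitCode() == 0;
}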
As per:
https://bugs.kde.org/show_bug.cgi?id=404057#c43

I think the dust has probably settled after:
https://invent.kde.org/frameworks/baloo/-/merge_requests/131
and, cherry-picked for KF5:
https://invent.kde.org/frameworks/baloo/-/merge_requests/169

There have also been:
https://invent.kde.org/frameworks/baloo/-/merge_requests/121
and
https://invent.kde.org/frameworks/baloo/-/merge_requests/148

I reran the "torture test" suggested in Bug 404057 and Baloo indexed the data without issues. I think we are in far better shape (and have SSDs rather than HDDs).

Do we need to keep this issue open, or is it possible to close it?
*** Bug 446071 has been marked as a duplicate of this bug. ***
*** Bug 461256 has been marked as a duplicate of this bug. ***
*** Bug 492583 has been marked as a duplicate of this bug. ***
Created attachment 178734 [details]
Flamegraph for baloo_file_extractor

I've recorded a flamegraph with hotspot on openSUSE TW with Plasma 6.3.0, Frameworks 6.11.0, Qt 6.8.2, when baloo was constantly using ~20-40% CPU and a few GB of RAM and made my system with a 12-core 5900X and an NVMe drive lag significantly.

You can see that Baloo spends most of its time in `Baloo::PositionDB::get(QByteArray const&)` and `Baloo::PostingDB::get(QByteArray const&)`, called from `Baloo::WriteTransaction::commit()`.
(In reply to postix from comment #48)
> Created attachment 178734 [details]
> Flamegraph for baloo_file_extractor

One thing where flamegraphs confuse me: is mdb_get really slow, or is it just called way too many times?

On my system baloo_file is consuming ~30% of *a* CPU, almost constantly, but it has a RES of 16 MB and is running with nice 19, so I don't care that much. Its virtual size is 2.7 GB, which is mostly mapped file (i.e. don't care) and about 500 MB of heap that is apparently swapped out - and I'm not running out of swap anytime soon, so I don't care that much either.

When you look at RAM usage, try not to consider VIRT a meaningful metric - mapped files don't actually use a significant amount of system resources, and unless you are running out of swap space or working RAM (i.e. not buffers/cache), even swapped-out heap isn't much of a bother. 500 MB of heap is significant and worrying, but it shouldn't be an immediate problem for day-to-day operations on a modern PC.
(In reply to postix from comment #48)
> You can see that Baloo spends most of its time in
> `Baloo::PositionDB::get(QByteArray const&)` and
> `Baloo::PostingDB::get(QByteArray const&)`, called from
> `Baloo::WriteTransaction::commit()`

A lot has changed since this bug was originally reported: an understanding of the effects of changing device numbers, as occurs with BTRFS, and the systemd constraints on RAM usage. I would say it's worth opening a new bug, including details of the system, and we can try to pin down what's happening.

The first thing I'd want to check is whether the index holds multiple records for a file; you can check with "baloosearch -i ...." (or baloosearch6). If you get multiple hits for one file then the index is holding unnecessary info and has to read and rewrite the records, which are far, far larger, whenever anything changes. I suspect you would quite easily see this as CPU load - as far as I remember Baloo sorts the entries within the records. Could be loads of work.
Created attachment 178802 [details]
Screenshot of htop

Answer to https://bugs.kde.org/show_bug.cgi?id=400704#c49

> but it has a RES of 16MB

Mine has a RES of ~3 GB, and I also see that 3.8 GB got swapped after ~10 minutes. Please see the screenshot of htop.
(In reply to postix from comment #48)
> ... made my system with a 12-core 5900X and an NVMe drive lag significantly ...

An additional thought here; I've noticed the impact of indexing in a couple of situations:

One is when Baloo has hit the limit of memory it is allowed to use. Normally it runs within systemd, with a unit file that sets a 500 MB limit on RAM. If Baloo is indexing *a lot*, it may hit that limit and discard (and repeatedly reread) clean pages or start swapping dirty pages. Both mean that there is far more I/O to the index file, and when you start swapping, you are in trouble. You can see what memory is used with a "systemctl status --user kde-baloo" and look at the "Memory" line...

Second is if you have deleted a large collection of files - meaning that Baloo has to catch up with removing entries from the index, something that is equivalent to indexing the files in the first place.

Both of these come into play if you are content indexing, not really when just indexing filenames.

> You can see that Baloo spends most of its time in
> `Baloo::PositionDB::get(QByteArray const&)` and
> `Baloo::PostingDB::get(QByteArray const&)`, called from
> `Baloo::WriteTransaction::commit()`
> I would say it's worth opening a new bug, including details of the system,
> and we can try to pin down what's happening.

Done, please see bug #500665.
(In reply to postix from comment #53)
> Done, please see bug #500665.

Referencing a discovery in that bug:
https://bugs.kde.org/show_bug.cgi?id=500665#c51

The issue was on a Tumbleweed system which had "DefaultMemoryAllocation=no", preventing the "MemoryMax=512MB" limit in the unit file for Baloo from working.

If Baloo, with a large index, is affecting performance on Tumbleweed (maybe other SUSE systems?), check that it is running under systemd with "systemctl status --user kde-baloo" and that the status includes a line reporting memory usage.