Bug 446071 - Baloo is currently not usable (performance problems)
Summary: Baloo is currently not usable (performance problems)
Status: REPORTED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: Baloo File Daemon
Version: 5.88.0
Platform: Arch Linux / Linux
Importance: NOR major
Target Milestone: ---
Assignee: baloo-bugs-null
URL:
Keywords:
Depends on: 400704
Blocks:
Reported: 2021-11-25 11:48 UTC by sourcemaker
Modified: 2023-12-21 11:30 UTC

See Also:
Latest Commit:
Version Fixed In:


Description sourcemaker 2021-11-25 11:48:26 UTC
Baloo has been running on my desktop system for 5 days and I still see no significant progress.

Once the file indexer is running, the desktop can no longer be used. 
No response time. No mouse pointer. Dead.
So currently, I have to run the desktop search every night.

Performance comparison 
====================
Yesterday (files waiting for content indexing):
122.834 

Today: 
120.836 

Difference: 
1998

1998 files in 8 hours? That's very slow!
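As a rough sketch of what that rate implies, the snippet below takes the two "waiting" counts and the 8 hours from this report and estimates how long the remaining queue would take at the same pace. The function name eta_hours is made up for illustration, and the dots in the figures above are read as thousands separators.

```shell
#!/bin/sh
# Estimate the remaining indexing time from two "files waiting" readings.
# eta_hours is a hypothetical helper; the numbers come from the report above.
eta_hours() {
    before=$1   # files waiting at the first reading
    after=$2    # files waiting at the second reading
    hours=$3    # hours of indexing between the two readings
    processed=$((before - after))
    # Integer arithmetic, rounded down.
    echo $(( after * hours / processed ))
}

eta_hours 122834 120836 8   # prints 483
```

At this pace the remaining 120.836 files would need roughly 483 hours, which is about 20 days of continuous indexing.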

balooctl status
====================
Baloo File Indexer is running
Indexer state: Idle
Total files indexed: 368.400
Files waiting for content indexing: 120.836
Files failed to index: 0
Current size of index is 44,92 GiB
Comment 1 tagwerk19 2021-11-25 20:32:08 UTC
(In reply to sourcemaker from comment #0)
> balooctl status
> ...
> Current size of index is 44,92 GiB
Which I suspect is more than your RAM...

See what "balooctl indexSize" says, particularly if there's a big difference between "File Size" and "Used". I think you are not using BTRFS? I seem to remember you were using Arch...

I know there are times (when one process is reading the index while the indexer is writing) when the index size can explode.

There is an LMDB utility for copying/compressing the index, see:
    http://www.lmdb.tech/doc/man1/mdb_copy_1.html
Anecdotal experience is that "it worked a few times for me", your mileage may of course vary...
Comment 2 sourcemaker 2021-11-25 20:58:38 UTC
balooctl indexSize
==============
File Size: 47,23 GiB
Used:      2,50 GiB

           PostingDB:       2,03 GiB    81.320 %
          PositionDB:       3,13 GiB   125.602 %
            DocTerms:       1,25 GiB    49.970 %
    DocFilenameTerms:      24,62 MiB     0.963 %
       DocXattrTerms:       4,00 KiB     0.000 %
              IdTree:       4,73 MiB     0.185 %
          IdFileName:      26,76 MiB     1.047 %
             DocTime:      13,90 MiB     0.544 %
             DocData:       7,43 MiB     0.291 %
   ContentIndexingDB:       3,25 MiB     0.127 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       6,21 MiB     0.243 %

Memory
=======
16 GB
Comment 3 tagwerk19 2021-11-25 21:43:56 UTC
(In reply to sourcemaker from comment #2)
> File Size: 47,23 GiB
> Used:      2,50 GiB
I think you've not a lot to lose by trying mdb_copy. I'm not sure what the Arch package is, but "dnf install lmdb" works on Fedora. Then:
    pkill baloo
    cd ~/.local/share/baloo
    mdb_copy -n -c index index.new
and wait...

When finished, rename the files to swap them. You should see "balooctl status" and "balooctl indexSize" taking info from the compressed index.
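The rename step could look something like the sketch below. It keeps the original index as "index.old" so the swap can be undone. swap_index is a hypothetical helper name, and the directory layout is assumed to be the one used in the commands above.

```shell
#!/bin/sh
# Swap a compacted copy of the Baloo index into place, keeping a backup.
# swap_index is a hypothetical helper; run it only while Baloo is stopped.
swap_index() {
    dir=$1
    # Refuse to touch anything unless the compacted copy exists.
    [ -f "$dir/index.new" ] || { echo "no index.new in $dir" >&2; return 1; }
    mv "$dir/index" "$dir/index.old" &&
    mv "$dir/index.new" "$dir/index"
}

# Usage, by hand, following the steps above:
#   pkill baloo
#   cd ~/.local/share/baloo
#   mdb_copy -n -c index index.new
#   swap_index ~/.local/share/baloo
#   balooctl indexSize
```

Once the new index is confirmed to work, "index.old" can be deleted to recover the disk space.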

Good luck
Comment 4 sourcemaker 2021-11-25 23:46:52 UTC
mdb_copy -n -c index index.new
=========================
File Size: 26,50 GiB
Used:      2,50 GiB

           PostingDB:       2,03 GiB    81.320 %
          PositionDB:       3,13 GiB   125.602 %
            DocTerms:       1,25 GiB    49.970 %
    DocFilenameTerms:      24,62 MiB     0.963 %
       DocXattrTerms:       4,00 KiB     0.000 %
              IdTree:       4,73 MiB     0.185 %
          IdFileName:      26,76 MiB     1.047 %
             DocTime:      13,90 MiB     0.544 %
             DocData:       7,43 MiB     0.291 %
   ContentIndexingDB:       3,25 MiB     0.127 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       6,21 MiB     0.243 %
Comment 5 tagwerk19 2021-11-26 08:47:08 UTC
(In reply to sourcemaker from comment #4)
> mdb_copy -n -c index index.new
> =========================
> File Size: 26,50 GiB
> Used:      2,50 GiB
Hmm... Not as much space recovered as I hoped. I am guessing that won't help you much...

There's a behaviour with LMDB that if one process is reading the index when another wants to write, the data written is appended. It's there to help "crash proof" the index.

You might meet this in a baloo context if you do "balooctl status", as this counts the files "indexed" and "to be done". If you are counting a large number of files and indexing at the same time, you might fall into the trap. I met it after deleting some thousands of files; have a look at Bug 437754.

I wonder if your next step is to reindex from scratch, keeping an eye on progress with "balooctl monitor"; maybe restricting the directories you are interested in, at least initially. A common compromise is Documents, Music, Pictures, Videos.
Comment 6 sourcemaker 2021-11-26 12:28:00 UTC
I hope there are updates soon.
In the current version it is unfortunately a waste of time.
Comment 7 tagwerk19 2021-11-26 16:00:37 UTC
(In reply to sourcemaker from comment #6)
> In the current version it is unfortunately a waste of time.
We don't know, in your case, what triggered the index file size to be so much larger than the "used"; the matching up to Bug 437754 is something of a guess.

However, if this is an issue (when content indexing as well as when doing bulk deletes), maybe baloo_file_extractor could "hold off" committing a transaction if there's another process reading.

No idea whether there's a practical way of doing this; it would need someone with pretty deep knowledge of the baloo code and LMDB to be able to say.
Comment 8 sourcemaker 2023-09-03 16:23:56 UTC
Is there any news about this problem?
Comment 9 tagwerk19 2023-09-03 17:02:35 UTC
(In reply to sourcemaker from comment #8)
> Is there any news about this problem?
There's been a change here:
    https://invent.kde.org/frameworks/baloo/-/merge_requests/124
that makes use of the ability to limit memory usage that systemd gives.

The change is pretty aggressive, limiting memory usage to 512M. There's a follow-on change here:
    https://invent.kde.org/frameworks/baloo/-/merge_requests/148
that fixes one of the problems that constraining the memory triggers.

My guess about "not usable" is that it's memory (or swap) dependent rather than CPU or IO. I've been setting my limits to 50% RAM and zero swap. Your mileage, as they say, may vary...
Comment 10 sourcemaker 2023-09-10 02:43:02 UTC
Unfortunately Baloo still doesn't work.
16 GB RAM and Baloo doesn't finish.
Comment 11 dietervdwes 2023-09-21 18:00:56 UTC
Operating System: Debian GNU/Linux 12
KDE Plasma Version: 5.27.5
KDE Frameworks Version: 5.103.0
Qt Version: 5.15.8
Kernel Version: 6.1.0-10-amd64 (64-bit)
Graphics Platform: X11
Processors: 4 × Intel® Core™ i7-7600U CPU @ 2.80GHz
Memory: 15,5 GiB of RAM
Graphics Processor: Mesa Intel® HD Graphics 620
Manufacturer: Dell Inc.
Product Name: Latitude 7480

System continually seems to run the baloo_file_extractor, quite frustrating so I've just suspended it.
balooctl status output:

balooctl status
kf.i18n: KLocalizedString: Using an empty domain, fix the code. msgid: "Unknown" msgid_plural: "" msgctxt: ""
kf.i18n: KLocalizedString: Using an empty domain, fix the code. msgid: "Indexing file content" msgid_plural: "" msgctxt: ""
Baloo File Indexer is running
Indexer state: Indexing file content
Total files indexed: 514 133
Files waiting for content indexing: 187 525
Files failed to index: 0
Current size of index is 8,72 GiB
(base) dieter@dell7480:~$ balooctl indexSize
File Size: 8,72 GiB
Used:      1,32 GiB

           PostingDB:       2,43 GiB   183.621 %
          PositionDB:       1,69 GiB   127.993 %
            DocTerms:       1,09 GiB    82.760 %
    DocFilenameTerms:      28,98 MiB     2.139 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       7,56 MiB     0.558 %
          IdFileName:      32,74 MiB     2.417 %
             DocTime:      19,79 MiB     1.461 %
             DocData:       8,41 MiB     0.621 %
   ContentIndexingDB:       4,77 MiB     0.352 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       5,80 MiB     0.428 %
(base) dieter@dell7480:~$ balooctl suspend
Comment 12 tagwerk19 2023-09-21 20:48:18 UTC
(In reply to dietervdwes from comment #11)
> System continually seems to run the baloo_file_extractor, quite frustrating so I've just suspended it.
Do you see it indexing? If you run:
    balooctl monitor
does it report files being indexed? Should happen in batches of 40.

Could you be running BTRFS? There is a bug where BTRFS discs were mounted with "varying" device numbers; the device number wasn't stable from reboot to reboot. Baloo uses a combination of the device number and inode as an internal "ID" for indexed files. If it sees a file "reappear" with a different ID, it thinks it's a new file that should be indexed again. This caught OpenSUSE people a lot and then Fedora a little. There's a patch on the way.
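To check whether this applies, one can compare the device number and inode of a file before and after a reboot. The following is a small sketch using GNU stat. file_key is a hypothetical helper, and the "device:inode" output is only for display; it is not Baloo's actual internal ID encoding.

```shell
#!/bin/sh
# Print "device:inode" for a file so the values can be compared across reboots.
# file_key is a hypothetical helper; requires GNU coreutils stat.
file_key() {
    # %d = device number (decimal), %i = inode number
    stat -c '%d:%i' -- "$1"
}

# Usage: note the value, reboot, then run it again on the same file:
#   file_key ~/Documents/some-file.txt
```

If the part before the colon changes between reboots while the file is untouched, the file would look "new" to an indexer that keys on both numbers.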

Final thing to try, as mentioned in comment 9, is to run:
    systemctl status --user kde-baloo.service
and see if the memory (RAM) is being constrained to 512M. This can slow down indexing to a crawl, particularly when baloo starts to swap. There's a balancing act here; I've changed my MemoryHigh to 50% (and MemorySwapMax to 0) with
    systemctl edit --user kde-baloo.service
Comment 13 dietervdwes 2023-09-22 05:27:33 UTC
(In reply to tagwerk19 from comment #12)
Thanks for the advice @tagwerk19@innerjoin.org! 
- Using ext4 filesystem on a 500 GB SSD.
- In task manager it seems to use ~6 GB of RAM (of 16).
- It seems like it did stop correctly now when the laptop's power plug was removed (which didn't seem to happen previously) and started again when I plugged it in.
- It seems to index quite a lot of cache files etc. I will look into whether it's possible to restrict indexing of certain folders, e.g.:

Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/00283fc3ee9c5ea7_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/23a15434b2138b9a_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/3db4f9689a74b257_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/e672388e6f5f77e4_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/1eff7d439b2e4b3c_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/c53849efe36a4cc6_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/2acb2a00b0fb6992_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/9c3f563612e461f6_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/9a1a919e044c1354_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/304857e39b157c35_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/b1d3fec74b3f4f98_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/46340e3d2df5b165_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/e0a41d38b2aea0f0_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/327824f1949fc9b3_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/233c36800c5209ff_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/e9cb387796466985_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/814dbb36d4e9b40d_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/4d0b64368efc240a_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/e567041e16f71b7f_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/f5bba5d32579072d_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/da787546514876b6_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Code Cache/js/9ae5a9b974e524b3_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/f4beda8473c0e78b_0: Ok
Indexing: /home/dieter/.config/google-chrome/Default/Service Worker/CacheStorage/d55a62ee4934dd0a67863044121c781b05e4f716/f22490e6-d91f-4273-be81-98f26c6966d0/a184eabc3b57ad8c_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/c7da8b20dd325d1a_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/b345c172ded21244_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/a68920e68d520f0e_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/524955e6f54a61fa_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/9dbb1301c3ee55b4_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/b1369a74ae8446f2_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/813fe8f08248c179_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/6a875ccda31c4a0f_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/13f5183ba5587dfc_0: Ok
Indexing: /home/dieter/.config/google-chrome/Default/Service Worker/CacheStorage/d55a62ee4934dd0a67863044121c781b05e4f716/f22490e6-d91f-4273-be81-98f26c6966d0/e6c97907d5a7e71c_1: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/bae3495d40a21ebf_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Code Cache/js/fdc4f366870daa4c_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/c9378570e252edac_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/ad197d4d75c10818_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/6ad7d699c9129519_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/c0f2678d6ddba309_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/47636d6f10c1745a_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/d481bbb76bdd6293_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/155fb8b05c934a85_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/f3dd65a9959b6da6_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/52f5c13cb374fd7d_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/16c00a9369d87c23_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/84a7aa4b8d92a5fe_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/d050372f244e475f_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/435344909b381574_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/1aa212e23c789916_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/68831b9972d75863_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/fdea91b5b0c66970_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/3ded238b254e8923_0: Ok
Indexing: /home/dieter/.config/google-chrome/Default/Service Worker/CacheStorage/d55a62ee4934dd0a67863044121c781b05e4f716/f22490e6-d91f-4273-be81-98f26c6966d0/e41978e4f01f6216_0: Ok
Indexing: /home/dieter/.cache/google-chrome/Default/Cache/Cache_Data/e69d228168e25820_0: Ok
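Monitor output like the above can be summarised by directory to see where the indexer spends its time. This is only a sketch: top_dirs is a hypothetical helper, it assumes the "Indexing: <path>: Ok" line format shown here, and it assumes the output of "balooctl monitor" has been captured to a file first.

```shell
#!/bin/sh
# Count "Indexing:" lines per directory prefix, busiest directory first.
# top_dirs is a hypothetical helper; depth is the number of "/"-separated
# fields to keep (5 keeps e.g. /home/<user>/.cache/google-chrome).
top_dirs() {
    depth=${1:-5}
    sed -n 's/^Indexing: //p' |
        cut -d/ -f1-"$depth" |
        awk '{ count[$0]++ } END { for (d in count) print count[d], d }' |
        sort -rn
}

# Usage, assuming the monitor output was saved to monitor.log:
#   top_dirs < monitor.log
#   top_dirs 4 < monitor.log    # group one level higher
```

Directories that dominate the list, such as browser caches, are candidates for exclusion from indexing.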
Comment 14 tagwerk19 2023-09-22 07:13:21 UTC
(In reply to dietervdwes from comment #13)
> ... It seems like it did correctly now when the power plug of laptop was
> removed (which didn't seem to happen previously) and started again when I
> plugged it in ...
Yes, it should notice when on battery and stop content indexing. It finishes its current batch of files though.

> ... It seems to index quite a lot of cache files etc ...
Ahhh! Yes. If you've configured indexing of hidden files/folders you could catch a *lot* of files you don't know about. Have a look at Bug 434705 (even if we didn't find exactly what was happening in that particular case).
Comment 15 sourcemaker 2023-12-21 09:27:16 UTC
I'm currently trying to index the Akonadi directory with all emails.
Indexing takes far too long and never ends.
Comment 16 tagwerk19 2023-12-21 10:05:49 UTC
(In reply to sourcemaker from comment #15)
> I'm currently trying to index the Akonadi directory with all emails.
> Indexing takes far too long and never ends.
Are these separate .eml files, or big .mbox files with concatenated mails?

In both cases watch out for encoded attachments; you can be indexing strings you'll never want to search for.

Have a look at Bug 460882. If Akonadi stores mail in .eml or .mbox, you might want to append a comment to that...
Comment 17 sourcemaker 2023-12-21 10:30:16 UTC
It's stored as maildir.
Comment 18 tagwerk19 2023-12-21 11:30:19 UTC
(In reply to sourcemaker from comment #17)
> It's stored as maildir.
So there'll be some/several/many .mbox files (application/mbox without the .mbox suffix)

Baloo will try to index them (but won't separate out individual messages, you'll just get the .mbox file as a result). You will have trouble with encoded message parts (including attachments) giving Baloo a lot to do... and Baloo will also attempt to index the file however big it is, so a 1GB .mbox will probably kill it.

There's now a pretty strict systemd limit on Baloo (see "systemctl --user status kde-baloo"); it caps the RAM usage to 512 MB. This could significantly slow indexing of large .mbox files. Check to see how big your files are and think about changing the memory cap. My preferences (for what they are worth) are to increase the cap to 50% and prevent Baloo using Swap:

    MemoryHigh=50%
    MemorySwapMax=0B
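
For reference, these two settings belong in a [Service] section of a drop-in file for the user unit, which is what "systemctl edit --user kde-baloo.service" creates. A minimal sketch of writing the same drop-in by hand follows; write_baloo_override is a hypothetical helper name, and the values are the personal preferences quoted above rather than recommended defaults.

```shell
#!/bin/sh
# Write a systemd drop-in that raises Baloo's memory cap and disables swap use.
# write_baloo_override is a hypothetical helper; the optional argument
# overrides the drop-in directory (useful for trying it out elsewhere).
write_baloo_override() {
    dropin_dir=${1:-$HOME/.config/systemd/user/kde-baloo.service.d}
    mkdir -p "$dropin_dir" &&
    printf '[Service]\nMemoryHigh=50%%\nMemorySwapMax=0B\n' \
        > "$dropin_dir/override.conf"
}

# Usage:
#   write_baloo_override
#   systemctl --user daemon-reload
#   systemctl --user restart kde-baloo.service
#   systemctl --user status kde-baloo.service
```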