Bug 354636 - baloo_file_extractor consumes an ever-increasing amount of memory after upgrade to frameworks 5.80.0
Status: RESOLVED WORKSFORME
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: Baloo File Daemon (other bugs)
Version First Reported In: 5.80.0
Platform: Neon Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Vishesh Handa
Reported: 2015-10-31 09:09 UTC by Stéphane ANCELOT
Modified: 2024-08-10 03:46 UTC (History)
Attachments
System Monitor showing baloo processes (50.68 KB, image/png)
2021-03-17 01:37 UTC, Michael
htop showing baloo processes (180.73 KB, image/png)
2021-03-17 01:37 UTC, Michael
System Monitor showing baloo processes as a history graph (207.66 KB, image/png)
2021-03-17 01:38 UTC, Michael

Description Stéphane ANCELOT 2015-10-31 09:09:26 UTC
Hi,
For some unexpected reason, my system was slow.

I checked and found that baloo_file_extractor was consuming 1.2 GB of memory!

=> Baloo indexing is now disabled on my system (and besides, I do not need it).

It should be possible to adjust the memory and indexing settings.

Regards
Steph

Reproducible: Always
Comment 1 Mykola Krachkovsky 2015-11-26 08:05:41 UTC
Same for me (except it can eat more than 3 GB of memory). I believe that bug https://bugs.kde.org/show_bug.cgi?id=332421 should be reopened.
Comment 2 Vishesh Handa 2015-12-14 22:48:28 UTC
I'm afraid a generic "consumes too much memory" doesn't give us much information on how to fix this. This bug is very specific to a kind of file which was being indexed. It could even be a bug in the underlying library used to fetch the metadata from the file.

Please reopen the bug if you're willing to provide more information.

Relevant info which could be useful -
1. Try reproducing the issue with a fresh index (balooctl disable && balooctl enable); that is a good way to start.
2. Try excluding some folders which are being indexed, and possibly try and track the file down.
Comment 3 Stéphane ANCELOT 2015-12-15 07:49:26 UTC
Maybe you are not using your computer to store files...

Take a whole directory (about 1 GB or so) of mixed files: pictures (raw photos, high- and low-resolution JPEGs), SVG files, LibreOffice files, C/C++ development files (e.g. the Android source code), and so on,
plus a Thunderbird IMAP account with more than 1000 messages.
Lots of files are generated on our computers that we have no control over.
I mean a typical developer computer.
You may not have tried it with 1 GB or more of files.
That should easily trigger some problems.
Comment 4 Mykola Krachkovsky 2015-12-17 07:37:20 UTC
(In reply to Vishesh Handa from comment #2)
> 2. Try excluding some folders which are being indexed, and possibly try and
> track the file down.
This doesn't help. Even disabling all of home has no effect on baloo_file_extractor's memory consumption.

> 1. Try reproducing the issue with a fresh index (balooctl disable &&
> balooctl enable) and a good way to start.
This seems to help; at least now baloo_file_extractor consumes a reasonable amount of RAM.
So at the moment it looks like a bug related to old index files. I'll send more information if it begins to eat memory again.
Comment 5 Michael 2021-03-17 01:34:38 UTC
This is an old bug, but after the recent upgrade to Frameworks 5.80.0, Baloo gobbles up a tremendous amount of memory as it re-indexes all files in $HOME folder. It doesn't appear to free memory as it indexes. My system becomes laggy and unresponsive. 


STEPS TO REPRODUCE

1. You may need a $HOME directory with a lot of files. My personal case is 318GB of data, ~1M files, ~76K subdirectories, of all types of documents, audio, video.
2. Upgrade to Frameworks 5.80.0, reboot. 
3. Upon logging in, Baloo will want to re-index all files. Use "System Monitor" to observe baloo_file_extractor and baloo_file. 
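
For anyone who prefers numbers over screenshots, the observation in step 3 can also be logged from a terminal. The following is a minimal sketch, not part of Baloo: it parses the memory fields that Linux exposes in /proc/<pid>/status. The helper names (parse_status, read_status) are made up for this example, and it is only an assumption that System Monitor's "Shared Memory" roughly corresponds to RssFile + RssShmem.

```python
# Fields of interest in /proc/<pid>/status (values are reported in kB).
WANTED = {"VmRSS", "RssAnon", "RssFile", "RssShmem", "VmSwap"}


def parse_status(text):
    """Extract the WANTED memory fields from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in WANTED:
            # Lines look like "VmRSS:\t  123456 kB"; keep the number only.
            fields[key] = int(rest.split()[0])
    return fields


def read_status(pid):
    """Read and parse the status file of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())


def shared_estimate(fields):
    """Assumed equivalent of 'Shared Memory': file-backed plus shmem pages."""
    return fields.get("RssFile", 0) + fields.get("RssShmem", 0)
```

Calling read_status() every few seconds with the PID of baloo_file_extractor (for example as printed by pidof) and appending the result to a file gives a text history comparable to the attached graph.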


OBSERVED RESULT

Notice that the "Shared Memory" attribute of baloo_file_extractor continuously rises, not staying steady or falling. The "Memory" attribute is similarly high at 1G. In my case Shared Memory easily gets to 3.4G and my swap file becomes 5.4G, with no other apps running.


EXPECTED RESULT

Memory usage of Baloo should not keep rising and affecting the swap file. Memory usage should be constant.


SOFTWARE/OS VERSIONS

Operating System: KDE neon 5.21
KDE Plasma Version: 5.21.2
KDE Frameworks Version: 5.80.0
Qt Version: 5.15.2
Kernel Version: 5.4.0-67-generic
OS Type: 64-bit
Graphics Platform: X11
Memory: 7.7 GiB of RAM



ADDITIONAL INFORMATION

I'm attaching some screenshots of System Monitor and htop. Notice the memory, CPU and swap usage. They all are very high and as a side-effect, my laptop is not responsive. I had to let it run overnight to finish indexing, but even then, when I rebooted, it wanted to re-index everything again(!).
Comment 6 Michael 2021-03-17 01:37:06 UTC
Created attachment 136761 [details]
System Monitor showing baloo processes
Comment 7 Michael 2021-03-17 01:37:48 UTC
Created attachment 136763 [details]
htop showing baloo processes
Comment 8 Michael 2021-03-17 01:38:39 UTC
Created attachment 136764 [details]
System Monitor showing baloo processes as a history graph
Comment 9 Oded Arbel 2021-03-22 10:26:27 UTC
For me baloo has been acting up every now and then, and I'd like to finally get to the bottom of this. The behavior is currently catatonic and after posting this issue I will kill baloo but not change any configuration or data, so the behavior can be reproduced if someone wants to continue research on this issue.

1. baloo_file_extractor takes a lot of CPU and memory. Here's its line in htop:
----8<----
PID   USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
74566 odeda     39   19  257G 10.1G 9006M S 102. 32.2  8h02:42 /usr/bin/baloo_file_extractor
----8<----

(It's a 4-core system. The CPU usage looks like a single thread that tries to take up an entire CPU but is slowed down a bit by I/O on my fast NVMe; the "over 100%" is a sampling error on htop's part.)

2. `balooctl monitor` shows almost no activity, and from time to time bursts of a couple dozen entries that look like this:

----8<----
Indexing: /home/odeda/.cache/mozilla/firefox/i1m74zv1.default/cache2/entries/CE2BB927E036CFCEE27E7795DFB198E7C41A14B6: Ok
----8<----

It should not be indexing `~/.cache` as `~/.config/baloofilerc` has this:

exclude folders[$e]=$HOME/.cache/,$HOME/mnt/,$HOME/snap/,[and a few other things]

The weird excluded folder behavior may have something to do with the fact I have a trailing slash on my $HOME:

----8<----
$ balooctl config show excludeFolders
kf.baloo: Folder cache: std::vector("/home/odeda//.cache/": excluded, "/home/odeda//snap/": excluded, "/home/odeda//mnt/": excluded, "/home/odeda/": included)
/home/odeda//.cache/
/home/odeda//snap/
/home/odeda//mnt/
----8<----

3. The index file is huge - about 19GB, which doesn't make a lot of sense to me. `balooctl indexSize` has this to say:

----8<----
File Size: 18.75 GiB
Used:      948.13 MiB

           PostingDB:       2.93 GiB   316.627 %
          PositionDB:      85.44 MiB     9.011 %
            DocTerms:       1.39 GiB   149.920 %
    DocFilenameTerms:     152.72 MiB    16.107 %
       DocXattrTerms:       8.39 MiB     0.885 %
              IdTree:      35.69 MiB     3.764 %
          IdFileName:     175.18 MiB    18.476 %
             DocTime:      92.85 MiB     9.793 %
             DocData:      43.49 MiB     4.587 %
   ContentIndexingDB:     448.00 KiB     0.046 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:      26.48 MiB     2.793 %
----8<----

and to that I can only say "wahhh?!?!?"

Here's also `balooctl status`:

----8<----
Baloo File Indexer is running
Indexer state: Indexing file content
Total files indexed: 2,103,903
Files waiting for content indexing: 6,832
Files failed to index: 0
Current size of index is 18.75 GiB
----8<----
Comment 10 Oded Arbel 2021-03-22 11:16:35 UTC
> 3. The index file is huge - about 19GB, which doesn't make a lot of sense to
> me. `balooctl indexSize` has this to say:
> 
> ----8<----
> File Size: 18.75 GiB
> Used:      948.13 MiB
> 
>            PostingDB:       2.93 GiB   316.627 %
>           PositionDB:      85.44 MiB     9.011 %
>             DocTerms:       1.39 GiB   149.920 %
>     DocFilenameTerms:     152.72 MiB    16.107 %
>        DocXattrTerms:       8.39 MiB     0.885 %
>               IdTree:      35.69 MiB     3.764 %
>           IdFileName:     175.18 MiB    18.476 %
>              DocTime:      92.85 MiB     9.793 %
>              DocData:      43.49 MiB     4.587 %
>    ContentIndexingDB:     448.00 KiB     0.046 %
>          FailedIdsDB:            0 B     0.000 %
>              MTimeDB:      26.48 MiB     2.793 %
> ----8<----
> 
> and to that I can only say "wahhh?!?!?"

After reviewing the code at https://github.com/KDE/baloo/blob/master , I'm more befuddled by the above numbers:

1. "Used" is `DatabaseSize.expectedSize`
2. The percentages are computed by 100 * "entry size" / "Used", so the 316% makes sense as it is larger than "Used".
3. `DatabaseSize.expectedSize` is calculated (src/engine/transaction.cpp:474) by adding up the sizes of all of the entries listed!! so it cannot be smaller than the sum of its parts, unless one of the parts is negative - which it can't be as the sizes are of type `size_t`, which - unless something really weird is going on in the build server - should be unsigned long int.

There's something about page sizes, but that isn't relevant to the above calculation, which seems to suggest that a/(a+b) > 1 where both a and b are non-negative integers.
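
The mismatch can be checked with a few lines of arithmetic. This sketch only uses the figures printed by `balooctl indexSize` above; it does not read Baloo's code or the database. One observation, offered as a hypothesis rather than a diagnosis: the sum of the listed parts, reduced modulo 4 GiB (which is what a 32-bit counter would keep), lands very close to the reported "Used" value.

```python
# Sizes as printed by `balooctl indexSize`, converted to MiB.
GIB_IN_MIB = 1024
parts_mib = {
    "PostingDB": 2.93 * GIB_IN_MIB,
    "PositionDB": 85.44,
    "DocTerms": 1.39 * GIB_IN_MIB,
    "DocFilenameTerms": 152.72,
    "DocXattrTerms": 8.39,
    "IdTree": 35.69,
    "IdFileName": 175.18,
    "DocTime": 92.85,
    "DocData": 43.49,
    "ContentIndexingDB": 448.00 / 1024,  # printed as 448.00 KiB
    "FailedIdsDB": 0.0,
    "MTimeDB": 26.48,
}
reported_used_mib = 948.13  # the "Used" line

# Plain sum of all listed entries.
total_mib = sum(parts_mib.values())

# What would remain if the running total were truncated to 32 bits.
# This truncation is an assumption, not something confirmed in the code.
wrapped_mib = total_mib % (4 * GIB_IN_MIB)

# Percentage as described in point 2: 100 * entry size / "Used".
posting_pct = 100 * parts_mib["PostingDB"] / reported_used_mib

print(f"sum of parts:     {total_mib:.2f} MiB")
print(f"sum modulo 4 GiB: {wrapped_mib:.2f} MiB (reported Used: {reported_used_mib} MiB)")
print(f"PostingDB share:  {posting_pct:.1f} %")
```

The inputs are rounded to two decimals, so small differences from the printed 316.627 % are expected.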

BTW - here's the result of running the `mdb_stat` tool from lmdb-utils on the baloo index:

----8<----
$ mdb_stat -af <path-to-index-db>
Freelist Status
  Tree depth: 2
  Branch pages: 1
  Leaf pages: 41
  Overflow pages: 5046
  Entries: 3253
  Free pages: 2566315
Status of Main DB
  Tree depth: 1
  Branch pages: 0
  Leaf pages: 1
  Overflow pages: 0
  Entries: 12
Status of docfilenameterms
  Tree depth: 4
  Branch pages: 315
  Leaf pages: 38726
  Overflow pages: 0
  Entries: 2104603
Status of docterms
  Tree depth: 4
  Branch pages: 633
  Leaf pages: 79407
  Overflow pages: 284028
  Entries: 2103699
Status of documentdatadb
  Tree depth: 3
  Branch pages: 90
  Leaf pages: 11012
  Overflow pages: 38
  Entries: 664790
Status of documenttimedb
  Tree depth: 3
  Branch pages: 187
  Leaf pages: 23555
  Overflow pages: 0
  Entries: 2111124
Status of docxatrrterms
  Tree depth: 3
  Branch pages: 21
  Leaf pages: 2040
  Overflow pages: 86
  Entries: 31253
Status of failediddb
  Tree depth: 0
  Branch pages: 0
  Leaf pages: 0
  Overflow pages: 0
  Entries: 0
Status of idfilename
  Tree depth: 4
  Branch pages: 363
  Leaf pages: 44411
  Overflow pages: 0
  Entries: 2120309
Status of idtree
  Tree depth: 3
  Branch pages: 52
  Leaf pages: 6960
  Overflow pages: 2118
  Entries: 223613
Status of indexingleveldb
  Tree depth: 3
  Branch pages: 3
  Leaf pages: 49
  Overflow pages: 0
  Entries: 5471
Status of mtimedb
  Tree depth: 3
  Branch pages: 42
  Leaf pages: 6719
  Overflow pages: 0
  Entries: 2111124
Status of positiondb
  Tree depth: 4
  Branch pages: 6657
  Leaf pages: 735531
  Overflow pages: 328761
  Entries: 42876611
Status of postingdb
  Tree depth: 4
  Branch pages: 6181
  Leaf pages: 657348
  Overflow pages: 105167
  Entries: 45851508
----8<----
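
The page counts above can be turned into sizes for comparison with `balooctl indexSize`. This is a rough sketch under one assumption: a page size of 4 KiB (`mdb_stat -e` should print the actual value). The counts are copied from the output above; nothing here inspects Baloo itself.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB LMDB pages
MIB = 2**20
GIB = 2**30

# (branch, leaf, overflow) page counts copied from the mdb_stat output.
pages = {
    "docfilenameterms": (315, 38726, 0),
    "docterms": (633, 79407, 284028),
    "documentdatadb": (90, 11012, 38),
    "documenttimedb": (187, 23555, 0),
    "docxatrrterms": (21, 2040, 86),
    "failediddb": (0, 0, 0),
    "idfilename": (363, 44411, 0),
    "idtree": (52, 6960, 2118),
    "indexingleveldb": (3, 49, 0),
    "mtimedb": (42, 6719, 0),
    "positiondb": (6657, 735531, 328761),
    "postingdb": (6181, 657348, 105167),
}


def db_bytes(name):
    """Bytes occupied by one named database, given PAGE_SIZE."""
    return sum(pages[name]) * PAGE_SIZE


data_pages = sum(sum(counts) for counts in pages.values())
data_bytes = data_pages * PAGE_SIZE

for name in sorted(pages, key=db_bytes, reverse=True):
    size = db_bytes(name)
    # The last column shows what a 32-bit size field would hold (assumption).
    print(f"{name:18s} {size / MIB:10.2f} MiB   mod 4 GiB: {size % 2**32 / MIB:8.2f} MiB")

print(f"all databases: {data_bytes / GIB:.2f} GiB, "
      f"mod 4 GiB: {data_bytes % 2**32 / MIB:.2f} MiB")
```

Under the 4 KiB assumption, postingdb and docterms come out at about 2.93 GiB and 1.39 GiB, matching what indexSize printed. positiondb comes to about 4.1 GiB, far more than the printed 85.44 MiB, while its size modulo 4 GiB is about 87 MiB. That would be consistent with a 32-bit overflow somewhere in the size reporting, but it is only a hypothesis.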
Comment 11 tagwerk19 2021-03-22 21:10:15 UTC
(In reply to Oded Arbel from comment #10)
> $ mdb_stat -af <path-to-index-db>
> Freelist Status
>   ...
>   Free pages: 2566315
If it says 2566315 free pages (and a page is 4K?), that's a lot of space in the file not being used.

Have you tried copying the index with mdb_copy?

I've just tried
    mdb_copy -n -c index index.copy
It certainly seems to think for a while but the index.copy was smaller by 'more or less' the count of the free pages.
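
As a rough check of how much space those free pages represent, here is a small calculation. It assumes the 4 KiB page size guessed above and takes the file size from the earlier `balooctl indexSize` output; the "compacted" figure is only an estimate of what a compacting copy might produce.

```python
PAGE_SIZE = 4096        # assumption: 4 KiB pages
FREE_PAGES = 2566315    # "Free pages" from the mdb_stat freelist status
FILE_SIZE_GIB = 18.75   # "File Size" from balooctl indexSize

GIB = 2**30

# Space held by free pages inside the index file.
free_gib = FREE_PAGES * PAGE_SIZE / GIB

# Estimated size if the free pages were dropped by a compacting copy.
compacted_gib = FILE_SIZE_GIB - free_gib

# Share of the file that is currently unused.
free_share = 100 * free_gib / FILE_SIZE_GIB

print(f"free space:        {free_gib:.2f} GiB")
print(f"estimated compact: {compacted_gib:.2f} GiB")
print(f"unused share:      {free_share:.1f} %")
```

So, if the page size really is 4 KiB, a little over half of the file is free space.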
Comment 12 Oded Arbel 2021-03-23 05:14:14 UTC
(In reply to tagwerk19 from comment #11)
> (In reply to Oded Arbel from comment #10)
> > $ mdb_stat -af <path-to-index-db>
> > Freelist Status
> >   ...
> >   Free pages: 2566315
> If it says 2566315 free pages (and a page is 4K?), that's a lot of space in
> the file not being used.
> 
> Have you tried copying the index with mdb_copy?
> 
> I've just tried
>     mdb_copy -n -c index index.copy
> It certainly seems to think for a while but the index.copy was smaller by
> 'more or less' the count of the free pages.

Shouldn't baloo "auto trim" the index by itself? This is not something a user would know to do. Also - doesn't explain the weird percentages.
Comment 13 tagwerk19 2021-03-23 12:59:52 UTC
(In reply to Oded Arbel from comment #12)
> Shouldn't baloo "auto trim" the index by itself? This is not something a
> user would know to do. Also - doesn't explain the weird percentages.
I'm reading
    http://www.lmdb.tech/doc/
Looks like if the database has 'grown' it does not shrink. Free pages are however reused. Question is whether this has an impact on performance...
Comment 14 tagwerk19 2021-03-25 07:02:34 UTC
(In reply to Oded Arbel from comment #9)
> The weird excluded folder behavior may has something to do with the fact I
> have a trailing slash on my $HOME:
> 
> ----8<----
> $ balooctl config show excludeFolders
> kf.baloo: Folder cache: std::vector("/home/odeda//.cache/": excluded,
> "/home/odeda//snap/": excluded, "/home/odeda//mnt/": excluded,
> "/home/odeda/": included)
> /home/odeda//.cache/
> /home/odeda//snap/
> /home/odeda//mnt/
> ----8<----

Oooh. Indeed.

If I "bend things" so I have a trailing slash in my $HOME, the include/exclude folders lines (for subfolders) in baloofilerc stop working.

If I include
    folders[$e]=$HOME
then a
    exclude folders[$e]=$HOME/.cache/
doesn't work

If I want to index a set of subfolders,
    folders[$e]=$HOME/Documents/,$HOME/Music/,$HOME/Pictures/,$HOME/Videos/
doesn't work.

It's not going to catch many people but it's probably worth reporting as a separate bug.
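
How a doubled slash can defeat an exclude list is easy to show with plain string handling. The sketch below is not Baloo's implementation; it assumes, purely for illustration, that exclusion is decided by a string-prefix comparison, and the function names are hypothetical.

```python
import posixpath


def is_excluded_naive(path, excluded):
    """Prefix comparison on the raw strings, with no normalization."""
    return any(path.startswith(prefix) for prefix in excluded)


def is_excluded_normalized(path, excluded):
    """Same check after collapsing '//' and trailing '/' on both sides."""
    path = posixpath.normpath(path)
    for prefix in excluded:
        prefix = posixpath.normpath(prefix)
        # Match the folder itself or anything below it.
        if path == prefix or path.startswith(prefix + "/"):
            return True
    return False


# $HOME with a trailing slash, expanded the way the config output shows.
home = "/home/odeda/"
excluded = [home + "/.cache/", home + "/snap/", home + "/mnt/"]

# A path as the file system reports it (single slashes).
candidate = "/home/odeda/.cache/mozilla/firefox/entry"

print(is_excluded_naive(candidate, excluded))       # raw prefix comparison
print(is_excluded_normalized(candidate, excluded))  # normalized comparison
```

With the raw strings, "/home/odeda//.cache/" is not a prefix of "/home/odeda/.cache/...", so the path is treated as included; after normalization the exclusion matches.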
Comment 15 tagwerk19 2024-07-04 06:27:48 UTC
Revisiting after a fairly major set of patches, including using systemd/cgroups to limit memory use:
     https://invent.kde.org/frameworks/baloo/-/merge_requests/121
Together with a fix for the initial scan, when run within constrained memory
     https://invent.kde.org/frameworks/baloo/-/merge_requests/148

There are also fixes for the BTRFS issues (which probably didn't exist when the call was opened but were affecting OpenSUSE by 2021...)
    https://invent.kde.org/frameworks/baloo/-/merge_requests/131
and cherrypicked for KF5
    https://invent.kde.org/frameworks/baloo/-/merge_requests/169

I'll set this to "Waiting for Info" in case anyone wants to keep the issue open....
Comment 16 Oded Arbel 2024-07-04 16:09:25 UTC
(In reply to tagwerk19 from comment #15)

On my system (KDE Neon testing, Plasma 6.1.3) system monitor reports baloo_file at 505MB, while systemd has this to say:

     Memory: 395.0M (high: 512.0M available: 116.9M)

I still think that's a lot for an idling indexer, but in my day to day (especially as I have a new beefy machine that laughs at applications taking a mere 0.5GB of RAM 😜) I am no longer troubled by baloo_file behavior. I'm not closing this report as it isn't mine, but you can chalk me up at the "satisfied enough" column.
Comment 17 tagwerk19 2024-07-05 04:56:29 UTC
(In reply to Oded Arbel from comment #16)
>      Memory: 395.0M (high: 512.0M available: 116.9M)
> 
> I still think that's a lot for an idling indexer, but in my day to day
> (especially as I have a new beefy machine that laughs at applications taking
> a mere 0.5GB of RAM 😜)
Maybe on your beefy machine, there's no memory pressure so Baloo's not having to release pages :-)
Comment 18 Michael 2024-07-11 02:31:01 UTC
From this thread, I checked my Baloo index size, which was 19GB(!) and decided that I didn't need full text search, just file name search capabilities. So I nuked it with:

balooctl6 purge

And now my index size is a comfortable 95MB and I don't have mysterious CPU bursts and my laptop fan isn't kicking in after I download a pdf.

Going forward, I am not enabling full text search on new Kubuntu installations that I set up for friends and clients until Baloo's index is tamed.
Comment 19 Bug Janitor Service 2024-07-26 03:46:08 UTC
Dear Bug Submitter,

This bug has been in NEEDSINFO status with no change for at least
15 days. Please provide the requested information as soon as
possible and set the bug status as REPORTED. Due to regular bug
tracker maintenance, if the bug is still in NEEDSINFO status with
no change in 30 days the bug will be closed as RESOLVED > WORKSFORME
due to lack of needed information.

For more information about our bug triaging procedures please read the
wiki located here:
https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging

If you have already provided the requested information, please
mark the bug as REPORTED so that the KDE team knows that the bug is
ready to be confirmed.

Thank you for helping us make KDE software even better for everyone!
Comment 20 Bug Janitor Service 2024-08-10 03:46:44 UTC
🐛🧹 This bug has been in NEEDSINFO status with no change for at least 30 days. Closing as RESOLVED WORKSFORME.