Bug 354636 - baloo_file_extractor consumes an ever-increasing amount of memory after upgrade to frameworks 5.80.0
Status: REOPENED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: Baloo File Daemon
Version: 5.80.0
Platform: Neon Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Vishesh Handa
 
Reported: 2015-10-31 09:09 UTC by Stéphane ANCELOT
Modified: 2021-03-25 07:02 UTC
CC: 7 users



Attachments
System Monitor showing baloo processes (50.68 KB, image/png)
2021-03-17 01:37 UTC, Michael
htop showing baloo processes (180.73 KB, image/png)
2021-03-17 01:37 UTC, Michael
System Monitor showing baloo processes as a history graph (207.66 KB, image/png)
2021-03-17 01:38 UTC, Michael

Description Stéphane ANCELOT 2015-10-31 09:09:26 UTC
Hi,
For some unexplained reason, my system was slow.

I checked and found that baloo_file_extractor was consuming 1.2 GB of memory!

As a result, Baloo indexing is now disabled on my system (besides, I do not need it).

It should be possible to adjust the memory and indexing settings.

Regards
Steph

Reproducible: Always
Comment 1 Mykola Krachkovsky 2015-11-26 08:05:41 UTC
Same for me (except it can eat more than 3 GB of memory). I believe that bug https://bugs.kde.org/show_bug.cgi?id=332421 should be reopened.
Comment 2 Vishesh Handa 2015-12-14 22:48:28 UTC
I'm afraid a generic "consumes too much memory" doesn't give us much information on how to fix this. This bug is very specific to a kind of file which was being indexed. It could even be a bug in the underlying library used to fetch the metadata from the file.

Please reopen the bug if you're willing to provide more information.

Relevant info which could be useful -
1. Try reproducing the issue with a fresh index (`balooctl disable && balooctl enable`); that is a good place to start.
2. Try excluding some folders which are being indexed, and possibly try and track the file down.
Comment 3 Stéphane ANCELOT 2015-12-15 07:49:26 UTC
Maybe you are not using your computer to store files...

Take a whole directory (around 1 GB or so) of files: pictures (raw photos, high- and low-resolution JPEGs), SVG files, LibreOffice files, C/C++ development files (e.g. the Android source code), and so on...
Add a Thunderbird IMAP account with more than 1000 messages.
There are lots of generated files that we cannot control on our computers.
I mean a classic developer computer...
You may not have tried with 1 GB or more of files.
This should easily trigger some problems.
Comment 4 Mykola Krachkovsky 2015-12-17 07:37:20 UTC
(In reply to Vishesh Handa from comment #2)
> 2. Try excluding some folders which are being indexed, and possibly try and
> track the file down.
This doesn't help. Even disabling indexing of the entire home directory has no effect on baloo_file_extractor's memory consumption.

> 1. Try reproducing the issue with a fresh index (balooctl disable &&
> balooctl enable) and a good way to start.
This seems to help; at least now baloo_file_extractor consumes a reasonable amount of RAM.
So at the moment it looks like a bug involving old index files. I'll send more information if it starts to eat memory again.
Comment 5 Michael 2021-03-17 01:34:38 UTC
This is an old bug, but after the recent upgrade to Frameworks 5.80.0, Baloo gobbles up a tremendous amount of memory as it re-indexes all files in the $HOME folder. It doesn't appear to free memory as it indexes. My system becomes laggy and unresponsive.


STEPS TO REPRODUCE

1. You may need a $HOME directory with a lot of files. My personal case is 318GB of data, ~1M files, ~76K subdirectories, of all types of documents, audio, video.
2. Upgrade to Frameworks 5.80.0, reboot. 
3. Upon logging in, Baloo will want to re-index all files. Use "System Monitor" to observe baloo_file_extractor and baloo_file. 


OBSERVED RESULT

Notice that the "Shared Memory" attribute of baloo_file_extractor continuously rises, neither staying steady nor falling. The "Memory" attribute is similarly high at 1 GB. In my case, Shared Memory easily gets to 3.4 GB and my swap file grows to 5.4 GB, with no other apps running.


EXPECTED RESULT

Memory usage of Baloo should not keep rising and affecting the swap file. Memory usage should be constant.


SOFTWARE/OS VERSIONS

Operating System: KDE neon 5.21
KDE Plasma Version: 5.21.2
KDE Frameworks Version: 5.80.0
Qt Version: 5.15.2
Kernel Version: 5.4.0-67-generic
OS Type: 64-bit
Graphics Platform: X11
Memory: 7.7 GiB of RAM



ADDITIONAL INFORMATION

I'm attaching some screenshots of System Monitor and htop. Notice the memory, CPU and swap usage. They all are very high and as a side-effect, my laptop is not responsive. I had to let it run overnight to finish indexing, but even then, when I rebooted, it wanted to re-index everything again(!).
Comment 6 Michael 2021-03-17 01:37:06 UTC
Created attachment 136761 [details]
System Monitor showing baloo processes
Comment 7 Michael 2021-03-17 01:37:48 UTC
Created attachment 136763 [details]
htop showing baloo processes
Comment 8 Michael 2021-03-17 01:38:39 UTC
Created attachment 136764 [details]
System Monitor showing baloo processes as a history graph
Comment 9 Oded Arbel 2021-03-22 10:26:27 UTC
For me baloo has been acting up every now and then, and I'd like to finally get to the bottom of this. The behavior is currently catatonic and after posting this issue I will kill baloo but not change any configuration or data, so the behavior can be reproduced if someone wants to continue research on this issue.

1. baloo_file_extractor takes a lot of CPU and memory. Here's its line in htop:
----8<----
PID   USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
74566 odeda     39   19  257G 10.1G 9006M S 102. 32.2  8h02:42 /usr/bin/baloo_file_extractor
----8<----

(It's a 4-core system; the CPU usage looks like a single thread trying to take up an entire core but being slowed down a bit by IO on my fast NVMe, and the "over 100%" is a sampling artifact in htop.)

2. `balooctl monitor` shows almost no activity, and from time to time bursts of a couple dozen entries that look like this:

----8<----
Indexing: /home/odeda/.cache/mozilla/firefox/i1m74zv1.default/cache2/entries/CE2BB927E036CFCEE27E7795DFB198E7C41A14B6: Ok
----8<----

It should not be indexing `~/.cache` as `~/.config/baloofilerc` has this:

exclude folders[$e]=$HOME/.cache/,$HOME/mnt/,$HOME/snap/,[and a few other things]

The weird excluded-folder behavior may have something to do with the fact that I have a trailing slash on my $HOME:

----8<----
$ balooctl config show excludeFolders
kf.baloo: Folder cache: std::vector("/home/odeda//.cache/": excluded, "/home/odeda//snap/": excluded, "/home/odeda//mnt/": excluded, "/home/odeda/": included)
/home/odeda//.cache/
/home/odeda//snap/
/home/odeda//mnt/
----8<----

3. The index file is huge - about 19GB, which doesn't make a lot of sense to me. `balooctl indexSize` has this to say:

----8<----
File Size: 18.75 GiB
Used:      948.13 MiB

           PostingDB:       2.93 GiB   316.627 %
          PositionDB:      85.44 MiB     9.011 %
            DocTerms:       1.39 GiB   149.920 %
    DocFilenameTerms:     152.72 MiB    16.107 %
       DocXattrTerms:       8.39 MiB     0.885 %
              IdTree:      35.69 MiB     3.764 %
          IdFileName:     175.18 MiB    18.476 %
             DocTime:      92.85 MiB     9.793 %
             DocData:      43.49 MiB     4.587 %
   ContentIndexingDB:     448.00 KiB     0.046 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:      26.48 MiB     2.793 %
----8<----

and to that I can only say "wahhh?!?!?"

Here's also `balooctl status`:

----8<----
Baloo File Indexer is running
Indexer state: Indexing file content
Total files indexed: 2,103,903
Files waiting for content indexing: 6,832
Files failed to index: 0
Current size of index is 18.75 GiB
----8<----
Comment 10 Oded Arbel 2021-03-22 11:16:35 UTC
> 3. The index file is huge - about 19GB, which doesn't make a lot of sense to
> me. `balooctl indexSize` has this to say:
> 
> ----8<----
> File Size: 18.75 GiB
> Used:      948.13 MiB
> 
>            PostingDB:       2.93 GiB   316.627 %
>           PositionDB:      85.44 MiB     9.011 %
>             DocTerms:       1.39 GiB   149.920 %
>     DocFilenameTerms:     152.72 MiB    16.107 %
>        DocXattrTerms:       8.39 MiB     0.885 %
>               IdTree:      35.69 MiB     3.764 %
>           IdFileName:     175.18 MiB    18.476 %
>              DocTime:      92.85 MiB     9.793 %
>              DocData:      43.49 MiB     4.587 %
>    ContentIndexingDB:     448.00 KiB     0.046 %
>          FailedIdsDB:            0 B     0.000 %
>              MTimeDB:      26.48 MiB     2.793 %
> ----8<----
> 
> and to that I can only say "wahhh?!?!?"

After reviewing the code at https://github.com/KDE/baloo/blob/master , I'm more befuddled by the above numbers:

1. "Used" is `DatabaseSize.expectedSize`
2. The percentages are computed by 100 * "entry size" / "Used", so the 316% makes sense as it is larger than "Used".
3. `DatabaseSize.expectedSize` is calculated (src/engine/transaction.cpp:474) by adding up the sizes of all of the entries listed! So it cannot be smaller than the sum of its parts, unless one of the parts is negative - which it can't be, as the sizes are of type `size_t`, which - unless something really weird is going on on the build server - should be unsigned long int.

There's something about page sizes, but that isn't relevant to the above calculation, which seems to suggest that a/(a+b) > 1 where both a and b are non-negative integers.
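The inconsistency can be checked with a few lines of arithmetic on the values reported by `balooctl indexSize` above (a sketch; the percentage comes out slightly different from the reported 316.627 % because balooctl rounds the displayed sizes):

```python
# Sanity check of the `balooctl indexSize` figures reported above.
# If "Used" really were the sum of the per-database sizes, no single
# database could exceed 100% of it. The reported values contradict that.

GIB = 1024 ** 3
MIB = 1024 ** 2

used = 948.13 * MIB        # "Used" as reported
posting_db = 2.93 * GIB    # PostingDB as reported
doc_terms = 1.39 * GIB     # DocTerms as reported

# Reproduce the reported percentage: 100 * entry_size / used
posting_pct = 100 * posting_db / used
print(f"PostingDB: {posting_pct:.1f}%")   # ~316%, matching the report

# PostingDB and DocTerms alone already exceed "Used", so "Used"
# cannot be the sum of all the parts.
print(posting_db + doc_terms > used)      # True
```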

BTW - here's the result of running the `mdb_stat` tool from lmdb-utils on the baloo index:

----8<----
$ mdb_stat -af <path-to-index-db>
Freelist Status
  Tree depth: 2
  Branch pages: 1
  Leaf pages: 41
  Overflow pages: 5046
  Entries: 3253
  Free pages: 2566315
Status of Main DB
  Tree depth: 1
  Branch pages: 0
  Leaf pages: 1
  Overflow pages: 0
  Entries: 12
Status of docfilenameterms
  Tree depth: 4
  Branch pages: 315
  Leaf pages: 38726
  Overflow pages: 0
  Entries: 2104603
Status of docterms
  Tree depth: 4
  Branch pages: 633
  Leaf pages: 79407
  Overflow pages: 284028
  Entries: 2103699
Status of documentdatadb
  Tree depth: 3
  Branch pages: 90
  Leaf pages: 11012
  Overflow pages: 38
  Entries: 664790
Status of documenttimedb
  Tree depth: 3
  Branch pages: 187
  Leaf pages: 23555
  Overflow pages: 0
  Entries: 2111124
Status of docxatrrterms
  Tree depth: 3
  Branch pages: 21
  Leaf pages: 2040
  Overflow pages: 86
  Entries: 31253
Status of failediddb
  Tree depth: 0
  Branch pages: 0
  Leaf pages: 0
  Overflow pages: 0
  Entries: 0
Status of idfilename
  Tree depth: 4
  Branch pages: 363
  Leaf pages: 44411
  Overflow pages: 0
  Entries: 2120309
Status of idtree
  Tree depth: 3
  Branch pages: 52
  Leaf pages: 6960
  Overflow pages: 2118
  Entries: 223613
Status of indexingleveldb
  Tree depth: 3
  Branch pages: 3
  Leaf pages: 49
  Overflow pages: 0
  Entries: 5471
Status of mtimedb
  Tree depth: 3
  Branch pages: 42
  Leaf pages: 6719
  Overflow pages: 0
  Entries: 2111124
Status of positiondb
  Tree depth: 4
  Branch pages: 6657
  Leaf pages: 735531
  Overflow pages: 328761
  Entries: 42876611
Status of postingdb
  Tree depth: 4
  Branch pages: 6181
  Leaf pages: 657348
  Overflow pages: 105167
  Entries: 45851508
----8<----
Comment 11 tagwerk19 2021-03-22 21:10:15 UTC
(In reply to Oded Arbel from comment #10)
> $ mdb_stat -af <path-to-index-db>
> Freelist Status
>   ...
>   Free pages: 2566315
If it says 2566315 free pages (and a page is 4K?), that's a lot of space in the file not being used.

Have you tried copying the index with mdb_copy?

I've just tried
    mdb_copy -n -c index index.copy
It certainly seems to think for a while, but index.copy ended up smaller by more or less the size of the free pages.
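The back-of-the-envelope arithmetic: assuming LMDB's default 4 KiB page size (the actual value can be confirmed with `mdb_stat -e`), the free-page count reported above accounts for roughly half of the 18.75 GiB file:

```python
# Rough size of the reclaimable space implied by mdb_stat's
# "Free pages: 2566315", assuming the default LMDB page size of 4 KiB.
PAGE_SIZE = 4096          # bytes; default LMDB page size (assumption)
free_pages = 2566315      # from the mdb_stat output above

reclaimable_gib = free_pages * PAGE_SIZE / 1024 ** 3
print(f"~{reclaimable_gib:.2f} GiB reclaimable")  # ~9.79 GiB of 18.75 GiB
```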
Comment 12 Oded Arbel 2021-03-23 05:14:14 UTC
(In reply to tagwerk19 from comment #11)
> (In reply to Oded Arbel from comment #10)
> > $ mdb_stat -af <path-to-index-db>
> > Freelist Status
> >   ...
> >   Free pages: 2566315
> If it says 2566315 free pages (and a page is 4K?), that's a lot of space in
> the file not being used.
> 
> Have you tried copying the index with mdb_copy?
> 
> I've just tried
>     mdb_copy -n -c index index.copy
> It certainly seems to think for a while but the index.copy was smaller by
> 'more or less' the count of the free pages.

Shouldn't baloo "auto trim" the index by itself? This is not something a user would know to do. Also - doesn't explain the weird percentages.
Comment 13 tagwerk19 2021-03-23 12:59:52 UTC
(In reply to Oded Arbel from comment #12)
> Shouldn't baloo "auto trim" the index by itself? This is not something a
> user would know to do. Also - doesn't explain the weird percentages.
I'm reading
    http://www.lmdb.tech/doc/
It looks like once the database has grown, it does not shrink. Free pages are, however, reused. The question is whether this has an impact on performance...
Comment 14 tagwerk19 2021-03-25 07:02:34 UTC
(In reply to Oded Arbel from comment #9)
> The weird excluded folder behavior may has something to do with the fact I
> have a trailing slash on my $HOME:
> 
> ----8<----
> $ balooctl config show excludeFolders
> kf.baloo: Folder cache: std::vector("/home/odeda//.cache/": excluded,
> "/home/odeda//snap/": excluded, "/home/odeda//mnt/": excluded,
> "/home/odeda/": included)
> /home/odeda//.cache/
> /home/odeda//snap/
> /home/odeda//mnt/
> ----8<----

Oooh. Indeed.

If I "bend things" so I have a trailing slash in my $HOME, the include/exclude folders lines (for subfolders) in baloofilerc stop working.

If I include
    folders[$e]=$HOME
then a
    exclude folders[$e]=$HOME/.cache/
doesn't work

If I want to index a set of subfolders,
    folders[$e]=$HOME/Documents/,$HOME/Music/,$HOME/Pictures/,$HOME/Videos/
doesn't work.
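A minimal sketch of the likely failure mode (illustrative only, not Baloo's actual matching code): if configured folders are compared against file paths by plain string prefix, the double slash produced by a trailing-slash $HOME breaks the match, while normalizing the paths first would make it work.

```python
import os.path

home = "/home/odeda/"          # $HOME with a trailing slash
exclude = home + "/.cache/"    # yields "/home/odeda//.cache/"
indexed = ("/home/odeda/.cache/mozilla/firefox/i1m74zv1.default"
           "/cache2/entries/CE2BB927E036CFCEE27E7795DFB198E7C41A14B6")

# A naive string-prefix exclusion check fails on the double slash,
# so the file gets indexed anyway.
print(indexed.startswith(exclude))  # False

# Normalizing the configured path first collapses "//" and fixes it.
norm_exclude = os.path.normpath(exclude)          # "/home/odeda/.cache"
print(indexed.startswith(norm_exclude + "/"))     # True
```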

It's not going to catch many people but it's probably worth reporting as a separate bug.