Bug 380456

Summary:	Suspected memory leak in baloo_file_extractor
Product:	[Frameworks and Libraries] frameworks-baloo	Reporter:	Gaël de Chalendar (aka Kleag) <kleagg>
Component:	Baloo File Daemon	Assignee:	Pinak Ahuja <pinak.ahuja>
Status:	REPORTED ---
Severity:	major	CC:	adam.m.fontenot+kde, allandk78, anohigisavay, d, info, joh82875, jtiemer, ottwolt, postix, reuben_p, stefan.bruens, tagwerk19
Priority:	NOR
Version First Reported In:	5.52.0
Target Milestone:	---
Platform:	Arch Linux
OS:	Linux
See Also:	https://bugs.kde.org/show_bug.cgi?id=437754 https://bugs.kde.org/show_bug.cgi?id=400704
Latest Commit:		Version Fixed/Implemented In:
Sentry Crash Report:
Attachments:	Upon killing baloo_file_extractor, I suddenly have a lot more free memory. attachment-8819-0.html pdftotext results from https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC

Description Gaël de Chalendar (aka Kleag) 2017-06-02 08:09:28 UTC

Overview:

In top, I see the memory usage of baloo_file_extractor augmenting continuously during initial indexation:

    21775 gael      39  19  0,252t 5,299g 4,244g R  97,0 45,5  75:22.15 baloo_file_extr

Steps to Reproduce: 

1. Install KDE Neon on a machine with several thousand files, for example a development machine

2. Let baloo index for several hours

Actual Results: 

The indexation is progressing as the number of indexed files reported by balloctl status is getting higher:

    balooctl status
    Baloo File Indexer is running
    Indexer state: Inactif
    Indexed 361894 / 1513487 files
    Current size of index is 6,28 Gio

The memory usage progress

Expected Results: 
The memory usage should stay quasi constant. Problematic files should be ignored.

Build Date & Platform: 

Any KDE Neon since first install several weeks ago, current:
Package: baloo-kf5
Version: 5.34.0-0neon+16.04+build31

Comment 1 DDR 2018-01-28 05:34:47 UTC

Created attachment 110170 [details]
Upon killing baloo_file_extractor, I suddenly have a lot more free memory.

Comment 2 DDR 2018-01-28 05:42:25 UTC

Comment on attachment 110170 [details]
Upon killing baloo_file_extractor, I suddenly have a lot more free memory.

baloo_file_extractor always seems to use about 16gb of memory, allocated fairly quickly after I start my computer. Commands such as ` balooctl index * ` are unresponsive until I've killed the process.

I'm running Ubuntu 17.10, which is up-to-date as today. (2018-01-27)

I don't think the index is an issue, even if it was held entirely memory it wouldn't account for half the problem.
$ balooctl indexSize
Actual Size: 6.80 GiB
Expected Size: 5.04 GiB

When the memory usage is high (before the cliff in the attached image):
$ balooctl status
Baloo File Indexer is running
^C

After I kill baloo_file_extractor (after the cliff in the attached image):
$ balooctl status
Baloo File Indexer is running
Indexer state: Indexing file content
Indexed 356513 / 374337 files
Current size of index is 11.08 GiB

Let me know if I can provide any more information.

Thanks!

Comment 3 Michael Heidelbach 2018-02-01 12:21:15 UTC

(In reply to Gaël de Chalendar (aka Kleag) from comment #0)
> 1. Install KDE Neon on a machine with several thousand files, for example a
> development machine

baloo is having a hard time indexing plain text, because there are so many terms to extract. Also the backend database is memory based so I would expect memory consumption to rise during the process.
Please report memory usage when indexing is done. I'm really curious to see that.

>     Baloo File Indexer is running
>     Indexer state: Inactif
>     Indexed 361894 / 1513487 files
This is strange: There are a lot of files left to be indexed, but the indexer itself is idle?
Did you kill it? Probably not. I've encountered this many times. 
Anyway this behaviour definitely is worth scrutinizing. I'll do it when I'm more familiar with baloo's code.

For the time being, occasionally it helps to restart baloo with
$ balooctl stop
ensure baloo_file and baloo_file_extractor are not running
$ balooctl start

>     Current size of index is 6,28 Gio
With an index of that size searching might be a little slow. And your even half-way done :)
Not sure, but I have the feeling baloo wasn't designed for this and you're overburdening it.
I'm just trying to imagine what will happen when you enter 'const' in KRunner/Milou.

Comment 4 Michael Heidelbach 2018-02-01 12:35:08 UTC

(In reply to DDR from comment #2)
Commands such as ` balooctl index
> * ` are unresponsive until I've killed the process.
Please clarify 'unresponsive': Did you have to Ctrl-C?
You did this while indexing was in progress. 
$ balooctl index *
is probably waiting for indexing to finished before queing another batch.
And even then most likely you'll only get a lot of 'indexing done' messages.


> Baloo File Indexer is running
> Indexer state: Indexing file content
> Indexed 356513 / 374337 files
Please keep your cool, let indexing finish in peace. It's nearly done :-)
What kind of files are indexing? See Comment #3.
> 
> Let me know if I can provide any more information.
> 
> Thanks!

Comment 5 DDR 2018-03-07 09:49:32 UTC

OK, so I have just discovered the magic of ls -la /proc/1234/fd, where 1234 is the pid of baloo_file_extractor. 😎

baloo_file_extractor was busy on a 1.5GiB text file, production-aria-tables.sql, and then got stuck on its backup. I added these files to the ignore list, in File Search — System Settings, and the indexer has gotten on with life and is indexing the last few files it needs to. Unfortunately, as the file is a database dump of mlpforums.com, I cannot share it for reproduction due to confidentiality issues. Perhaps a partial dump of the kde bugs database would suffice for that purpose.

Comment 6 DDR 2018-03-07 10:19:01 UTC

> Please report memory usage when indexing is done. I'm really curious to see that.
About 1.1GiB, ~5% of the available system memory. Very reasonable.

> Did you kill it? Probably not. I've encountered this many times.
No, not as of the report. I did shortly after - it was that, or it killed me by swapping anything useful to disk.

> Please clarify 'unresponsive': Did you have to Ctrl-C?
Yes. I think balooctl it was waiting for baloo_file_extractor to provide some information, but the extractor never would. I think it was busy extracting. I don't have a way to ctrl-c file extractor, but I think when I send the equivalent signal it shuts down just fine. ("End Process" in system monitor.)

> With an index of that size searching might be a little slow. And your even half-way done :)
> Not sure, but I have the feeling baloo wasn't designed for this and you're overburdening it.

Searching is still lightning fast. It seems it was designed very well in that regard.
I was definitely overburdening it. I feel it really should have known better than to try to index a tremendous plain-text file, though. It is enthusiastic, it bit off significantly more than it could chew. The actual search index for the forum the database dump was from takes over a week to rebuild on the server, I imagine the more generalised search tool would be absolutely doomed in that endeavour. That, and a week of solid uptime is quite rare for me.


> I'm just trying to imagine what will happen when you enter 'const' in  KRunner/Milou.
Would Dolphin's ctrl-f suffice? It's up to 1158 folders and 115308 files. Somewhat amazingly, although the search results took a few minutes to populate, Dolphin itself is still perfectly responsive and I can scroll through the files just fine. Typing to select a file works both perfectly and instantly. Memory use remaned unremarkably low throughout the whole process, and didn't really change when I exited Dolphin.

All 374579 files have now finished indexing. The current size of the index is 11.08 GiB.

Comment 7 Allan Andersen 2018-09-11 07:11:16 UTC

Same issue here. Really annoying using cpu and lots of memory.

balooctl status
Baloo File Indexer is running
Indexer state: Idle
Indexed 98426 / 157132 files
Current size of index is 7,81 GiB

Process: baloo_file_extractor is using 9.2 GiB memory!

Its a developer machine with several thousand files.

Linux aa-Precision-3510 4.15.0-33-generic #36-Ubuntu SMP Wed Aug 15 16:00:05 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 8 Johannes Tiemer 2018-09-27 09:36:27 UTC

I have the same issue on a current Arch Linux (baloo 5.50, kernel 4.18.9), I only noticed it after the last upgrade a week ago, due to the machine having a long uptime. lower versions of baloo ran just fine and I fail to remember making huge changes to the data to be indexed. 

Component/Version: baloo_file_extractor/5.50
Platform: Arch Linux
Kernel: 4.18.9

Issue: After KDE Startup baloo_file_extractor uses ever more RAM until it stalls the machine to a freeze once it uses all available RAM. Not even switching to another tty to kill the process is possible anymore.

Steps to reproduce: Start Plasma session and wait while watching the RAM use applet or htop. It takes around 3-4 minutes after startup to fill what is available of 16GB RAM.

Remedy: Sending SIGTERM end baloo_file_extractor and frees the RAM. baloo doesn't restart it.

Context Info: I found some sources who claim that .vdi files are a problem for baloo, I consequently excluded my VM directory from the search (and some others). It did not change anything about the issue.

[user@machine ~]$ balooctl status
Die Baloo-Dateiindizierung läuft
Indizierungsstatus: Dateiinhalt wird indiziert
91496/91624 Dateien indiziert
Der aktuelle Index hat eine Größe von 31,19 GiB

[user@machine ~]$ balooctl indexSize
Actual Size: 31,19 GiB
Expected Size: 14,32 GiB

           PostingDB:     616,27 MiB    25.946 %
          PositionDB:       1,96 GiB    84.367 %
            DocTerms:       2,37 GiB   101.981 %
    DocFilenameTerms:      11,44 MiB     0.482 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       1,48 MiB     0.062 %
          IdFileName:       8,32 MiB     0.350 %
             DocTime:       3,87 MiB     0.163 %
             DocData:      12,64 MiB     0.532 %
   ContentIndexingDB:      12,00 KiB     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       3,34 MiB     0.141 %

[user@machine ~]$ uname -a
Linux pica 4.18.9-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 19 21:19:17 UTC 2018 x86_64 GNU/Linux

[user@machine ~]$ pikaur -Qi baloo
Name                     : baloo
Version                  : 5.50.0-1
Beschreibung             : A framework for searching and managing metadata
Architektur              : x86_64
URL                      : https://community.kde.org/Frameworks
Lizenzen                 : LGPL
Gruppen                  : kf5
Stellt bereit            : Nichts
Hängt ab von             : kfilemetadata  kidletime  kio  lmdb
Optionale Abhängigkeiten : qt5-declarative: QML bindings [Installiert]
Benötigt von             : baloo-widgets  gwenview  plasma-desktop  plasma-mediacenter
Optional für             : plasma-workspace
In Konflikt mit          : Nichts
Ersetzt                  : Nichts
Installationsgröße       : 2,41 MiB
Packer                   : Antonio Rojas <arojas@archlinux.org>
Erstellt am              : Mo 03 Sep 2018 16:26:53 CEST
Installiert am           : Mo 10 Sep 2018 02:47:54 CEST
Installationsgrund       : Installiert als Abhängigkeit für ein anderes Paket
Installations-Skript     : Nein
Verifiziert durch        : Signatur

Comment 9 Johannes Tiemer 2018-09-27 09:38:34 UTC

D'oh, forgot to mention: CPU load is 100% on one single core until I kill the process.

I'll remember to look into what it's indexing when I boot next time.

Comment 10 Johannes Tiemer 2018-09-28 17:59:34 UTC

I checked. It took baloo_file_extractor 9 minutes (according to uptime) to fill 13GiB RAM, where it pretty much exclusively (as far as I could tell) operated on my Archive disk, which, among others, contains lots of txt/csv-files (large datasets, probably a little below 100GiB) and my email backups from thunderbird which are a mess of many tens of thousands of small files.

I blacklisted a part of it now and will report back once I find out something new about its behaviour.

Comment 11 Johannes Tiemer 2018-09-29 21:37:20 UTC

After excluding the above mentioned folders with lots of small files, baloo stopped its memory eating behavior. Scanning for file numbers and sizes and then warning might be a simple safeguard maybe?

Comment 12 Johannes Tiemer 2018-12-04 12:40:05 UTC

Hey everybody,
since installing the update to version 5.52 on my computer (Arch current), the baloo file indexer began showing unwanted behaviour again. All directories with large text files were blacklisted beforehand.
The behaviour found is as before: RAM usage explodes within roughly minute to fill all of 16GB, since I have no swap, it then freezes my machine by clogging the RAM … playing "nice" with RAM may be a thing too.

What I found with "balooctl monitor"
- it seemed to plainly ignore that it should _not_ index the windows partition that I have mounted into my home for convenience
- it seems to begin expand in RAM while reporting that it is checking for "checking for obsolete index entries"

I let baloo completely recreate its index over a few days when I realised it is misbehaving again. See my above comment for earlier indexSize:
---
$ balooctl indexSize
Actual Size: 32,88 GiB
Expected Size: 22,85 GiB

           PostingDB:       2,31 GiB    81.336 %
          PositionDB:       1,48 GiB    51.905 %
            DocTerms:       3,71 GiB   130.294 %
    DocFilenameTerms:      57,95 MiB     1.989 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       7,63 MiB     0.262 %
          IdFileName:      40,62 MiB     1.394 %
             DocTime:      18,80 MiB     0.645 %
             DocData:      53,23 MiB     1.827 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:      10,35 MiB     0.355 %
---

Comment 13 Savor d'Isavano 2019-10-17 11:13:26 UTC

As of baloo 5.63.0, the issue persists.

Memory consumption increases by ~2MB/s. CPU consumption is also considerable.

Luckily I noticed the CPU fan spinning noisily and disabled baloo before it was too late to save my work (>9GB memory at the time).

See this screencast:
https://vimeo.com/366988108

Comment 14 Adam Fontenot 2021-10-14 06:03:40 UTC

This is still a really common issue. I don't know that I've ever spoken to someone who uses KDE + Baloo with the out of the box settings who hasn't run into it. I mean, just for a starting sample:

https://old.reddit.com/r/kde/comments/j77j16/can_we_please_have_kde_disable_baloo_by_default/
https://old.reddit.com/r/kde/comments/kzdoux/baloo_should_be_suspended_when_the_system_is_in/
https://old.reddit.com/r/kde/comments/lgg0su/how_is_baloo_doing_these_days/
https://old.reddit.com/r/kde/comments/o6w0ly/whats_wrong_with_baloo/
https://old.reddit.com/r/kde/comments/pc4wk1/baloo_file_extr_extreme_cpu_usage/

This is just the first five issues I could find with people talking about this *exact* problem, but the list goes on and on. The oldest complaint in that list is only a year old.

More than a complaint, I have a proposal: it is completely unreasonable for a file indexer to ever make a user's system unusable. Any time it takes baloo_file_extractor more than 30 seconds to pull the text out of a file, or it starts using more than 10% of the user's total RAM, it should be instantly killed and the file blacklisted. Only the file name (not contents) should be available to search results.

Moreover, some kind of heuristic is desperately needed to tell Baloo that a file can't be usefully indexed. Baloo is happy to use a ton of memory and hard disk space to index files that are - for most purposes - random binary data.

Just as an example: I have a PDF that contains no meaningful text at all (it's a plot automatically generated from some technical data). It's only 20 MB. Yet baloo_file_extractor hung on this file for a *long* time, probably more than half an hour, with RAM use up over 1 GB. It continued using 100% of one CPU core despite the fact that I was trying to run a full screen game at the time.

Comment 15 DDR 2021-10-14 06:26:08 UTC

Created attachment 142418 [details]
attachment-8819-0.html

I second this. It's a bit absurd to just run it with no resource limits,
internally or externally.

On Wed., Oct. 13, 2021, 11:04 p.m. Adam Fontenot, <bugzilla_noreply@kde.org>
wrote:

> https://bugs.kde.org/show_bug.cgi?id=380456
>
> Adam Fontenot <adam.m.fontenot+kde@gmail.com> changed:
>
>            What    |Removed                     |Added
>
> ----------------------------------------------------------------------------
>                  CC|
> |adam.m.fontenot+kde@gmail.c
>                    |                            |om
>
> --- Comment #14 from Adam Fontenot <adam.m.fontenot+kde@gmail.com> ---
> This is still a really common issue. I don't know that I've ever spoken to
> someone who uses KDE + Baloo with the out of the box settings who hasn't
> run
> into it. I mean, just for a starting sample:
>
>
> https://old.reddit.com/r/kde/comments/j77j16/can_we_please_have_kde_disable_baloo_by_default/
>
> https://old.reddit.com/r/kde/comments/kzdoux/baloo_should_be_suspended_when_the_system_is_in/
> https://old.reddit.com/r/kde/comments/lgg0su/how_is_baloo_doing_these_days/
> https://old.reddit.com/r/kde/comments/o6w0ly/whats_wrong_with_baloo/
>
> https://old.reddit.com/r/kde/comments/pc4wk1/baloo_file_extr_extreme_cpu_usage/
>
> This is just the first five issues I could find with people talking about
> this
> *exact* problem, but the list goes on and on. The oldest complaint in that
> list
> is only a year old.
>
> More than a complaint, I have a proposal: it is completely unreasonable
> for a
> file indexer to ever make a user's system unusable. Any time it takes
> baloo_file_extractor more than 30 seconds to pull the text out of a file,
> or it
> starts using more than 10% of the user's total RAM, it should be instantly
> killed and the file blacklisted. Only the file name (not contents) should
> be
> available to search results.
>
> Moreover, some kind of heuristic is desperately needed to tell Baloo that a
> file can't be usefully indexed. Baloo is happy to use a ton of memory and
> hard
> disk space to index files that are - for most purposes - random binary
> data.
>
> Just as an example: I have a PDF that contains no meaningful text at all
> (it's
> a plot automatically generated from some technical data). It's only 20 MB.
> Yet
> baloo_file_extractor hung on this file for a *long* time, probably more
> than
> half an hour, with RAM use up over 1 GB. It continued using 100% of one CPU
> core despite the fact that I was trying to run a full screen game at the
> time.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.

Comment 16 Adam Fontenot 2021-11-16 10:44:42 UTC

I actually filed an upstream bug with Poppler for its handling of the specific PDF file I was seeing issues with. https://gitlab.freedesktop.org/poppler/poppler/-/issues/1173

Surprisingly, the Poppler devs say there's nothing wrong with Poppler here (despite the fact that their pdftotext tool hangs for over an hour on this file). That's because the R script which generated it is apparently using the "I" character repeatedly as part of a graph. I don't know why R does that, but it does.

Quoting the dev response:

> whether this bug is fixed or not baloo needs to understand that extracting the 
> text of a pdf file can take forever, and thus give up after X seconds/minutes

Obviously this is not going to correspond to everyone's issues, but it's an interesting example of the point I made:

> it is completely unreasonable for a file indexer to ever make a user's system 
> unusable. Any time it takes baloo_file_extractor more than 30 seconds to pull 
> the text out of a file, or it starts using more than 10% of the user's total 
> RAM, it should be instantly killed and the file blacklisted. Only the file name 
> (not contents) should be available to search results.

So in general, while there *may* be specific bugs with Baloo that need fixing or some crazy files that perhaps "shouldn't" exist, the probable cause of this problem for *most* users is that Baloo simply doesn't give up on trying to index a file when it really, really should.

Comment 17 tagwerk19 2021-11-16 12:36:12 UTC

(In reply to Adam Fontenot from comment #16)
> ... it is completely unreasonable for a file indexer to ever make a user's system 
> unusable. Any time it takes baloo_file_extractor more than 30 seconds to pull 
> the text out of a file, or it starts using more than 10% of the user's total 
> RAM, it should be instantly killed and the file blacklisted. Only the file name 
> (not contents) should be available to search results ...
OOoooo. Ouch!

If you look at htop, you'll see that baloo_file and baloo_file_extractor run with minimum priority. They'll yield to nearly everything that wants a CPU. They should take all the time they need without annoying anything else....

Memory usage is different, baloo "memory maps" the index and pulls pages from disc to memory as needed, they'll be "forgotten" again if the RAM is needed (and the pages have not been modified). You might see that baloo_file / baloo_file_extractor use a lot of memory but that can be "just cache".

The kicker is when indexing and you're building a *large* transaction, that might take a lot of memory (possibly, alas, stretching to swap). If you kill the process before the commit is done, you're condemning yourself to repeat the work.  On a system with Out Of Memory (OOM) protections, you might hit this.

You can see a little of what's happening (the switching between reading the source files and writing the updates to the index) with iotop.

> ... Surprisingly, the Poppler devs say there's nothing wrong with Poppler here
> (despite the fact that their pdftotext tool hangs for over an hour on this
> file). That's because the R script which generated it is apparently using
> the "I" character repeatedly as part of a graph. I don't know why R does
> that, but it does ...
I'm tempted to say that if this is a application generated file with little/no human readable information in it (that happens to be a PDF) it would make sense to have an application specific mimetype for it. Then that can be added to baloo's "exclude filters" list. I suspect though that if the file is generated by a script, that might not be possible.

> So in general, while there *may* be specific bugs with Baloo that need
> fixing or some crazy files that perhaps "shouldn't" exist, the probable
> cause of this problem for *most* users is that Baloo simply doesn't give up
> on trying to index a file when it really, really should.
Baloo does have a mechanism for flagging files as "failed" - "balooctl failed" will list them. I think that needs more love...

Comment 18 Adam Fontenot 2021-11-16 23:23:00 UTC

(In reply to tagwerk19 from comment #17)
> If you look at htop, you'll see that baloo_file and baloo_file_extractor run
> with minimum priority. They'll yield to nearly everything that wants a CPU.
> They should take all the time they need without annoying anything else....
Hmm, even assuming this is true, does the process suspend if the user is on battery? An otherwise idle system consuming 100% of a core for hours on end is sure to annoy the user even if it doesn't interfere with other processes.

I'd also point out that I discovered this issue (after several years of being vaguely aware of "baloo problems") when I saw stuttering in a full screen game. Alt-tabbing to htop showed baloo_file_extractor at 100%. Baloo may in theory yield to other processes, but it didn't prevent me from seeing issues.

> Memory usage is different, baloo "memory maps" the index and pulls pages
> from disc to memory as needed, they'll be "forgotten" again if the RAM is
> needed (and the pages have not been modified). You might see that baloo_file
> / baloo_file_extractor use a lot of memory but that can be "just cache".
If I'm not mistaken, that's just for internal Baloo memory usage, right? In my case, baloo_file_extractor is calling out to an external library (poppler), and that library is consuming an endlessly growing amount of memory (from 1-3 GB before I've killed it). It's probably safe to say that this memory usage is in the form of anonymous mappings which can't be reclaimed. Baloo *must* take that into account and kill the extractor process if it begins affecting system resources.

> I'm tempted to say that if this is a application generated file with
> little/no human readable information in it (that happens to be a PDF) it
> would make sense to have an application specific mimetype for it. Then that
> can be added to baloo's "exclude filters" list. I suspect though that if the
> file is generated by a script, that might not be possible.
In this case, it's a graph of some scientific data. Plotting scientific data to PDF or SVG (which both can have extractable text) is very common. In any case, it shouldn't be on the user to determine which files are causing problems (I had to use strace!) and exclude them. A file indexer should "just work".

Comment 19 tagwerk19 2021-11-17 09:21:05 UTC

(In reply to Adam Fontenot from comment #18)
> Hmm, even assuming this is true, does the process suspend if the user is on
> battery? An otherwise idle system consuming 100% of a core for hours on end
> is sure to annoy the user even if it doesn't interfere with other processes.
I'm pretty confident about the CPU priority and I know that baloo is aware that it is on battery (and avoids content indexing). What happens in your case, I'm afraid I don't know.

> I'd also point out that I discovered this issue (after several years of
> being vaguely aware of "baloo problems") when I saw stuttering in a full
> screen game. 
I would still suspect memory use rather than CPU as the underlying reason. There are situations where baloo is building a large transaction and requires lots of memory, there's a summary starting https://bugs.kde.org/show_bug.cgi?id=400704#c31. It's quite possible for systems to "hit the mud" in these cases.

> If I'm not mistaken, that's just for internal Baloo memory usage, right?
I'd say yes, the cases I've looked at were when indexing large text files and writing the results to the index.

> ... baloo_file_extractor is calling out to an external library
> (poppler), and that library is consuming an endlessly growing amount of
> memory (from 1-3 GB before I've killed it). It's probably safe to say that
> this memory usage is in the form of anonymous mappings which can't be
> reclaimed. Baloo *must* take that into account and kill the extractor
> process if it begins affecting system resources.
That's a *lot* of memory for a "pdf to text" conversion 8-]

You see the baloo_file_extractor RAM usage go up during the extraction and not come down when it is finished?

> In this case, it's a graph of some scientific data. Plotting scientific data
> to PDF or SVG (which both can have extractable text) is very common. In any
> case, it shouldn't be on the user to determine which files are causing
> problems (I had to use strace!) and exclude them.
Understood.

Could you see the culprit file in "System Settings > Search" (recent releases of baloo show the progress of the indexing there) or when running "balooctl monitor"?

In your use case, you could save your plots to a folder that was not indexed. Yes, I know, it's shouldn't be up to the user but in this case as a workround...

>  A file indexer should "just work".
Yup,  I think there's general agreement on that :-)

Comment 20 Adam Fontenot 2021-11-18 01:00:01 UTC

(In reply to tagwerk19 from comment #19)
> I would still suspect memory use rather than CPU as the underlying reason.
It's quite possible that you're right about that. I do know the game is sensitive to available memory, possibly because it runs on the internal Intel graphics chip.

> > ... baloo_file_extractor is calling out to an external library
> > (poppler), and that library is consuming an endlessly growing amount of
> > memory (from 1-3 GB before I've killed it). It's probably safe to say that
> > this memory usage is in the form of anonymous mappings which can't be
> > reclaimed. Baloo *must* take that into account and kill the extractor
> > process if it begins affecting system resources.
> That's a *lot* of memory for a "pdf to text" conversion 8-]
Yes, especially for a random 20 MB PDF I didn't even remember existed.

> You see the baloo_file_extractor RAM usage go up during the extraction and
> not come down when it is finished?
I have never been able to leave it for long enough to finish extracting from the file. It's possible I'd even get an out of RAM hang before then. The Poppler devs estimate at least 7 GB of RAM would be needed to extract text from this file. I even tested their pdftotext command on a system with plenty of RAM, and even then the issue is that it simply takes too long. I've left it running for over an hour on this one file before, and never seen it complete.

Moreover, they insist that it's not a bug on their end. The file, in their view, is pathological and the only reasonable solution is not to try to extract text from it. I think I understand that perspective: it's not every day that you come across a PDF with millions of "words" on a single page. So it's on Baloo to bail out if the process takes too long or consumes too much RAM. Here's the bug report I filed with them if you want to follow that conversation: https://gitlab.freedesktop.org/poppler/poppler/-/issues/1173

> Could you see the culprit file in "System Settings > Search" (recent
> releases of baloo show the progress of the indexing there) or when running
> "balooctl monitor"?
Unfortunately, I don't remember. I do remember using lsof and friends to check that it was the only file Baloo had open. I may not have realized at the time that that feature had been added to the Baloo KCM.

Comment 21 tagwerk19 2021-11-23 13:40:25 UTC

Created attachment 143869 [details]
pdftotext results from https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC

(In reply to Adam Fontenot from comment #20)
> ... The file, in their view, is pathological ...
Applying a modicum of patience, running:

    nice -19 pdftotext QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf

took 37 hours on a machine with 16GB memory 8-]

The process gradually ate memory, reaching 10 GB. There wasn't an obvious impact on performance - but I would expect you'd see that bite when reaching the limits/starting to swap.

Attaching the output file - just in case anyone else wants to see the result.

When moving the source file to an indexed folder it was picked up by baloo and indexed by baloo_file_extractor. Similarly 37hrs and 10.1 GB.

Alas wasn't quick enough to notice what happened to the baloo_file_extractor memory usage when the indexing finished - the process terminated (and released memory) when it had nothing more to do

The details of the index records:

    $ balooshow -x Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf
    1546b20000fc01 64513 1394354 Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf [/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf]
            Mtime: 1637335759 2021-11-19T16:29:19
            Ctime: 1637335813 2021-11-19T16:30:13
            Cached properties:
                    Title: R Graphics Output
                    Document Generated By: R 3.6.0
                    Page Count: 1
                    Creation Date: 2019-09-13T11:01:30.000Z

    Internal Info
    Terms: 0 100000 150000 200000 50000 Mapplication Mpdf T5 X15-graphics X15-output X15-r X17-3.6.0 X17-r X18-1 X24-2019-09-13T11:01:30Z a1 a2 b1 b2 c graphics output qagr qchr qkel qpal r vcf − ●
    File Name Terms: Fpdf Fqmvqwhpuqke7retn5f9tisea7
    XAttr Terms:
    generator: 3.6.0 r
    pageCount: 1
    title: graphics output r
    creationDate: 2019-09-13T11:01:30Z

and...

    $ balooshow -x Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt
    140a610000fc01 64513 1313377 Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt [/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt]
            Mtime: 1637519014 2021-11-21T19:23:34
            Ctime: 1637519014 2021-11-21T19:23:34
            Cached properties:
                    Line Count: 4352

    Internal Info
    Terms: 0 100000 150000 200000 50000 Mplain Mtext T5 T8 X20-4352 a1 a2 b1 b2 c qagr qchr qkel qpal vcf − ●
    File Name Terms: Fqmvqwhpuqke7retn5f9tisea7 Ftxt
    XAttr Terms:
    lineCount: 4352

So, for this instance, not a lot of indexable text but the metadata was recognised (in the PDF, it was not extracted to the text) and it was possible to search for the title:

    $ baloosearch "R Graphics Output"

or...

    $ baloosearch title:"R Graphics Output"
    /home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf

I think with enough RAM and patience baloo can cope with even this pathological test case but the the requirement definitely _is_ "enough Ram and patience". It would certainly make sense to be able to say to baloo_file_extractor "give up after 10 minutes" and flag the file as failed.

I'll update Bug 400704, which has become a collection point for these misbehavin' reports. See:

    https://bugs.kde.org/show_bug.cgi?id=400704#c31

and onwards.

Comment 22 Stefan Brüns 2024-03-27 15:13:06 UTC

(In reply to tagwerk19 from comment #21)
> Created attachment 143869 [details]
> pdftotext results from
> https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC
> 
> (In reply to Adam Fontenot from comment #20)
> > ... The file, in their view, is pathological ...
> Applying a modicum of patience, running:
> 
>     nice -19 pdftotext QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf
> 
> took 37 hours on a machine with 16GB memory 8-]
> 
> The process gradually ate memory, reaching 10 GB. There wasn't an obvious
> impact on performance - but I would expect you'd see that bite when reaching
> the limits/starting to swap.

The long runtime is caused by some algorithmically bad implementation, i.e. O(n^2) were e.g. O(n log n) is sufficient. The huge memory footprint is caused by some problematic data arrangement and too greedy pre/overallocation.

I have filed two MRs [1],[2] for poppler, with both applied the extractions runs in ~50 seconds on my 3 year old laptop, with a peak memory consumption of 1.8 GByte.

[1] https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/1514  
[2] https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/1515

Comment 23 tagwerk19 2024-07-07 15:00:46 UTC

(In reply to Adam Fontenot from comment #14)
> ... Any time it takes
> baloo_file_extractor more than 30 seconds to pull the text out of a file, or
> it starts using more than 10% of the user's total RAM ...
For the record, there was a patch here:
    https://invent.kde.org/frameworks/baloo/-/merge_requests/124
that applied a limit through the systemd unit files (you do need to be on a system running systemd). You can see the effect if you run
    systemctl --user status kde-baloo
and keep an eye on the Memory line. The out of the box limit is 512M, it's possible that this is too low and Baloo will swap (you do not want this), you can add an override with
    systemctl --user edit kde-baloo
and add a
    [Service]
    MemoryHigh=25%
Basically it is "choose a number", you have to balance the needs of Baloo and the rest of your system.

> Just as an example: I have a PDF that contains no meaningful text at all
> (it's a plot automatically generated from some technical data). It's only 20
> MB. Yet baloo_file_extractor hung on this file for a *long* time, probably
> more than half an hour, with RAM use up over 1 GB. It continued using 100%
> of one CPU core despite the fact that I was trying to run a full screen game
> at the time.
As per Comment 22, this was fixed in the poppler source (thank you Stefan :-)

I'll set this to "WaitingForInfo", if you still have problems, add a comment and reset

Comment 24 Adam Fontenot 2024-07-08 20:41:37 UTC

With Baloo 6.3.0, I see the following issues, some of which are related to memory use. As I mentioned over on bug 460460, I just re-enabled Baloo to see if the issues I previously experienced with it were fixed.

1. After re-enabling Baloo from the System Settings UI, the UI does update to show the indexer running. Eventually I checked back and indexing appeared to be hung in a partially complete state, but baloo_file was not running.
2. I cleared the index (balooctl purge), and reenabled the indexer (balooctl enable).
3. After doing this, Baloo doesn't seem to hang, but bloats memory excessively (I'm currently seeing over 9 GB used by baloo_file_extractor). The limit set on the kde-baloo service is unmodified from the default (512 MB), so presumably enabling Baloo from the terminal has somehow bypassed the systemd service that is supposed to control it. That seems like a bug that should probably be fixed (but the memory leak is an issue regardless).
4. The index is once again pretty large, over 8 GB and I'm only about halfway done with content indexing. If there is a way to debug the index and see which files are using the most space I would love to provide more details, and this would allow me to selectively exclude certain directories as well.

I'm returning this issue to "reported" because of the memory problem in (3). I can tell from strace that baloo_file_extractor is actively scanning files, so this is *not* an issue where it's gotten hung on one problematic file, and probably indicates memory that has been leaked.

Comment 25 tagwerk19 2024-07-08 22:09:55 UTC

(In reply to Adam Fontenot from comment #24)
> ... The limit set on the kde-baloo service is unmodified from the default (512
> MB), so presumably enabling Baloo from the terminal has somehow bypassed the
> systemd service that is supposed to control it. That seems like a bug that
> should probably be fixed (but the memory leak is an issue regardless) ...
A "balooctl enable" doesn't start the service through systemd so escapes the memory limits, noted in Bug 488178. That needs a bit of love...

Possible that baloo_file was killed OOM, there might be hints in the journal. I tend towards a MemoryHigh limit of 25% rather than 512M.

However, if you got past the https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC PDF, then that's a small success...

> ... If there is a way to debug the index and
> see which files are using the most space I would love to provide more
> details, and this would allow me to selectively exclude certain directories
> as well ...

That's a good question. No, I've no idea whether that's possible. That sort of detail is hidden in the LMDB library...
 
> I'm returning this issue to "reported" because of the memory problem in (3).

Fine!

Comment 26 Adam Fontenot 2024-07-08 23:25:45 UTC

Further update on performance - the indexer is still running. Frequently, content indexing will pause for many minutes at a time, during which baloo_file will use 100% CPU and write to disk continuously for the whole time. The number of indexed files seems to be going down slightly - maybe Baloo is responding badly to me deleting a directory where I built some software?

Much more concerning is the total disk write. I started tracking with iotop when content indexing was about 50% complete (in terms of number of files), and I'm now 66% complete. In that time baloo_file has 505 GB of disk write, and baloo_file extractor has 71 GB of disk write. That's astounding for a content index that's "only" 8.4 GB. SMART says that my disk has 4.6 TB written in its entire history (about 8 months old), and only about 350 GB is in use on this partition right now.

It's entirely possible that half of the total write to my SSD was done *today* by Baloo. Given this I will almost certainly have to leave it disabled in the long term due to concerns about preserving my disk.

Happy to make another bug report for the disk write issue if needed.

Comment 27 Stefan Brüns 2024-07-08 23:39:16 UTC

Note, only the first of the two MRs [1],[2] for poppler has been merged. Unfortunately, progress on the poppler side is really slow lately.

So, until the second one is merged as well, you won't see any noticeable improvement.

[1] https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/1514  
[2] https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/1515

Comment 28 tagwerk19 2024-07-09 06:26:57 UTC

(In reply to Stefan Brüns from comment #27)
> Note, only the first of the two MRs [1],[2] for poppler has been merged.
Ooo...

That's one of the things which I find not *particularly* transparent. You can see (often) when there's been appropriate patch submitted, you sometime see when it's been merged, you don't really know what release it will appear in and you have to remember to check whether your distro has got to that release...

Sorry, that was a rant. Feel free to ignore :-}

Comment 29 tagwerk19 2024-07-09 06:41:26 UTC

(In reply to Adam Fontenot from comment #26)
> ... In that time baloo_file has 505 GB of disk write,
> and baloo_file extractor has 71 GB of disk write ...
Trying to get my mind round that... 

> ... maybe Baloo is responding badly to me deleting a directory where I built some software?
Yes, that could be the reason.

Baloo_file_extractor batches up its work, baloo_file does it when looking for new/changed files, maybe not when responding to iNotify events and I think it commits file by file after deletes.

I think there's an argument for increasing the "40 files" batch size for baloo_file_extractor, that would come with a little more memory use but cut down on the total writes. 

Stefan says that people think they know how Baloo works better than he does, the above is just from observation and guesswork :-)

Comment 30 tagwerk19 2024-07-09 06:46:29 UTC

(In reply to tagwerk19 from comment #29)
> ... I think it commits file by file after deletes ...
... Cross referencing Bug 442453

Comment 31 Adam Fontenot 2024-07-09 14:46:07 UTC

For what it's worth, 15 minutes after I posted my comment about i/o, it had gone up to 600 GB written by baloo_file (another 100 GB), and I stopped it with `balooctl suspend` followed by `balooctl disable`. Unfortunately, as I've gotten used to in the past, Baloo claimed to have suspended, but baloo_file continued using 100% CPU and wrote another 50 GB (!) to disk before I gave up and did `kill -9` as usual. I've given up on testing for the time being.

> I think it commits file by file after deletes.

It's possible that this is related, but I find it hard to believe that any reasonable database format would need to write 150+ GB to disk to delete entries from a database that was only 8 GB, even if you deleted every single entry in the database one at a time. Whatever the reason, it's also very frustrating to see Baloo hang (and not respond to requests to suspend, see above) after a perfectly normal activity like deleting some development files.

(One entirely theoretical possibility is that Baloo is performing "vacuuming" on the LMDB database after every single deletion, i.e. rewriting the database to release the cleared space back to the file system. Needless to say, this would be a terrible idea, so it doesn't seem very likely? But there are variations that seem more sane at first glance, but actually aren't, like vacuuming after every 1000 deletions.)

I'm willing to re-test if someone wants to provide a patch to batch up deletes in baloo_file.

> Possible that baloo_file was killed OOM, there might be hints in the journal. I tend towards a MemoryHigh limit of 25% rather than 512M.

Not clear whether this happened. It appears to have hung ~20 minutes after I started it, but I don't see any direct evidence of a crash. When I checked the next day baloo_file and baloo_file_extractor weren't consuming any CPU and the UI showed progress stuck at ~23%.

This is more of a question, but what is the *intent* behind the 512MB memory limit? I think that's an entirely reasonable upper bound on a file indexer, personally, but I'm not sure what's supposed to happen when indexing some file would cause the indexer to exceed that. It is supposed to intelligently skip the file? Crash and then continue with another file? Hang entirely and no longer make progress?

Comment 32 tagwerk19 2024-07-10 18:17:50 UTC

(In reply to Adam Fontenot from comment #31)
> I've given up on testing for the time being.
That's OK, thank you for your efforts.

> ... I find it hard to believe that any
> reasonable database format would need to write 150+ GB to disk to delete
> entries from a database that was only 8 GB ...
I'm sure its possible to come up with a better explanation; with less handwaving, more detail and probably more accurate, but...

The way that Baloo provides results for searches so quickly is that it jumps to the word in the database and pulls a page from disk that lists all the files that the word appears in. When you index a file, you extract a list of words, look up each word in the database, get the list of files it appears in, insert  this new file (as an ID rather than filename) into the list and save it back. Or rather it saves it in memory, waiting for a commit...

The same process happens in reverse when you delete a file, for each word the list is read, the ID removed and the list written back. For common words, these lists can be *large* (and there's position information to be considered as well so you can look for phrases as well as words)

Baloo_file_extractor sidesteps the problem by dealing with 40 files at once, the list is not read and written back (committed) after each file, the somewhat extended lists are written back after indexing 40 files. Baloo_file ought to do the same for deletes, I think it would make a difference.

> This is more of a question, but what is the *intent* behind the 512MB memory
> limit? I think that's an entirely reasonable upper bound on a file indexer,
> personally, but I'm not sure what's supposed to happen when indexing some file
> would cause the indexer to exceed that. It is supposed to intelligently skip
> the file? Crash and then continue with another file? Hang entirely and no
> longer make progress?
The 512M is a somewhat arbitrary external constraint and is probably OK in the majority of cases. The intent was to stop Baloo competing for memory with the rest of the system. As a technique, it works very well, it's just that the "limit chosen" is too tight in my view. The bugs that arrive here are the tough cases where Baloo really does need more space, and in a lot of these cases, setting a higher limit works.

From my experience, when Baloo approaches the limit, it starts dropping "clean" pages and rereading then when needed. What you see is a lot more reads. When it is indexing and is building a very large transaction, where it cannot drop dirty pages, it can swap (which is bad news) or the kernel responds ever more slowly when Baloo asks for more memory (which is bad news), or eventually the process is killed OOM (which is bad news)

As to whether Baloo knows whether it is hitting the limit, I think not, it is external to the code. Whether it *can* know, that's interesting but I don't know.

> I'm willing to re-test if someone wants to provide a patch to batch up deletes
> in baloo_file.
Thank you!

Comment 33 Adam Fontenot 2024-07-11 17:35:27 UTC

> The way that Baloo provides results for searches so quickly is that it jumps to the word in the database and pulls a page from disk that lists all the files that the word appears in. When you index a file, you extract a list of words, look up each word in the database, get the list of files it appears in, insert  this new file (as an ID rather than filename) into the list and save it back.

This is firmly in the realm of speculative feature requests, but this makes deletes sound extremely expensive... wouldn't it be much cheaper to save a hashset of deleted fileIDs, and then remove search results with these IDs before returning them? You could then clean up the database on a regular basis, say once a month, or when triggered by a balooctl command. This deleted files hashset would be small enough to keep in memory and hash table lookups are O(1) on average, so this wouldn't measurably slow down searches.

I think ordinary development work triggering 500 GB of disk write for a file indexer is probably not going to be usable for a lot of people.

Comment 34 tagwerk19 2024-07-11 18:39:56 UTC

(In reply to Adam Fontenot from comment #33)
> ... I think ordinary development work triggering 500 GB of disk write for a file
> indexer is probably not going to be usable for a lot of people ...
I think, with batching up of deletes, the total disk write for deleting "X" files would be comparable to when indexing them; and very much dependent on the size of the batches. Supposition though...

Comment 35 skierpage 2025-03-24 06:41:24 UTC

This is a 2017 bug but since it's still open
(In reply to DDR from comment #5)
> OK, so I have just discovered the magic of ls -la /proc/1234/fd, where 1234
> is the pid of baloo_file_extractor. 😎
> 
> baloo_file_extractor was busy on a 1.5GiB text file,

But as https://community.kde.org/Baloo#Indexing_limitations mentions, "Baloo doesn't index text files (those whose MIME type is detected as "text/something") over 10 MB". If you enable Baloo debugging (linked from that page), you should see  log output from
            qCDebug(BALOO) << "Skipping large " << url << "- mimetype:" << mimetype;
however, I have never seen this warning.

You're lucky; when "baloo_file_extractor high CPU usage, baloo stops indexing" for me (bug 488645), I don't see any indication of what file it's struggling to index in /proc/fd or `lsof -p`.