Bug 380456 - Suspected memory leak in baloo_file_extractor
Summary: Suspected memory leak in baloo_file_extractor
Status: CONFIRMED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: Baloo File Daemon
Version: 5.52.0
Platform: Arch Linux
Priority: NOR  Severity: major
Target Milestone: ---
Assignee: Pinak Ahuja
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-02 08:09 UTC by Gaël de Chalendar (aka Kleag)
Modified: 2024-03-27 15:13 UTC
CC List: 10 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
Upon killing baloo_file_extractor, I suddenly have a lot more free memory. (25.89 KB, image/png)
2018-01-28 05:34 UTC, DDR
Details
attachment-8819-0.html (4.00 KB, text/html)
2021-10-14 06:26 UTC, DDR
Details
pdftotext results from https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC (18.21 KB, text/plain)
2021-11-23 13:40 UTC, tagwerk19
Details

Description Gaël de Chalendar (aka Kleag) 2017-06-02 08:09:28 UTC
Overview:

In top, I see the memory usage of baloo_file_extractor increasing continuously during the initial indexing:

    21775 gael      39  19  0,252t 5,299g 4,244g R  97,0 45,5  75:22.15 baloo_file_extr

Steps to Reproduce: 

1. Install KDE Neon on a machine with several thousand files, for example a development machine

2. Let baloo index for several hours

Actual Results: 

Indexing is progressing, as the number of indexed files reported by balooctl status keeps increasing:

    balooctl status
    Baloo File Indexer is running
    Indexer state: Idle
    Indexed 361894 / 1513487 files
    Current size of index is 6,28 GiB

The memory usage keeps growing.

Expected Results: 
The memory usage should stay quasi constant. Problematic files should be ignored.

Build Date & Platform: 

Any KDE Neon since first install several weeks ago, current:
Package: baloo-kf5
Version: 5.34.0-0neon+16.04+build31
Comment 1 DDR 2018-01-28 05:34:47 UTC
Created attachment 110170 [details]
Upon killing baloo_file_extractor, I suddenly have a lot more free memory.
Comment 2 DDR 2018-01-28 05:42:25 UTC
Comment on attachment 110170 [details]
Upon killing baloo_file_extractor, I suddenly have a lot more free memory.

baloo_file_extractor always seems to use about 16gb of memory, allocated fairly quickly after I start my computer. Commands such as ` balooctl index * ` are unresponsive until I've killed the process.

I'm running Ubuntu 17.10, which is up-to-date as of today (2018-01-27).

I don't think the index is the issue; even if it were held entirely in memory, it wouldn't account for half the problem.
$ balooctl indexSize
Actual Size: 6.80 GiB
Expected Size: 5.04 GiB

When the memory usage is high (before the cliff in the attached image):
$ balooctl status
Baloo File Indexer is running
^C

After I kill baloo_file_extractor (after the cliff in the attached image):
$ balooctl status
Baloo File Indexer is running
Indexer state: Indexing file content
Indexed 356513 / 374337 files
Current size of index is 11.08 GiB

Let me know if I can provide any more information.

Thanks!
Comment 3 Michael Heidelbach 2018-02-01 12:21:15 UTC
(In reply to Gaël de Chalendar (aka Kleag) from comment #0)
> 1. Install KDE Neon on a machine with several thousand files, for example a
> development machine

Baloo is having a hard time indexing plain text, because there are so many terms to extract. Also, the backend database is memory-based, so I would expect memory consumption to rise during the process.
Please report memory usage when indexing is done. I'm really curious to see that.

>     Baloo File Indexer is running
>     Indexer state: Idle
>     Indexed 361894 / 1513487 files
This is strange: There are a lot of files left to be indexed, but the indexer itself is idle?
Did you kill it? Probably not. I've encountered this many times. 
Anyway this behaviour definitely is worth scrutinizing. I'll do it when I'm more familiar with baloo's code.

For the time being, occasionally it helps to restart baloo with
$ balooctl stop
ensure baloo_file and baloo_file_extractor are not running
$ balooctl start

>     Current size of index is 6,28 GiB
With an index of that size searching might be a little slow. And you're not even half-way done :)
Not sure, but I have the feeling baloo wasn't designed for this and you're overburdening it.
I'm just trying to imagine what will happen when you enter 'const' in KRunner/Milou.
Comment 4 Michael Heidelbach 2018-02-01 12:35:08 UTC
(In reply to DDR from comment #2)
> Commands such as ` balooctl index * ` are unresponsive until I've killed the process.
Please clarify 'unresponsive': Did you have to Ctrl-C?
You did this while indexing was in progress. 
$ balooctl index *
is probably waiting for indexing to finish before queuing another batch.
And even then most likely you'll only get a lot of 'indexing done' messages.


> Baloo File Indexer is running
> Indexer state: Indexing file content
> Indexed 356513 / 374337 files
Please keep your cool, let indexing finish in peace. It's nearly done :-)
What kind of files are being indexed? See Comment #3.
Comment 5 DDR 2018-03-07 09:49:32 UTC
OK, so I have just discovered the magic of ls -la /proc/1234/fd, where 1234 is the pid of baloo_file_extractor. 😎

baloo_file_extractor was busy on a 1.5GiB text file, production-aria-tables.sql, and then got stuck on its backup. I added these files to the ignore list, in File Search — System Settings, and the indexer has gotten on with life and is indexing the last few files it needs to. Unfortunately, as the file is a database dump of mlpforums.com, I cannot share it for reproduction due to confidentiality issues. Perhaps a partial dump of the kde bugs database would suffice for that purpose.
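The /proc trick above generalises into a small script for spotting the file an extractor is stuck on. A sketch, Linux-only; note that /proc truncates process names to 15 characters, which is why the extractor shows up as `baloo_file_extr`:

```shell
#!/bin/sh
# List the regular files a process currently has open, via /proc (Linux only).
busy_files() {
    # Each entry in /proc/<pid>/fd is a symlink; print targets that are
    # filesystem paths and skip devices like /dev/pts.
    ls -l "/proc/$1/fd" 2>/dev/null | awk '/ -> \//{print $NF}' | grep -v '^/dev/' || true
}

# Process names in /proc are truncated to 15 chars, hence "baloo_file_extr".
pid=$(pgrep -x baloo_file_extr || true)
if [ -n "$pid" ]; then
    busy_files "$pid"
fi
```

If the same path keeps showing up for minutes on end, that file is a good candidate for the exclude list.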
Comment 6 DDR 2018-03-07 10:19:01 UTC
> Please report memory usage when indexing is done. I'm really curious to see that.
About 1.1GiB, ~5% of the available system memory. Very reasonable.

> Did you kill it? Probably not. I've encountered this many times.
No, not as of the report. I did shortly after - it was that, or it killed me by swapping anything useful to disk.

> Please clarify 'unresponsive': Did you have to Ctrl-C?
Yes. I think balooctl was waiting for baloo_file_extractor to provide some information, but the extractor never would. I think it was busy extracting. I don't have a way to Ctrl-C the file extractor, but when I send the equivalent signal ("End Process" in System Monitor), it shuts down just fine.

> With an index of that size searching might be a little slow. And your even half-way done :)
> Not sure, but I have the feeling baloo wasn't designed for this and you're overburdening it.

Searching is still lightning fast. It seems it was designed very well in that regard.
I was definitely overburdening it. I feel it really should have known better than to try to index a tremendous plain-text file, though. It is enthusiastic; it bit off significantly more than it could chew. The actual search index for the forum the database dump was from takes over a week to rebuild on the server, so I imagine the more generalised search tool would be absolutely doomed in that endeavour. That, and a week of solid uptime is quite rare for me.


> I'm just trying to imagine what will happen when you enter 'const' in  KRunner/Milou.
Would Dolphin's ctrl-f suffice? It's up to 1158 folders and 115308 files. Somewhat amazingly, although the search results took a few minutes to populate, Dolphin itself is still perfectly responsive and I can scroll through the files just fine. Typing to select a file works both perfectly and instantly. Memory use remained unremarkably low throughout the whole process, and didn't really change when I exited Dolphin.

All 374579 files have now finished indexing. The current size of the index is 11.08 GiB.
Comment 7 Allan Andersen 2018-09-11 07:11:16 UTC
Same issue here. Really annoying; it uses lots of CPU and memory.

balooctl status
Baloo File Indexer is running
Indexer state: Idle
Indexed 98426 / 157132 files
Current size of index is 7,81 GiB

Process: baloo_file_extractor is using 9.2 GiB memory!

It's a developer machine with several thousand files.

Linux aa-Precision-3510 4.15.0-33-generic #36-Ubuntu SMP Wed Aug 15 16:00:05 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Comment 8 Johannes Tiemer 2018-09-27 09:36:27 UTC
I have the same issue on a current Arch Linux (baloo 5.50, kernel 4.18.9). I only noticed it after the last upgrade a week ago, due to the machine having a long uptime. Lower versions of baloo ran just fine, and I fail to remember making huge changes to the data to be indexed.

Component/Version: baloo_file_extractor/5.50
Platform: Arch Linux
Kernel: 4.18.9

Issue: After KDE startup, baloo_file_extractor uses ever more RAM until it freezes the machine once all available RAM is used. Not even switching to another tty to kill the process is possible anymore.

Steps to reproduce: Start Plasma session and wait while watching the RAM use applet or htop. It takes around 3-4 minutes after startup to fill what is available of 16GB RAM.

Remedy: Sending SIGTERM ends baloo_file_extractor and frees the RAM. baloo doesn't restart it.

Context info: I found some sources that claim .vdi files are a problem for baloo, so I excluded my VM directory (and some others) from the search. It did not change anything about the issue.

[user@machine ~]$ balooctl status
Baloo File Indexer is running
Indexer state: Indexing file content
Indexed 91496 / 91624 files
Current size of index is 31,19 GiB

[user@machine ~]$ balooctl indexSize
Actual Size: 31,19 GiB
Expected Size: 14,32 GiB

           PostingDB:     616,27 MiB    25.946 %
          PositionDB:       1,96 GiB    84.367 %
            DocTerms:       2,37 GiB   101.981 %
    DocFilenameTerms:      11,44 MiB     0.482 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       1,48 MiB     0.062 %
          IdFileName:       8,32 MiB     0.350 %
             DocTime:       3,87 MiB     0.163 %
             DocData:      12,64 MiB     0.532 %
   ContentIndexingDB:      12,00 KiB     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       3,34 MiB     0.141 %

[user@machine ~]$ uname -a
Linux pica 4.18.9-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 19 21:19:17 UTC 2018 x86_64 GNU/Linux

[user@machine ~]$ pikaur -Qi baloo
Name                     : baloo
Version                  : 5.50.0-1
Description              : A framework for searching and managing metadata
Architecture             : x86_64
URL                      : https://community.kde.org/Frameworks
Licenses                 : LGPL
Groups                   : kf5
Provides                 : None
Depends On               : kfilemetadata  kidletime  kio  lmdb
Optional Deps            : qt5-declarative: QML bindings [installed]
Required By              : baloo-widgets  gwenview  plasma-desktop  plasma-mediacenter
Optional For             : plasma-workspace
Conflicts With           : None
Replaces                 : None
Installed Size           : 2,41 MiB
Packager                 : Antonio Rojas <arojas@archlinux.org>
Build Date               : Mon 03 Sep 2018 16:26:53 CEST
Install Date             : Mon 10 Sep 2018 02:47:54 CEST
Install Reason           : Installed as a dependency for another package
Install Script           : No
Validated By             : Signature
Comment 9 Johannes Tiemer 2018-09-27 09:38:34 UTC
D'oh, forgot to mention: CPU load is 100% on one single core until I kill the process.

I'll remember to look into what it's indexing when I boot next time.
Comment 10 Johannes Tiemer 2018-09-28 17:59:34 UTC
I checked. It took baloo_file_extractor 9 minutes (according to uptime) to fill 13GiB RAM, where it pretty much exclusively (as far as I could tell) operated on my Archive disk, which, among others, contains lots of txt/csv-files (large datasets, probably a little below 100GiB) and my email backups from thunderbird which are a mess of many tens of thousands of small files.

I blacklisted a part of it now and will report back once I find out something new about its behaviour.
Comment 11 Johannes Tiemer 2018-09-29 21:37:20 UTC
After excluding the above-mentioned folders with lots of small files, baloo stopped its memory-eating behavior. Maybe scanning file counts and sizes first and then warning the user would be a simple safeguard?
Comment 12 Johannes Tiemer 2018-12-04 12:40:05 UTC
Hey everybody,
since installing the update to version 5.52 on my computer (Arch, current), the baloo file indexer began showing unwanted behaviour again. All directories with large text files were blacklisted beforehand.
The behaviour is as before: RAM usage explodes within roughly a minute to fill all of 16GB; since I have no swap, it then freezes my machine by clogging the RAM … playing "nice" with RAM may be a thing too.

What I found with "balooctl monitor"
- it seemed to plainly ignore that it should _not_ index the Windows partition that I have mounted into my home for convenience
- it seems to begin expanding in RAM while reporting "checking for obsolete index entries"

I let baloo completely recreate its index over a few days when I realised it was misbehaving again. See my above comment for the earlier indexSize:
---
$ balooctl indexSize
Actual Size: 32,88 GiB
Expected Size: 22,85 GiB

           PostingDB:       2,31 GiB    81.336 %
          PositionDB:       1,48 GiB    51.905 %
            DocTerms:       3,71 GiB   130.294 %
    DocFilenameTerms:      57,95 MiB     1.989 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       7,63 MiB     0.262 %
          IdFileName:      40,62 MiB     1.394 %
             DocTime:      18,80 MiB     0.645 %
             DocData:      53,23 MiB     1.827 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:      10,35 MiB     0.355 %
---
Comment 13 Savor d'Isavano 2019-10-17 11:13:26 UTC
As of baloo 5.63.0, the issue persists.

Memory consumption increases by ~2MB/s. CPU consumption is also considerable.

Luckily I noticed the CPU fan spinning noisily and disabled baloo before it was too late to save my work (>9GB memory at the time).

See this screencast:
https://vimeo.com/366988108
Comment 14 Adam Fontenot 2021-10-14 06:03:40 UTC
This is still a really common issue. I don't know that I've ever spoken to someone who uses KDE + Baloo with the out-of-the-box settings who hasn't run into it. I mean, just for a starting sample:

https://old.reddit.com/r/kde/comments/j77j16/can_we_please_have_kde_disable_baloo_by_default/
https://old.reddit.com/r/kde/comments/kzdoux/baloo_should_be_suspended_when_the_system_is_in/
https://old.reddit.com/r/kde/comments/lgg0su/how_is_baloo_doing_these_days/
https://old.reddit.com/r/kde/comments/o6w0ly/whats_wrong_with_baloo/
https://old.reddit.com/r/kde/comments/pc4wk1/baloo_file_extr_extreme_cpu_usage/

This is just the first five issues I could find with people talking about this *exact* problem, but the list goes on and on. The oldest complaint in that list is only a year old.

More than a complaint, I have a proposal: it is completely unreasonable for a file indexer to ever make a user's system unusable. Any time it takes baloo_file_extractor more than 30 seconds to pull the text out of a file, or it starts using more than 10% of the user's total RAM, it should be instantly killed and the file blacklisted. Only the file name (not contents) should be available to search results.

Moreover, some kind of heuristic is desperately needed to tell Baloo that a file can't be usefully indexed. Baloo is happy to use a ton of memory and hard disk space to index files that are - for most purposes - random binary data.

Just as an example: I have a PDF that contains no meaningful text at all (it's a plot automatically generated from some technical data). It's only 20 MB. Yet baloo_file_extractor hung on this file for a *long* time, probably more than half an hour, with RAM use up over 1 GB. It continued using 100% of one CPU core despite the fact that I was trying to run a full screen game at the time.
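For what it's worth, the limits proposed above can be approximated today with stock tools. A sketch (not anything Baloo does itself); the 30-second and ~1 GiB figures are just the illustrative numbers from the proposal, and `pdftotext` stands in for the extraction step:

```shell
#!/bin/sh
# run_limited SECS KIB CMD...: run CMD under a wall-clock deadline (coreutils
# timeout) and a virtual-memory cap (ulimit -v, in KiB). Exit status 124
# means the deadline was hit; an allocation past the cap makes CMD fail.
run_limited() {
    secs=$1; kib=$2; shift 2
    ( ulimit -v "$kib"; exec timeout "$secs" "$@" )
}

# Hypothetical use: give the conversion 30 s and ~1 GiB, else blacklist it.
# run_limited 30 1048576 pdftotext problem.pdf - >/dev/null \
#     || echo "blacklist problem.pdf"
```

A nonzero exit from the wrapper maps naturally onto "flag the file as failed and index only its name".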
Comment 15 DDR 2021-10-14 06:26:08 UTC
Created attachment 142418 [details]
attachment-8819-0.html

I second this. It's a bit absurd to just run it with no resource limits,
internally or externally.

Comment 16 Adam Fontenot 2021-11-16 10:44:42 UTC
I actually filed an upstream bug with Poppler for its handling of the specific PDF file I was seeing issues with. https://gitlab.freedesktop.org/poppler/poppler/-/issues/1173

Surprisingly, the Poppler devs say there's nothing wrong with Poppler here (despite the fact that their pdftotext tool hangs for over an hour on this file). That's because the R script which generated it is apparently using the "I" character repeatedly as part of a graph. I don't know why R does that, but it does.

Quoting the dev response:

> whether this bug is fixed or not baloo needs to understand that extracting the 
> text of a pdf file can take forever, and thus give up after X seconds/minutes

Obviously this is not going to correspond to everyone's issues, but it's an interesting example of the point I made:

> it is completely unreasonable for a file indexer to ever make a user's system 
> unusable. Any time it takes baloo_file_extractor more than 30 seconds to pull 
> the text out of a file, or it starts using more than 10% of the user's total 
> RAM, it should be instantly killed and the file blacklisted. Only the file name 
> (not contents) should be available to search results.

So in general, while there *may* be specific bugs with Baloo that need fixing or some crazy files that perhaps "shouldn't" exist, the probable cause of this problem for *most* users is that Baloo simply doesn't give up on trying to index a file when it really, really should.
Comment 17 tagwerk19 2021-11-16 12:36:12 UTC
(In reply to Adam Fontenot from comment #16)
> ... it is completely unreasonable for a file indexer to ever make a user's system 
> unusable. Any time it takes baloo_file_extractor more than 30 seconds to pull 
> the text out of a file, or it starts using more than 10% of the user's total 
> RAM, it should be instantly killed and the file blacklisted. Only the file name 
> (not contents) should be available to search results ...
OOoooo. Ouch!

If you look at htop, you'll see that baloo_file and baloo_file_extractor run with minimum priority. They'll yield to nearly everything that wants a CPU. They should take all the time they need without annoying anything else....

Memory usage is different, baloo "memory maps" the index and pulls pages from disc to memory as needed, they'll be "forgotten" again if the RAM is needed (and the pages have not been modified). You might see that baloo_file / baloo_file_extractor use a lot of memory but that can be "just cache".

The kicker is when indexing builds a *large* transaction; that might take a lot of memory (possibly, alas, stretching into swap). If you kill the process before the commit is done, you condemn yourself to repeating the work. On a system with Out Of Memory (OOM) protections, you might hit this.

You can see a little of what's happening (the switching between reading the source files and writing the updates to the index) with iotop.
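The cache-vs-real distinction above can be checked directly in /proc. A sketch, Linux-specific (the RssAnon/RssFile breakdown needs a reasonably recent kernel):

```shell
#!/bin/sh
# Break a process's resident memory down into file-backed pages (RssFile,
# e.g. the mmap'ed index - reclaimable under memory pressure) and anonymous
# pages (RssAnon - not reclaimable without swap). A large VmRSS that is
# mostly RssFile really is "just cache".
mem_breakdown() {
    grep -E '^(VmRSS|RssAnon|RssFile|RssShmem):' "/proc/$1/status"
}

pid=$(pgrep -x baloo_file_extr || echo $$)  # fall back to this shell as a demo
mem_breakdown "$pid"
```

If RssAnon (rather than RssFile) is what keeps growing, the memory genuinely cannot be given back without killing the process.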

> ... Surprisingly, the Poppler devs say there's nothing wrong with Poppler here
> (despite the fact that their pdftotext tool hangs for over an hour on this
> file). That's because the R script which generated it is apparently using
> the "I" character repeatedly as part of a graph. I don't know why R does
> that, but it does ...
I'm tempted to say that if this is an application-generated file with little or no human-readable information in it (that happens to be a PDF), it would make sense to have an application-specific mimetype for it. Then that could be added to baloo's "exclude filters" list. I suspect, though, that if the file is generated by a script, that might not be possible.

> So in general, while there *may* be specific bugs with Baloo that need
> fixing or some crazy files that perhaps "shouldn't" exist, the probable
> cause of this problem for *most* users is that Baloo simply doesn't give up
> on trying to index a file when it really, really should.
Baloo does have a mechanism for flagging files as "failed" - "balooctl failed" will list them. I think that needs more love...
Comment 18 Adam Fontenot 2021-11-16 23:23:00 UTC
(In reply to tagwerk19 from comment #17)
> If you look at htop, you'll see that baloo_file and baloo_file_extractor run
> with minimum priority. They'll yield to nearly everything that wants a CPU.
> They should take all the time they need without annoying anything else....
Hmm, even assuming this is true, does the process suspend if the user is on battery? An otherwise idle system consuming 100% of a core for hours on end is sure to annoy the user even if it doesn't interfere with other processes.

I'd also point out that I discovered this issue (after several years of being vaguely aware of "baloo problems") when I saw stuttering in a full screen game. Alt-tabbing to htop showed baloo_file_extractor at 100%. Baloo may in theory yield to other processes, but it didn't prevent me from seeing issues.

> Memory usage is different, baloo "memory maps" the index and pulls pages
> from disc to memory as needed, they'll be "forgotten" again if the RAM is
> needed (and the pages have not been modified). You might see that baloo_file
> / baloo_file_extractor use a lot of memory but that can be "just cache".
If I'm not mistaken, that's just for internal Baloo memory usage, right? In my case, baloo_file_extractor is calling out to an external library (poppler), and that library is consuming an endlessly growing amount of memory (from 1-3 GB before I've killed it). It's probably safe to say that this memory usage is in the form of anonymous mappings which can't be reclaimed. Baloo *must* take that into account and kill the extractor process if it begins affecting system resources.

> I'm tempted to say that if this is a application generated file with
> little/no human readable information in it (that happens to be a PDF) it
> would make sense to have an application specific mimetype for it. Then that
> can be added to baloo's "exclude filters" list. I suspect though that if the
> file is generated by a script, that might not be possible.
In this case, it's a graph of some scientific data. Plotting scientific data to PDF or SVG (which both can have extractable text) is very common. In any case, it shouldn't be on the user to determine which files are causing problems (I had to use strace!) and exclude them. A file indexer should "just work".
Comment 19 tagwerk19 2021-11-17 09:21:05 UTC
(In reply to Adam Fontenot from comment #18)
> Hmm, even assuming this is true, does the process suspend if the user is on
> battery? An otherwise idle system consuming 100% of a core for hours on end
> is sure to annoy the user even if it doesn't interfere with other processes.
I'm pretty confident about the CPU priority and I know that baloo is aware that it is on battery (and avoids content indexing). What happens in your case, I'm afraid I don't know.

> I'd also point out that I discovered this issue (after several years of
> being vaguely aware of "baloo problems") when I saw stuttering in a full
> screen game. 
I would still suspect memory use rather than CPU as the underlying reason. There are situations where baloo is building a large transaction and requires lots of memory, there's a summary starting https://bugs.kde.org/show_bug.cgi?id=400704#c31. It's quite possible for systems to "hit the mud" in these cases.

> If I'm not mistaken, that's just for internal Baloo memory usage, right?
I'd say yes, the cases I've looked at were when indexing large text files and writing the results to the index.

> ... baloo_file_extractor is calling out to an external library
> (poppler), and that library is consuming an endlessly growing amount of
> memory (from 1-3 GB before I've killed it). It's probably safe to say that
> this memory usage is in the form of anonymous mappings which can't be
> reclaimed. Baloo *must* take that into account and kill the extractor
> process if it begins affecting system resources.
That's a *lot* of memory for a "pdf to text" conversion 8-]

You see the baloo_file_extractor RAM usage go up during the extraction and not come down when it is finished?

> In this case, it's a graph of some scientific data. Plotting scientific data
> to PDF or SVG (which both can have extractable text) is very common. In any
> case, it shouldn't be on the user to determine which files are causing
> problems (I had to use strace!) and exclude them.
Understood.

Could you see the culprit file in "System Settings > Search" (recent releases of baloo show the progress of the indexing there) or when running "balooctl monitor"?

In your use case, you could save your plots to a folder that is not indexed. Yes, I know, it shouldn't be up to the user, but in this case it works as a workaround...

>  A file indexer should "just work".
Yup,  I think there's general agreement on that :-)
Comment 20 Adam Fontenot 2021-11-18 01:00:01 UTC
(In reply to tagwerk19 from comment #19)
> I would still suspect memory use rather than CPU as the underlying reason.
It's quite possible that you're right about that. I do know the game is sensitive to available memory, possibly because it runs on the internal Intel graphics chip.

> > ... baloo_file_extractor is calling out to an external library
> > (poppler), and that library is consuming an endlessly growing amount of
> > memory (from 1-3 GB before I've killed it). It's probably safe to say that
> > this memory usage is in the form of anonymous mappings which can't be
> > reclaimed. Baloo *must* take that into account and kill the extractor
> > process if it begins affecting system resources.
> That's a *lot* of memory for a "pdf to text" conversion 8-]
Yes, especially for a random 20 MB PDF I didn't even remember existed.

> You see the baloo_file_extractor RAM usage go up during the extraction and
> not come down when it is finished?
I have never been able to leave it for long enough to finish extracting from the file. It's possible I'd even get an out of RAM hang before then. The Poppler devs estimate at least 7 GB of RAM would be needed to extract text from this file. I even tested their pdftotext command on a system with plenty of RAM, and even then the issue is that it simply takes too long. I've left it running for over an hour on this one file before, and never seen it complete.

Moreover, they insist that it's not a bug on their end. The file, in their view, is pathological and the only reasonable solution is not to try to extract text from it. I think I understand that perspective: it's not every day that you come across a PDF with millions of "words" on a single page. So it's on Baloo to bail out if the process takes too long or consumes too much RAM. Here's the bug report I filed with them if you want to follow that conversation: https://gitlab.freedesktop.org/poppler/poppler/-/issues/1173

> Could you see the culprit file in "System Settings > Search" (recent
> releases of baloo show the progress of the indexing there) or when running
> "balooctl monitor"?
Unfortunately, I don't remember. I do remember using lsof and friends to check that it was the only file Baloo had open. I may not have realized at the time that that feature had been added to the Baloo KCM.
Comment 21 tagwerk19 2021-11-23 13:40:25 UTC
Created attachment 143869 [details]
pdftotext results from https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC

(In reply to Adam Fontenot from comment #20)
> ... The file, in their view, is pathological ...
Applying a modicum of patience, running:

    nice -19 pdftotext QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf

took 37 hours on a machine with 16GB memory 8-]

The process gradually ate memory, reaching 10 GB. There wasn't an obvious impact on performance - but I would expect you'd see that bite when reaching the limits/starting to swap.

Attaching the output file - just in case anyone else wants to see the result.

When moving the source file to an indexed folder it was picked up by baloo and indexed by baloo_file_extractor. Similarly 37hrs and 10.1 GB.

Alas, I wasn't quick enough to notice what happened to the baloo_file_extractor memory usage when the indexing finished - the process terminated (and released memory) when it had nothing more to do.

The details of the index records:

    $ balooshow -x Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf
    1546b20000fc01 64513 1394354 Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf [/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf]
            Mtime: 1637335759 2021-11-19T16:29:19
            Ctime: 1637335813 2021-11-19T16:30:13
            Cached properties:
                    Title: R Graphics Output
                    Document Generated By: R 3.6.0
                    Page Count: 1
                    Creation Date: 2019-09-13T11:01:30.000Z

    Internal Info
    Terms: 0 100000 150000 200000 50000 Mapplication Mpdf T5 X15-graphics X15-output X15-r X17-3.6.0 X17-r X18-1 X24-2019-09-13T11:01:30Z a1 a2 b1 b2 c graphics output qagr qchr qkel qpal r vcf − ●
    File Name Terms: Fpdf Fqmvqwhpuqke7retn5f9tisea7
    XAttr Terms:
    generator: 3.6.0 r
    pageCount: 1
    title: graphics output r
    creationDate: 2019-09-13T11:01:30Z

and...

    $ balooshow -x Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt
    140a610000fc01 64513 1313377 Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt [/home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.txt]
            Mtime: 1637519014 2021-11-21T19:23:34
            Ctime: 1637519014 2021-11-21T19:23:34
            Cached properties:
                    Line Count: 4352

    Internal Info
    Terms: 0 100000 150000 200000 50000 Mplain Mtext T5 T8 X20-4352 a1 a2 b1 b2 c qagr qchr qkel qpal vcf − ●
    File Name Terms: Fqmvqwhpuqke7retn5f9tisea7 Ftxt
    XAttr Terms:
    lineCount: 4352

So, for this instance, not a lot of indexable text but the metadata was recognised (in the PDF, it was not extracted to the text) and it was possible to search for the title:

    $ baloosearch "R Graphics Output"

or...

    $ baloosearch title:"R Graphics Output"
    /home/test/Downloads/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf

I think with enough RAM and patience baloo can cope with even this pathological test case, but the requirement definitely _is_ "enough RAM and patience". It would certainly make sense to be able to tell baloo_file_extractor "give up after 10 minutes" and flag the file as failed.

I'll update Bug 400704, which has become a collection point for these misbehavin' reports. See:

    https://bugs.kde.org/show_bug.cgi?id=400704#c31

and onwards.
Comment 22 Stefan Brüns 2024-03-27 15:13:06 UTC
(In reply to tagwerk19 from comment #21)
> Created attachment 143869 [details]
> pdftotext results from
> https://ipfs.io/ipfs/QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC
> 
> (In reply to Adam Fontenot from comment #20)
> > ... The file, in their view, is pathological ...
> Applying a modicum of patience, running:
> 
>     nice -19 pdftotext QmVqWhPuQkE7reTN5F9TiSeA75z62VNaZUSFZz3FdWTLbC.pdf
> 
> took 37 hours on a machine with 16GB memory 8-]
> 
> The process gradually ate memory, reaching 10 GB. There wasn't an obvious
> impact on performance - but I would expect you'd see that bite when reaching
> the limits/starting to swap.

The long runtime is caused by an algorithmically bad implementation, i.e. O(n^2) where e.g. O(n log n) is sufficient. The huge memory footprint is caused by a problematic data arrangement and too-greedy pre-/over-allocation.

I have filed two MRs [1],[2] for poppler; with both applied, the extraction runs in ~50 seconds on my 3-year-old laptop, with a peak memory consumption of 1.8 GByte.

[1] https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/1514  
[2] https://gitlab.freedesktop.org/poppler/poppler/-/merge_requests/1515