Bug 332421 - Baloo file extractor ate all my PC's RAM
Status: RESOLVED FIXED
Alias: None
Product: Baloo
Classification: Unmaintained
Component: General
Version: unspecified
Platform: Kubuntu Linux
Importance: NOR grave
Target Milestone: ---
Assignee: Pinak Ahuja
Reported: 2014-03-21 23:50 UTC by Kyrylo Bohdanenko
Modified: 2018-05-12 12:28 UTC

Version Fixed In: 4.13.1
Attachments
baloo_file_extractor eating my RAM (108.68 KB, image/png)
2014-03-21 23:50 UTC, Kyrylo Bohdanenko
Baloo using 11gb Ram (119.53 KB, image/png)
2014-04-18 10:19 UTC, rick_airtime
massif output (798.96 KB, application/octet-stream)
2014-04-23 10:39 UTC, Matt
Little tool to show which files are currently being indexed (536 bytes, text/x-python)
2014-04-26 09:26 UTC, Michael Tils

Description Kyrylo Bohdanenko 2014-03-21 23:50:15 UTC
It just consumes more and more RAM on my PC.

Reproducible: Always

Steps to Reproduce:
1. Just wait until baloo_file_extractor runs and watch its memory usage increase
Actual Results:  
After a few minutes baloo_file_extractor process consumes about 8 Gigs of RAM

Expected Results:  
Baloo should not consume so much memory

See attached screenshot. I'm using Kubuntu 14.04 (x86_64). I can provide any additional information if required.

My report may be related to this one: https://bugs.kde.org/show_bug.cgi?id=331936

But I do not use KMail (with Akonadi) at all. I use Thunderbird (if that matters).
Comment 1 Kyrylo Bohdanenko 2014-03-21 23:50:48 UTC
Created attachment 85679
baloo_file_extractor eating my RAM
Comment 2 Vishesh Handa 2014-03-22 02:36:21 UTC
These issues are slightly hard to reproduce as they depend a lot on the file being indexed. You'll need to figure out which file is causing this excessive memory usage. Here is how you can do so -

$ ps aux | grep baloo_file_extractor

This will give you a list of numbers. Each of these numbers represents a file. You can find out exactly which files those are by running balooshow -

$ balooshow 4 56 65 43 .. other numbers

Try running baloo_file_extractor on each of these files individually and see whether you can reproduce it, or whether it only happens with all those numbers together.

In the future, I'll try to make a tool to automatically do this for you and log memory usage, but till then you'll have to do it manually.
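Until such a tool exists, the manual step of picking the numbers out of the ps output can be scripted. The following is only a minimal sketch in standard C++, not part of Baloo: the helper name extractorDocumentIds is hypothetical, and it assumes the extractor's arguments are whitespace-separated decimal numbers that follow the command name, as in the ps output quoted in this report.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Baloo): given one line of `ps aux` output,
// return the numeric arguments that follow the baloo_file_extractor command.
std::vector<std::string> extractorDocumentIds(const std::string& psLine)
{
    std::vector<std::string> ids;
    const std::string marker = "baloo_file_extractor";

    std::istringstream tokens(psLine);
    std::string token;
    bool afterCommand = false;
    while (tokens >> token) {
        if (!afterCommand) {
            // The command may carry a full path, e.g. /usr/bin/baloo_file_extractor.
            if (token.size() >= marker.size() &&
                token.compare(token.size() - marker.size(), marker.size(), marker) == 0) {
                afterCommand = true;
            }
            continue;
        }
        // Keep only tokens made up entirely of digits.
        bool allDigits = !token.empty();
        for (unsigned char c : token) {
            if (!std::isdigit(c)) {
                allDigits = false;
                break;
            }
        }
        if (allDigits) {
            ids.push_back(token);
        }
    }
    return ids;
}
```

The returned strings can then be passed to balooshow as shown above. A line produced by the grep process itself yields an empty list, because nothing follows the command name there.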
Comment 3 Kyrylo Bohdanenko 2014-03-23 10:18:06 UTC
It seems like baloo_file_extractor behaves badly with (pretty big) *.vdi files. I have:
30G     /windows/D/Work/VMs/Kubuntu Raring/Kubuntu Raring.vdi
on NTFS:
/dev/sda7 on /windows/D type fuseblk (rw,nosuid,nodev,allow_other,default_permissions,blksize=4096)

So when I'm running baloo_file_extractor on that file it causes huge CPU load and starts to eat RAM pretty fast.

I can share that *.vdi with you (if required)
Comment 4 Vishesh Handa 2014-03-23 11:56:53 UTC
Interesting. I'll borrow some large vdi files from a colleague. The 4gb one I have does not cause any problems. 

Weirdly enough, we do not have any indexers for vdi files so they should be ignored. Could you possibly run massif on the file?

$ valgrind --tool=massif baloo_file_extractor theFile.vdi

It will output a file which you can then upload.
Comment 5 Roman 2014-03-27 13:05:58 UTC
In my case baloo_file_extractor eats ~2 GB of RAM and hangs my system on a ~4 GB file that contains lots of floating-point numeric data (stored as ASCII text, with many very long lines of ~200000 characters each).
Comment 6 rick_airtime 2014-04-18 09:38:27 UTC
I have the same problem. Baloo seems to use 100% of one core and up to about 10 GB of RAM. It stops for about a second, drops everything out of RAM, and starts building up again.

$ ps aux | grep baloo_file_extractor
rick      4934  100 38.1 6609872 6253028 ?     RN   10:16   2:40 /usr/bin/baloo_file_extractor 19009 19008 19007 19006 19005 19004 19003 19002 19001 19000
rick      4979  0.0  0.0  11752   924 pts/2    S+   10:19   0:00 grep --color=auto baloo_file_extractor
Comment 7 rick_airtime 2014-04-18 10:19:42 UTC
Created attachment 86146
Baloo using 11gb Ram
Comment 8 boblovgren55 2014-04-19 12:38:10 UTC
I'm not having an issue with Baloo file extractor using a lot of ram, but I am noticing it shoot up to 100% of 1 core constantly, and a lot of unnecessary disk activity that wasn't happening with KDE 4.11.5. It's surprising there's no GUI switch to turn Baloo search off; I know there are other ways, but the way it's currently implemented is that a new user may not know how to turn it off, and it's being forced down their throat if they upgrade. That in itself is not appropriate. The developers should have included the functionality to turn Baloo off via the GUI, especially when considering the fact that my HD is going crazy and it's eating away at the CPU.
Comment 9 boblovgren55 2014-04-19 13:01:37 UTC
As a test I added my home folder to the "Desktop Search" exclusion list, and baloo file cleaner went crazy on my HD and would not stop accessing it. Then I removed the home folder from the exclusion list and added it back, and baloo is at a standstill. Baloo seems fairly broken to me...my system is now worse than when Nepomuk was on it. Thanks a lot KDE team.
Comment 10 boblovgren55 2014-04-19 13:20:51 UTC
I think I tracked it down to BOINC causing Baloo to go completely bonkers. Now, not everybody is going to figure out how to add their BOINC data folder to the exclusion list, so is there a way to make Baloo play nice with BOINC without burning somebody's hard drive out in 2 weeks?
Comment 11 André M 2014-04-20 18:00:35 UTC
I don't use or even have BOINC installed. I thought the problem was related to VDI images, but even after excluding the mountpoint where the images sit, Baloo still eats all my RAM until the system starts swapping and becomes unusable. Killing the baloo processes or logging out didn't help; the only way to take back my RAM is restarting.
Comment 12 Matt 2014-04-23 09:08:26 UTC
baloo_file_extractor consumes all my ram with the file:
X-Plane 10 Demo/Resources/default scenery/default apt dat/Earth nav data/apt.dat
You can get this file by installing the X-Plane 10 Demo from http://www.x-plane.com/downloads/landing/ (my X-Plane version is 10.25).
Comment 13 Marcel Martin 2014-04-23 09:29:33 UTC
I've upgraded to Kubuntu 14.04 two days ago and have the same problem: Memory usage of baloo_file_extractor grows up to about 11 GB, then the process disappears and the next process tries to index the same file all over again. I'm adding my info here since the problematic file is publicly available.

The file it is trying to index is the DNA sequence of the human genome, available here:
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
The problem occurred with the uncompressed file, which has about 3 GB. The file is pure ASCII, so I guess the problem is the same as the one Roman has seen since his data is also just ASCII.

Of course indexing of large text files should be fixed, but perhaps it's also an idea to not index ASCII text that is larger than some threshold, perhaps 100 MB or so. And at least for FASTA files, it doesn't make sense to have things like character, word and line count.

While I appreciate a lot the effort spent on this and would really like to use desktop search, I also think that bugs like these are a good reason for having a simple way to temporarily disable indexing, at least in this stage of development.
Comment 14 Vishesh Handa 2014-04-23 09:39:32 UTC
Agreed. I'll add a 20-30 mb limit on pure text files.

I've added a page over here which documents the config values used by Baloo. This can also be used to disable it.

http://community.kde.org/Baloo/Configuration
Comment 15 boblovgren55 2014-04-23 09:56:30 UTC
Using Baloo search, it's coming up with completely irrelevant search results, or doesn't find the file on my hard drive at all. This piece of software is very broken; it's not even beta, why was it released? It doesn't even work.
Comment 16 Vishesh Handa 2014-04-23 10:11:52 UTC
Guys, if it's not working for you please try to provide accurate debugging info. 

http://community.kde.org/Baloo/Debugging
Comment 17 Matt 2014-04-23 10:39:03 UTC
Created attachment 86228
massif output

ms_print file for the extractor running on  X-Plane 10 Demo/Resources/default scenery/default apt dat/Earth nav data/apt.dat.
Comment 18 Matt 2014-04-23 10:41:52 UTC
I have attached a massif ms_print file for the extractor running on  X-Plane 10 Demo/Resources/default scenery/default apt dat/Earth nav data/apt.dat.
apt.dat is a 2.5 million line text file (102MB) and indexing it is too much for my system with 2GB of RAM.
Comment 19 Matt 2014-04-23 12:20:30 UTC
I have managed to stop baloo_file_extractor using up all my ram by limiting the number of lines it will index in any one file.
I changed kfilemetadata's plaintextextractor.cpp lines 54-55 to:
//Stop indexing after the first 500000 lines and restrict lines to 10000 chars.
    int MaxLines = 500000;
    int MaxCharsPerLine = 10000;    
    while (!ts.atEnd() && lines < MaxLines) {
        QString str = ts.readLine(MaxCharsPerLine);

I have also limited the number of chars in a line to prevent large lines from using up too much ram (I do not know if this is necessary).
These values are currently arbitrary and some testing needs to be done with different files (I have not tested with files with long lines at all!).
Limiting the number of chars in a line in this way will result in words being broken up, and as I am not familiar with Xapian or Baloo I am not sure what the consequences of this will be, but I imagine they are not as serious as crippling the system by using all its RAM.

This method has the advantage of at least indexing some of the file. A better solution would involve indexing the whole file (maybe by breaking the indexing into parts?).
I will try to do more testing if I have time.
Comment 20 Matt 2014-04-23 20:08:55 UTC
If I have done this right, this code limits the indexing of a file to 32M characters and if it stops at that limit does not report incorrect word and line counts:

void PlainTextExtractor::extract(ExtractionResult* result)
{
    QFile file(result->inputUrl());

    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) {
        return;
    }

    
    int lines = 0;
    int words = 0;

    QRegExp wordsRegex("\\b\\w+\\b");

    QTextStream ts(&file);
    //Stop indexing after the first 32000000 chars.
    long int maxChars = 32000000;
    long int charsRead = 0;

    while (!ts.atEnd()) {
        // Limit readLine() to the number of chars left before hitting the maximum.
        QString str = ts.readLine(maxChars - charsRead);
        charsRead += str.length(); // Count how many chars we have read.
        // qDebug() << QString("Line read %1 chars, Total %2").arg(QString::number(str.length()), QString::number(charsRead));
        result->append(str);

        lines += 1;
        words += str.count(wordsRegex);
        if (charsRead >= maxChars) {
            // qDebug() << QString("Abandoning having read %1 chars").arg(QString::number(charsRead));
            result->addType(Type::Text);
            return; // Do not continue with this file and do not use the WordCount or LineCount.
        }
    }

    result->add(Property::WordCount, words);
    result->add(Property::LineCount, lines);
    result->addType(Type::Text);

    return;
}
Comment 21 jmaspons 2014-04-24 11:32:03 UTC
Same problem here with large text files (e.g. http://birdtree.org/bird-tree/archives/Stage2/EricsonStage2_0001_1000.zip after extracting it, around 400Mb). I'm using Kubuntu 14.04
Comment 22 Vishesh Handa 2014-04-25 15:31:10 UTC
Confirmed.

I'll either use the patches added or work on something else. I want to do some more tests.
Comment 23 Michael Tils 2014-04-26 09:25:11 UTC
Same here with large text files. I added a little Python script to show the files currently being indexed by baloo. (It uses ps and balooshow.)
I tried to exclude sql files by appending ",*.sql,*.json,*.log,*log" in ~/.kde4/share/config/baloofilerc but that didn't work (after re-login).
I really like milou and the work you put into it, but at the moment it eats all my RAM and swap. It almost constantly uses between 3 and 8 GB of RAM.
So I really can't use my computer anymore until I kill baloo_file_extractor...
Sadly I cannot code in C++, only Python and PHP, otherwise I could probably fix that. Perhaps the fix above will fix it.
The thing that makes me wonder is how much memory it really uses. My sql file is 960 MB. If I use this little Python script I added, I can see that's the only big file baloo is indexing. The other files are comparatively small. If I open this sql file in kwrite, it uses 2 GB of RAM. None of the lines in this sql file is so large that readline() would eat too much RAM. So I am not sure this is the only problem here. Sadly I cannot share this SQL file with you. It's full of sensitive data.
Comment 24 Michael Tils 2014-04-26 09:26:17 UTC
Created attachment 86275
Little tool to show which files are currently being indexed
Comment 25 jamese 2014-05-05 01:45:40 UTC
Via balooshow, I could see the extractor causing CPU issues for me on large log files (access logs, debug logs for development). Similar to Comment 21.

Something like this should be shipped turned off with a prompt to turn it on if the user wants it and an easy way to turn it off when it causes an unworkable system.

On Kubuntu, the settings file is:

~/.kde/share/config/baloorc

And I followed the instructions here to turn it off, then killed the process manually.
http://community.kde.org/Baloo/Configuration
Comment 26 Vishesh Handa 2014-05-08 15:54:11 UTC
Git commit a572da64129d665cbee9bedd44075327aa03ed5e by Vishesh Handa.
Committed on 08/05/2014 at 15:44.
Pushed by vhanda into branch 'KDE/4.13'.

Do not index everything whose mimetype is text/plain

We do not index the file if -
* It does not end in a .txt
* Its size is > 50 mb

We do not seem to hande large text files very well.

This is a temporary solution in order to avoid the extractor taking too
long to index a file and never finishing.
FIXED-IN: 4.13.1

M  +17   -1    src/file/extractor/app.cpp

http://commits.kde.org/baloo/a572da64129d665cbee9bedd44075327aa03ed5e
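The rule stated in the commit message can be restated as a small predicate. The following is only a sketch in standard C++ and not the code from app.cpp: the function name shouldIndexPlainText is hypothetical, "50 mb" is assumed to mean 50 * 1024 * 1024 bytes, and a file of exactly that size is assumed to still be indexed because the message says "> 50 mb".

```cpp
#include <cstdint>
#include <string>

// Hypothetical restatement of the rule in the commit message above
// (not the actual code from src/file/extractor/app.cpp).
bool shouldIndexPlainText(const std::string& path, std::uint64_t sizeBytes)
{
    const std::string suffix = ".txt";
    // Assumption: "50 mb" means 50 * 1024 * 1024 bytes.
    const std::uint64_t maxBytes = 50ull * 1024 * 1024;

    const bool hasTxtSuffix =
        path.size() >= suffix.size() &&
        path.compare(path.size() - suffix.size(), suffix.size(), suffix) == 0;

    // Index only files that end in .txt and are not larger than the limit.
    return hasTxtSuffix && sizeBytes <= maxBytes;
}
```

Under this rule, the 102 MB apt.dat from comment 18 is skipped because of its extension alone, and any .txt file above the limit is skipped because of its size.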
Comment 27 Christian Mollekopf 2014-05-10 15:02:41 UTC
Git commit 44975bd11c4c97be13a117d5533a55e6bafaccbd by Christian Mollekopf, on behalf of Vishesh Handa.
Committed on 08/05/2014 at 15:44.
Pushed by cmollekopf into branch 'dev/scheduler'.

Do not index everything whose mimetype is text/plain

We do not index the file if -
* It does not end in a .txt
* Its size is > 50 mb

We do not seem to hande large text files very well.

This is a temporary solution in order to avoid the extractor taking too
long to index a file and never finishing.
FIXED-IN: 4.13.1

M  +17   -1    src/file/extractor/app.cpp

http://commits.kde.org/baloo/44975bd11c4c97be13a117d5533a55e6bafaccbd
Comment 28 Martin Steigerwald 2014-05-14 20:40:19 UTC
The test case I made for another bug report shows that appending to files triggers high I/O write load in baloo_file_indexer; maybe this also happens when overwriting parts of larger files [1]. I didn't monitor RAM usage, but the bash script could also be changed to create larger files.

I think it would also be possible to simulate some virtualbox testcase with a flexible I/O tester job, having fio overwrite parts of large files with some random access I/O pattern.

[1] Bug 333655 - Baloo indexing I/O introduces serious noticable delays 
https://bugs.kde.org/show_bug.cgi?id=333655#c36
Comment 29 André M 2014-05-15 14:57:50 UTC
Problem persists here.
For now, it seems like memory usage goes into orbit (10G+ of my 12G PC, and rising) when indexing some big source code folders, like qt5 and android.
Please, do not close a bug without FIX confirmation of users involved.
Thanks.
Comment 30 Benjamin Hodgetts 2014-06-03 08:18:00 UTC
Same here, came in this morning to a completely unresponsive machine which, being a work machine with lots of things that I need still open wasn't great. After 5 minutes I managed to get to top which showed baloo_file_extractor using over half my RAM and killing my CPU.

bhodget+   759  0.0  0.1 378508  4168 ?        SNl  Jun02   0:02 /usr/bin/baloo_file
bhodget+   781  0.0  0.0 309124  3044 ?        SN   Jun02   0:00 /usr/bin/akonadi_baloo_indexer --identifier akonadi_baloo_indexer
bhodget+ 20255 73.8 58.6 2715192 2316252 ?     DN   09:06   1:20 /usr/bin/baloo_file_extractor 51105 51103 51102 51101 51100 51099 51098 51097 51096 51095 51094 51093 51092 51091 51089 51088 51087 51086 51085 51084

Machine was heavily into SWAP and was pretty much a dead-weight by this point.
Comment 31 Vishesh Handa 2014-06-10 13:59:21 UTC
@Andre @Benjamin: I know this sounds stupid, but could you please check that you are on 4.13.1?
Comment 32 André M 2014-06-10 14:02:32 UTC
(In reply to comment #31)
> @Andre @Benjamin: I know this sounds stupid, but could you please check that
> you are on 4.13.1?

I'm sure I was in 4.13.1. Still hadn't tested with 4.13.2, but will do as soon as I can take off my computer from production.
Comment 33 Vishesh Handa 2014-06-11 09:21:49 UTC
(In reply to comment #32)
> (In reply to comment #31)
> > @Andre @Benjamin: I know this sounds stupid, but could you please check that
> > you are on 4.13.1?
> 
> I'm sure I was in 4.13.1. Still hadn't tested with 4.13.2, but will do as
> soon as I can take off my computer from production.

No, 4.13.1 is fine. If it is still causing problems for you then please follow the guide over here - http://community.kde.org/Baloo/Debugging - and try to identify the file which may be causing these huge memory spikes. The comments on this report seem to indicate that text files were the problem; this should now be fixed.
Comment 34 André M 2014-06-11 18:18:02 UTC
Just tested with 4.13.2 and kernel 3.15, and problem is gone. 
As I said before, with (and only with) Baloo activated, my memory usage was going very high, even starting to swap with 12G of RAM, but htop/top/ps didn't show any high memory (or even cache) usage from any process.
So, I'm convinced that the issue was related to some problems with Btrfs (used in all partitions). Maybe kernel memory was not being properly freed after baloo opened, indexed and closed files. But, as you guys can imagine, it's very hard to debug this kind of issue.
Thanks very much, and keep up the great work. I'm loving the speed of baloo, and am anxious to see all the semantic (folders/files, etc) desktop features back again.
Comment 35 Michal Piotrowski 2014-08-29 11:16:52 UTC
I think the bug should be reopened. I currently use KDE 4.3.13 and the bug still exists. In my case Baloo got stuck on a large text log file. The size of the file is 3969425301 bytes (almost 4 GB). Maybe the data type used for checking the file size is not large enough?
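The arithmetic behind that question can be checked in isolation. The following is only a sketch in standard C++ with a hypothetical helper name, fitsInSigned32. It shows that the reported size does not fit in a signed 32-bit integer, whose maximum is 2147483647. It does not show that Baloo stores the size in such a type: Qt's QFileInfo::size() returns a 64-bit qint64, so whether a narrower type is involved anywhere remains an open assumption.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: does a file size fit in a signed 32-bit integer?
bool fitsInSigned32(std::int64_t value)
{
    return value >= std::numeric_limits<std::int32_t>::min() &&
           value <= std::numeric_limits<std::int32_t>::max();
}
```

fitsInSigned32(3969425301LL) is false, while a size at the 50 MB limit, 50LL * 1024 * 1024, fits without trouble.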
Comment 36 Mircea Kitsune 2014-08-29 11:41:03 UTC
This might be slightly outside the scope of this bug report. But I really really wish the KDE team would disable Baloo by default. I know file indexing is seen as a necessity for any modern OS, but IMO it's more important to keep something this unstable off. File indexing is probably the #1 reason why KDE has a reputation for being slow and bloated and unusable on machines with low resources.
Comment 37 Vishesh Handa 2014-09-09 12:51:32 UTC
@Michal Piotrowski: Could you please open another bug report about the mimetype 'text/log' please? It'll be easier to track it over there. Otherwise I can reopen the bug.

@Mircea: You might just be right. Maybe not on by default, or better defaults where only your Documents/Music/Images/etc are indexed by default. Perhaps. I have been thinking about this, but we need more data to actually take a call. Plus, distros are always free to ship with it disabled by default.
Comment 38 Michal Piotrowski 2014-09-09 12:56:27 UTC
(In reply to Vishesh Handa from comment #37)
> @Michal Piotrowski: Could you please open another bug report about the
> mimetype 'text/log' please? It'll be easier to track it over there.
> Otherwise I can reopen the bug.

I have checked that command 'file --mime-type archive-20140508-0700.log' returns 'text/plain'. So should I create new bug about 'text/log' mime type or should we leave it in this bug?
Comment 39 Michal Piotrowski 2014-09-09 12:59:02 UTC
I have decided to open a new bug, because I've checked that Dolphin reports this file as a program log type.
Comment 40 Mykola Krachkovsky 2015-11-26 08:02:21 UTC
It looks like the bug has returned. baloo_file_extractor eats a lot of memory (a few GB) and the system becomes unresponsive until that process is killed, then it all starts again. I have to chmod -x /usr/bin/baloo_file_extractor to make my system usable.
Comment 41 Mircea Kitsune 2015-11-26 11:51:00 UTC
Baloo eating up memory has always been a major problem (even back when it was called Akonadi). It's not always Baloo itself that does it... reading and writing so much to the hard drive causes a lot of caching, which you can clear with the root command "echo 3 > /proc/sys/vm/drop_caches". For this reason I always suggested that Baloo is disabled by default, but the KDE team doesn't share my opinion on this.
Comment 42 John Andersen 2016-01-16 05:31:44 UTC
Apparently the hack installed back in May of 2014 to prevent indexing anything but .txt files under 50 MB is still resident in some packages of extractor/app.cpp:

 if (mimetype == QLatin1String("text/plain")) {
+    if (!url.endsWith(".txt")) {
+        mimetype.clear();
+    }
+
+    QFileInfo fileInfo(url);
+    if (fileInfo.size() >= 50 * 1024 * 1024 ) {
+        mimetype.clear();
+    }

See: http://webcache.googleusercontent.com/search?q=cache:LUTPrh1zmZ8J:r.git.net/kde-commits/2014-05/msg02993.html+&cd=4&hl=en&ct=clnk&gl=us

I would like to index a whole boatload of source code, but because it  does not have a file extension of .txt the extractor fails, so then Baloo gets nothing to index.

It works fine on KDE4, but on Plasma 5  version 15.12.0 (Manjaro / Arch)  I can't index any plain text files that do not end with an extension of .txt.
Comment 43 John Andersen 2016-02-04 01:43:23 UTC
Replacing dead link in above reply:

http://osdir.com/ml/kde-commits/2014-05/msg02993.html

This is becoming a very debilitating limitation for me.
Comment 44 Brendon Higgins 2016-07-08 03:46:17 UTC
Does the "avoid indexing large text files" hack in extractor/app.cpp (which in master has a 10 MB limit) still work if the text file starts small but happens to rapidly grow large from another process at the same time it is being indexed? I had a simulation running that appended (a lot of) numeric data to an initially empty text file, and by my observation it seemed like baloo blew up on it. I'm unfamiliar with baloo's code, but taking a glance at it (particularly app.cpp) suggests to me that this scenario might be possible.

On a separate machine, and maybe a separate issue, I noticed baloo_file_extractor running for a long time with significant RAM usage (>2GB) that did not decline until it had apparently finished indexing a few thousand files that I had recently changed. Why would it need to hold so much RAM as it goes from one file to the next? In the source I notice a database handler that seems to persist for the entire batch (this is even marked with a FIXME). Could there perhaps be a leak in that?
Comment 45 Martin Senftleben 2016-10-29 10:07:58 UTC
I'm not sure if it's wise to add to this thread, but the topic is the same:

For three days now baloo_file_extractor has been, according to my ksysguard, the top RAM and particularly swap memory eater. The most peculiar, but maybe normal, thing is that it doesn't eat up all RAM (I have 16 GB installed, but ksysguard shows only 9 GB being used), but feeds on swap memory. The swap file is now at its limit (6.3/6.3 GB) and the machine is slowing down.
I tried various commands mentioned here, all failed on my machine or produce no usable output:
"ps aux | grep baloo_file_extractor" produces:
"username  3076 43.5 49.3 272558692 8087596 ?   DNl  Okt24 3018:54 /usr/bin/baloo_file_extractor
username  3575  0.0  0.0  10776  2016 pts/1    S+   12:03   0:00 grep --color=tty -d skip baloo_file_extractor"
and "sudo echo 3 > /proc/sys/vm/drop_caches" (in the hope that this might clear up space) produces:
"Keine Berechtigung" (= no permission)
Other things I didn't try.
I run Manjaro Arch-Linux up-to-date, 64 bit.