| Summary: | Baloo file extractor ate all my PC's RAM | | |
|---|---|---|---|
| Product: | [Unmaintained] Baloo | Reporter: | Kyrylo Bohdanenko <kirill.bogdanenko> |
| Component: | General | Assignee: | Pinak Ahuja <pinak.ahuja> |
| Status: | RESOLVED FIXED | | |
| Severity: | grave | CC: | andre.vmatos, arthur, bastian_knight, ben, boblovgren55, brendon, james.ellis, joanmaspons, jsamyth, kde, kdebugs, linux, m01brv, Martin, mmar, nospam, rick, rivaldi8, simonandric5, sonichedgehog_hyperblast00, w01dnick, zanetu |
| Priority: | NOR | | |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Platform: | Kubuntu | | |
| OS: | Linux | | |
| Latest Commit: | http://commits.kde.org/baloo/44975bd11c4c97be13a117d5533a55e6bafaccbd | Version Fixed In: | 4.13.1 |
| Sentry Crash Report: | | | |

Attachments:

- baloo_file_extractor eating my RAM
- Baloo using 11gb Ram
- massif output
- Little tool to show which files are currently being indexed
Description
Kyrylo Bohdanenko
2014-03-21 23:50:15 UTC
Created attachment 85679 [details]
baloo_file_extractor eating my RAM
These issues are slightly hard to reproduce, as they depend a lot on the file being indexed. You'll need to figure out which file is causing the excessive memory usage. Here is how you can do so:

```
$ ps aux | grep baloo_file_extractor
```

This will give you a list of numbers. Each of these numbers represents a file. You can find out exactly which files those are by running balooshow:

```
$ balooshow 4 56 65 43 .. other numbers
```

Try running baloo_file_extractor on each of these files individually and see whether you can reproduce the problem with a single file or only with all of them together. In the future I'll try to make a tool that does this automatically and logs memory usage, but until then you'll have to do it manually.

It seems that baloo_file_extractor misbehaves with (pretty big) *.vdi files. I have:

```
30G /windows/D/Work/VMs/Kubuntu Raring/Kubuntu Raring.vdi
```

on NTFS:

```
/dev/sda7 on /windows/D type fuseblk (rw,nosuid,nodev,allow_other,default_permissions,blksize=4096)
```

When I run baloo_file_extractor on that file, it causes a huge CPU load and starts to eat RAM very quickly. I can share that *.vdi with you if required.

Interesting. I'll borrow some large vdi files from a colleague; the 4 GB one I have does not cause any problems. Oddly enough, we do not have any indexers for vdi files, so they should be ignored. Could you possibly run massif on the file?

```
$ valgrind --tool=massif baloo_file_extractor theFile.vdi
```

It will output a file which you can then upload.

In my case baloo_file_extractor eats ~2 GB of RAM, hanging my system, on a ~4 GB file that contains lots of floating-point numeric data (stored as ASCII text, with many very long lines of ~200,000 characters each).

I have the same problem. Baloo seems to use 100% of one core and up to about 10 GB of RAM, stops for about a second, drops everything out of RAM, and starts building up again.

```
$ ps aux | grep baloo_file_extractor
rick  4934  100 38.1 6609872 6253028 ?     RN 10:16 2:40 /usr/bin/baloo_file_extractor 19009 19008 19007 19006 19005 19004 19003 19002 19001 19000
rick  4979  0.0  0.0   11752     924 pts/2 S+ 10:19 0:00 grep --color=auto baloo_file_extractor
```

Created attachment 86146 [details]
Baloo using 11gb Ram
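The manual identification workflow described above (grep the extractor's arguments out of `ps`, then resolve each document ID with `balooshow`) can be sketched as a small Python helper. This is not part of Baloo; the function names are hypothetical, and only the `ps`/`balooshow` command names come from the thread:

```python
# Hypothetical helper: list the document IDs a running baloo_file_extractor
# was asked to index, so each can be resolved to a path with balooshow.
import subprocess

def parse_extractor_ids(cmdline):
    """Extract numeric document IDs from a command line such as
    '/usr/bin/baloo_file_extractor 19009 19008 19007'.
    Lines for other programs (e.g. the grep itself) yield []."""
    parts = cmdline.split()
    if not parts or "baloo_file_extractor" not in parts[0]:
        return []
    return [int(p) for p in parts[1:] if p.isdigit()]

def files_being_indexed():
    """Resolve the IDs to file paths via balooshow (requires Baloo, and
    raises CalledProcessError if no extractor process is running)."""
    out = subprocess.check_output(
        ["ps", "-C", "baloo_file_extractor", "-o", "args="], text=True)
    ids = []
    for line in out.splitlines():
        ids.extend(parse_extractor_ids(line))
    if not ids:
        return ""
    return subprocess.check_output(["balooshow"] + [str(i) for i in ids],
                                   text=True)
```

The parsing step is the only part that matters for narrowing down a bad file; once you have the paths, run `baloo_file_extractor` on each one individually as suggested above.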
I'm not having an issue with Baloo file extractor using a lot of RAM, but I am noticing it shoot up to 100% of one core constantly, plus a lot of unnecessary disk activity that wasn't happening with KDE 4.11.5. It's surprising there's no GUI switch to turn Baloo search off; I know there are other ways, but as currently implemented a new user may not know how to turn it off, and it's being forced down their throat if they upgrade. That in itself is not appropriate. The developers should have included a way to turn Baloo off via the GUI, especially considering that my HD is going crazy and it's eating away at the CPU. As a test I added my home folder to the "Desktop Search" exclusion list, and baloo file cleaner went crazy on my HD and would not stop accessing it. Then I removed the home folder from the exclusion list and added it back, and Baloo is at a standstill. Baloo seems fairly broken to me; my system is now worse than when Nepomuk was on it. Thanks a lot, KDE team.

I think I tracked it down to BOINC causing Baloo to go completely bonkers. Now, not everybody is going to figure out how to add their BOINC data folder to the exclusion list, so is there a way to make Baloo play nice with BOINC without burning somebody's hard drive out in two weeks?

I don't use or even have BOINC installed. I thought the problem was related to VDI images, but even after excluding the mountpoint where the images sit, Baloo still eats all my RAM until the system starts swapping and becomes unusable. Killing the baloo processes or logging out didn't help; the only way to take back my RAM is restarting.

baloo_file_extractor consumes all my RAM with the file:

```
X-Plane 10 Demo/Resources/default scenery/default apt dat/Earth nav data/apt.dat
```

You can get this file by installing the X-Plane 10 Demo from http://www.x-plane.com/downloads/landing/ (my X-Plane version is 10.25).
I upgraded to Kubuntu 14.04 two days ago and have the same problem: memory usage of baloo_file_extractor grows to about 11 GB, then the process disappears and the next process tries to index the same file all over again. I'm adding my info here since the problematic file is publicly available. The file it is trying to index is the DNA sequence of the human genome, available here: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

The problem occurred with the uncompressed file, which is about 3 GB. The file is pure ASCII, so I guess the problem is the same as the one Roman has seen, since his data is also just ASCII. Of course indexing of large text files should be fixed, but perhaps it's also an idea not to index ASCII text that is larger than some threshold, perhaps 100 MB or so. And at least for FASTA files, it doesn't make sense to record things like character, word and line count. While I very much appreciate the effort spent on this and would really like to use desktop search, I also think that bugs like these are a good reason for having a simple way to temporarily disable indexing, at least at this stage of development.

Agreed. I'll add a 20-30 MB limit on pure text files. I've added a page over here which documents the config values used by Baloo; this can also be used to disable it. http://community.kde.org/Baloo/Configuration

Using Baloo search, it's coming up with completely irrelevant search results, or doesn't find the file on my hard drive at all. This piece of software is very broken; it's not even beta. Why was it released? It doesn't even work.

Guys, if it's not working for you, please try to provide accurate debugging info: http://community.kde.org/Baloo/Debugging

Created attachment 86228 [details]
massif output
ms_print file for the extractor running on X-Plane 10 Demo/Resources/default scenery/default apt dat/Earth nav data/apt.dat.
I have attached a massif ms_print file for the extractor running on X-Plane 10 Demo/Resources/default scenery/default apt dat/Earth nav data/apt.dat. apt.dat is a 2.5-million-line text file (102 MB) and indexing it is too much for my system with 2 GB of RAM.

I have managed to stop baloo_file_extractor using up all my RAM by limiting the number of lines it will index in any one file. I changed kfilemetadata's plaintextextractor.cpp lines 54-55 to:

```cpp
// Stop indexing after the first 500000 lines and restrict lines to 10000 chars.
int MaxLines = 500000;
int MaxCharsPerLine = 10000;
while (!ts.atEnd() && lines < MaxLines) {
    QString str = ts.readLine(MaxCharsPerLine);
```

I have also limited the number of characters in a line to prevent long lines from using up too much RAM (I do not know if this is necessary). These values are currently arbitrary and some testing needs to be done with different files (I have not tested with files with long lines at all!). Limiting the number of characters in a line this way will result in words being broken up, and as I am not familiar with Xapian or Baloo I am not sure what the consequences of this will be, but I imagine they are not as serious as crippling the system by using all its RAM. This method has the advantage of at least indexing some of the file. A better solution would involve indexing the whole file (maybe by breaking the indexing into parts?). I will try to do more testing if I have time.

If I have done this right, the following code limits the indexing of a file to 32M characters and, if it stops at that limit, does not report incorrect word and line counts:

```cpp
void PlainTextExtractor::extract(ExtractionResult* result)
{
    QFile file(result->inputUrl());
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) {
        return;
    }

    int lines = 0;
    int words = 0;
    QRegExp wordsRegex("\\b\\w+\\b");

    QTextStream ts(&file);

    // Stop indexing after the first 32000000 chars.
    long int maxChars = 32000000;
    long int charsRead = 0;
    while (!ts.atEnd()) {
        // Limit readLine() to the number of chars left before hitting the maximum.
        QString str = ts.readLine(maxChars - charsRead);
        charsRead += str.length();
        // qDebug() << QString("Line read %1 chars, Total %2").arg(QString::number(str.length()), QString::number(charsRead));

        result->append(str);
        lines += 1;
        words += str.count(wordsRegex);

        if (charsRead >= maxChars) {
            // qDebug() << QString("Abandoning having read %1 chars").arg(QString::number(charsRead));
            // Do not continue with this file and do not report WordCount or LineCount.
            result->addType(Type::Text);
            return;
        }
    }

    result->add(Property::WordCount, words);
    result->add(Property::LineCount, lines);
    result->addType(Type::Text);
}
```

Same problem here with large text files (e.g. http://birdtree.org/bird-tree/archives/Stage2/EricsonStage2_0001_1000.zip, around 400 MB after extracting). I'm using Kubuntu 14.04.

Confirmed. I'll either use the patches added or work on something else; I want to do some more tests first.

Same here with large text files. I've added a little Python script to show the files currently being indexed by baloo (it uses ps and balooshow). I tried to exclude sql files by appending ",*.sql,*.json,*.log,*log" in ~/.kde4/share/config/baloofilerc, but that didn't work (even after re-login). I really like Milou and the work you put into it, but at the moment it eats all my RAM and swap: it uses between 3 and 8 GB of RAM nearly constantly, so I really can't use my computer any more until I kill baloo_file_extractor. Sadly I cannot code in C++, only Python and PHP, otherwise I could probably fix that; perhaps the fix above will. The thing that makes me wonder is how much memory it really uses. My sql file is 960 MB. Using the little Python script I added, I can see that it's the only big file Baloo is indexing; the other files are comparatively small.
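The capping strategy in the C++ patch above (bound the total characters and the per-line length so that a multi-gigabyte text file cannot exhaust RAM, and refuse to report word/line counts when truncated) can be sketched in Python. This is illustrative only, not Baloo code; the function name and return shape are made up for the example:

```python
# Sketch of the capped plain-text extraction technique from the C++ patch
# above: stop after max_chars total characters and clip each line to
# max_line_len, reporting whether the file was truncated.
import io
import re

WORD_RE = re.compile(r"\b\w+\b")

def extract_capped(stream, max_chars=32_000_000, max_line_len=10_000):
    """Read at most max_chars characters from a text stream.
    Returns (chunks, lines, words, truncated); when truncated is True,
    the line/word counts are incomplete and, mirroring the patch above,
    a caller should not record them as file metadata."""
    chunks, lines, words, read = [], 0, 0, 0
    for raw in stream:
        line = raw.rstrip("\n")[:max_line_len]
        # Never read past the overall budget.
        line = line[:max_chars - read]
        read += len(line)
        chunks.append(line)
        lines += 1
        words += len(WORD_RE.findall(line))
        if read >= max_chars:
            return chunks, lines, words, True
    return chunks, lines, words, False
```

As the commenter notes, clipping lines can split words at the cut point, which may slightly pollute the index; the trade-off is bounded memory regardless of file size.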
If I open this sql file in KWrite, it uses 2 GB of RAM. None of the lines in this sql file is so long that readLine() would eat too much RAM, so I am not sure this is the only problem here. Sadly I cannot share this SQL file with you; it's full of sensitive data.

Created attachment 86275 [details]
Little tool to show which files are currently being indexed
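As an aside on the failed exclusion attempt mentioned above: Baloo reads exclusion settings from the `[General]` section of baloofilerc, so bare patterns appended to the file would be ignored. The key names below are taken from later Baloo releases and are an assumption for the 4.13-era version discussed in this thread; the authoritative list is the Configuration wiki page linked here, and the folder path is purely illustrative:

```ini
[General]
# Glob patterns for files Baloo should skip (key name as documented for
# later releases; may differ in 4.13).
exclude filters=*.sql,*.json,*.log
# Folders to skip entirely; [$e] lets KConfig expand $HOME. Example path.
exclude folders[$e]=$HOME/VMs/
```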
Via balooshow I could see the extractor causing CPU issues for me on large log files (access logs, debug logs for development), similar to comment 21. Something like this should be shipped turned off, with a prompt to turn it on if the user wants it and an easy way to turn it off when it causes an unworkable system. On Kubuntu, the settings file is ~/.kde/share/config/baloorc, and I followed the instructions here to turn it off, then killed the process manually: http://community.kde.org/Baloo/Configuration

Git commit a572da64129d665cbee9bedd44075327aa03ed5e by Vishesh Handa. Committed on 08/05/2014 at 15:44. Pushed by vhanda into branch 'KDE/4.13'.

Do not index everything whose mimetype is text/plain

We do not index the file if -
* It does not end in a .txt
* Its size is > 50 mb

We do not seem to handle large text files very well. This is a temporary solution in order to avoid the extractor taking too long to index a file and never finishing.
FIXED-IN: 4.13.1
M +17 -1 src/file/extractor/app.cpp
http://commits.kde.org/baloo/a572da64129d665cbee9bedd44075327aa03ed5e

Git commit 44975bd11c4c97be13a117d5533a55e6bafaccbd by Christian Mollekopf, on behalf of Vishesh Handa. Committed on 08/05/2014 at 15:44. Pushed by cmollekopf into branch 'dev/scheduler'.

Do not index everything whose mimetype is text/plain

We do not index the file if -
* It does not end in a .txt
* Its size is > 50 mb

We do not seem to handle large text files very well. This is a temporary solution in order to avoid the extractor taking too long to index a file and never finishing.
FIXED-IN: 4.13.1
M +17 -1 src/file/extractor/app.cpp
http://commits.kde.org/baloo/44975bd11c4c97be13a117d5533a55e6bafaccbd

The testcase I made for another bug report shows that appending to a file triggers high I/O write load in baloo_file_indexer; maybe this also happens when overwriting parts of larger files [1]. I didn't monitor RAM usage, but the bash script could also be changed to create larger files.
I think it would also be possible to simulate a virtualbox-style testcase with a flexible I/O tester job, having fio overwrite parts of large files with some random-access I/O pattern.

[1] Bug 333655 - Baloo indexing I/O introduces serious noticable delays: https://bugs.kde.org/show_bug.cgi?id=333655#c36

Problem persists here. For now it seems that memory usage goes into orbit (10 GB+ of my 12 GB PC, and rising) while indexing some big source-code folders, like qt5 and android. Please do not close a bug without fix confirmation from the users involved. Thanks.

Same here: I came in this morning to a completely unresponsive machine which, being a work machine with lots of things that I need still open, wasn't great. After 5 minutes I managed to get to top, which showed baloo_file_extractor using over half my RAM and killing my CPU.

```
bhodget+   759  0.0  0.1  378508    4168 ?     SNl Jun02 0:02 /usr/bin/baloo_file
bhodget+   781  0.0  0.0  309124    3044 ?     SN  Jun02 0:00 /usr/bin/akonadi_baloo_indexer --identifier akonadi_baloo_indexer
bhodget+ 20255 73.8 58.6 2715192 2316252 ?     DN  09:06 1:20 /usr/bin/baloo_file_extractor 51105 51103 51102 51101 51100 51099 51098 51097 51096 51095 51094 51093 51092 51091 51089 51088 51087 51086 51085 51084
```

The machine was heavily into swap and was pretty much dead weight by this point.

@Andre @Benjamin: I know this sounds stupid, but could you please check that you are on 4.13.1?

(In reply to comment #31)
> @Andre @Benjamin: I know this sounds stupid, but could you please check that
> you are on 4.13.1?

I'm sure I was on 4.13.1. I still hadn't tested with 4.13.2, but will do so as soon as I can take my computer out of production.

(In reply to comment #32)
> I'm sure I was on 4.13.1. I still hadn't tested with 4.13.2, but will do so
> as soon as I can take my computer out of production.

No, 4.13.1 is fine. If it is still causing problems for you then please follow the guide over here - http://community.kde.org/Baloo/Debugging - and try to identify the file which may be causing these huge memory spikes. The comments on this report seem to indicate that text files were the problem, and this should now be fixed.

Just tested with 4.13.2 and kernel 3.15, and the problem is gone. As I said before, with (and only with) Baloo activated, my memory usage was going very high, even hitting swap with 12 GB of RAM, but htop/top/ps didn't show any high memory (or even cache) usage by any process. So I'm convinced the issue was related to some problem with Btrfs (used on all partitions): maybe kernel memory not being properly freed after Baloo opened, indexed and closed files. But, as you can imagine, this kind of issue is very hard to debug. Thanks very much, and keep up the great work. I'm loving the speed of Baloo, and am anxious to see all the semantic desktop features (folders/files, etc.) back again.

I think the bug should be reopened. I currently use KDE 4.13.3 and the bug still exists. In my case Baloo gets stuck on a large text log file. The size of the file is 3969425301 bytes (almost 4 GB) - note that this exceeds the range of a signed 32-bit integer. Maybe the data type used for checking the file size is not large enough?

This might be slightly outside the scope of this bug report, but I really wish the KDE team would disable Baloo by default. I know file indexing is seen as a necessity for any modern OS, but IMO it's more important to keep something this unstable off. File indexing is probably the #1 reason why KDE has a reputation for being slow, bloated and unusable on machines with low resources.

@Michal Piotrowski: Could you please open another bug report about the mimetype 'text/log'? It'll be easier to track it over there. Otherwise I can reopen this bug.

@Mircea: You might just be right. Maybe not on by default, or better defaults where only your Documents/Music/Images/etc. are indexed. Perhaps.
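The text/plain guard introduced by the commits earlier in this thread (skip any text/plain file that does not end in .txt or is 50 MB or larger) amounts to a simple predicate. Below is a paraphrase in Python for illustration; the function name is hypothetical and this is not the actual app.cpp code:

```python
# Paraphrase of the 4.13.1 workaround from the commits above: a text/plain
# file is only indexed when it ends in .txt and is under the size limit.
TEXT_SIZE_LIMIT = 50 * 1024 * 1024  # 50 MB, per the commit message

def should_index_plain_text(path, size_bytes):
    """Return True only for .txt files under the size limit."""
    if not path.endswith(".txt"):
        return False
    return size_bytes < TEXT_SIZE_LIMIT
```

This makes the later complaints in the thread concrete: source code detected as text/plain fails the `.txt` suffix test, so it is silently skipped.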
I have been thinking about this, but we need more data before actually making that call. Plus, distros are always free to ship with it disabled by default.

(In reply to Vishesh Handa from comment #37)
> @Michal Piotrowski: Could you please open another bug report about the
> mimetype 'text/log'? It'll be easier to track it over there.
> Otherwise I can reopen the bug.

I have checked that the command 'file --mime-type archive-20140508-0700.log' returns 'text/plain'. So should I create a new bug about the 'text/log' mime type, or should we leave it in this bug? I have decided to open a new bug, because I've checked that Dolphin reports this file as a program log type.

It looks like the bug has returned. baloo_file_extractor eats a lot of memory (a few GBs) and the system becomes unresponsive until that process is killed, then it all starts again. I had to chmod -x /usr/bin/baloo_file_extractor to make my system usable.

Baloo eating up memory has always been a major problem (even back when it was called Akonadi). It's not always Baloo itself that does it: reading and writing so much to the hard drive causes a lot of caching, which you can clear with the root command "echo 3 > /proc/sys/vm/drop_caches". For this reason I always suggested that Baloo be disabled by default, but the KDE team doesn't share my opinion on this.

Apparently the hack installed back in August of 2014 to prevent indexing anything but .txt files under 50 MB is still resident in some packages' extractor/app.cpp:

```cpp
 if (mimetype == QLatin1String("text/plain")) {
+    if (!url.endsWith(".txt")) {
+        mimetype.clear();
+    }
+
+    QFileInfo fileInfo(url);
+    if (fileInfo.size() >= 50 * 1024 * 1024) {
+        mimetype.clear();
+    }
```

See: http://webcache.googleusercontent.com/search?q=cache:LUTPrh1zmZ8J:r.git.net/kde-commits/2014-05/msg02993.html+&cd=4&hl=en&ct=clnk&gl=us

I would like to index a whole boatload of source code, but because it does not have a file extension of .txt the extractor fails, so Baloo gets nothing to index.
It works fine on KDE4, but on Plasma 5 version 15.12.0 (Manjaro / Arch) I can't index any plain-text files that do not end with an extension of .txt.

Replacing the dead link in the reply above: http://osdir.com/ml/kde-commits/2014-05/msg02993.html

This is becoming a very debilitating limitation for me.

Does the "avoid indexing large text files" hack in extractor/app.cpp (which in master has a 10 MB limit) still work if the text file starts small but rapidly grows large from another process while it is being indexed? I had a simulation running that appended (a lot of) numeric data to an initially empty text file, and from my observation it seemed that baloo blew up on it. I'm unfamiliar with baloo's code, but a glance at it (particularly app.cpp) suggests to me that this scenario might be possible.

On a separate machine, and maybe as a separate issue, I noticed baloo_file_extractor running for a long time with significant RAM usage (>2 GB) that did not decline until it had apparently finished indexing a few thousand files that I had recently changed. Why would it need to hold so much RAM as it goes from one file to the next? In the source I notice a database handler that seems to persist for the entire batch (this is even marked with a FIXME). Could there perhaps be a leak in that?

I'm not sure if it's wise to add to this thread, but the topic is the same: for three days now baloo_file_extractor has been, according to my KSysGuard, the top RAM and particularly swap memory eater. The most peculiar, but maybe normal, thing is that it doesn't eat up all the RAM (I have 16 GB installed, but KSysGuard shows only 9 GB being used), but feeds on swap memory. The swap file is now at its limit (6.3/6.3 GB) and the machine is slowing down. I tried various commands mentioned here; all failed on my machine or produced no usable output.

"ps aux | grep baloo_file_extractor" produces:

```
username 3076 43.5 49.3 272558692 8087596 ?     DNl Okt24 3018:54 /usr/bin/baloo_file_extractor
username 3575  0.0  0.0     10776    2016 pts/1 S+  12:03    0:00 grep --color=tty -d skip baloo_file_extractor
```

and "sudo echo 3 > /proc/sys/vm/drop_caches" (in the hope that this might clear up space) produces "Keine Berechtigung" (= no permission). Other things I didn't try. I run up-to-date Manjaro Arch Linux, 64-bit.