After opening some .pdf files with Okular, nepomukindexer grabs 25% of the CPU. It happens sometimes, but not always, and killing the process doesn't guarantee that the problem will be solved: the odds are that it will run again after a few minutes and grab 25% of the CPU once more.

Reproducible: Sometimes

Steps to Reproduce:
1. Turn on the laptop
2. Open .pdf files with Okular

Actual Results:
nepomukindexer grabs 25% of CPU

Kernel version: 3.7.9-2-ARCH x86_64 GNU/Linux
The PDF files are probably quite big, so indexing them takes some time. Is 25% CPU usage for some time such a big problem? (BTW, this is more of a Dolphin thing than an Okular thing: you select the file, which loads the metadata. Dolphin sees that the file doesn't have any metadata yet and indexes it, which requires loading all the plain-text content. This seems to take a while.) Does it get stuck, or does it just consume CPU for a bit?
Unfortunately, it gets stuck, so I have to kill the process because my laptop gets hot. The PDF files are around 5-10 MB. Killing it is no big deal, but it starts again after a few minutes and gets stuck once more.
(In reply to comment #2)
> Unfortunately, it gets stuck, so I have to kill the process because my
> laptop gets hot. The PDF files are around 5-10 MB. Killing it is no big
> deal, but it starts again after a few minutes and gets stuck once more.

Do you think you could run $nepomukindexer <pdfFile> and see how long it actually takes? Maybe gdb into the process and get the backtrace so that we know what it is doing? Or maybe you could just upload the PDF file? :)
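For reference, the two checks could look something like this (the file path is just a placeholder):

```sh
# Time a one-off indexing run of a single file:
time nepomukindexer /path/to/file.pdf

# Attach gdb to a running indexer and dump a backtrace of all threads:
gdb -batch -ex "thread apply all bt" -p "$(pgrep -o nepomukindexer)"
```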
nepomukindexer should really limit its maximum CPU usage to 50 or 20%, if not even lower. I have several PDF files here of about 10, sometimes 20 or even 100 MB. Since I completely removed my messy pre-4.10 file index, Nepomuk is currently reindexing everything. This process takes a long time, which is okay; I'm not in a hurry. But when my PC is idle for a minute or so and nepomukindexer becomes active and starts indexing those big PDF files, it runs at 100% CPU the whole time, which noticeably raises the fan speed and the power consumption. I already had to write a shell script that watches Nepomuk and limits its CPU usage when it starts hogging the CPU again.
Addition: it doesn't really get stuck here. I turned up the memory limit for Nepomuk temporarily to accelerate the indexing process and let it run for a while. After some time it finished one file and started on the next. So it *does* index the files, but it consumes far too much CPU time.
I tried to reproduce the problem many times, only to find that nepomukindexer consumed far too much CPU time regardless of what I did. For example, I was just browsing the web when it became active and started the entire indexing process. Whether or not it gets stuck, nepomukindexer takes a long time and the CPU fan speeds up considerably. Moreover, "$nepomukindexer <pdfFile>" took only a few seconds. I'll try gdb tomorrow. Actually, Janek Bevendorff gave a great explanation of the problem. Janek, could you upload your shell script?
Actually, I blogged the script today: http://www.refining-linux.org/archives/64/Programmatically-limit-CPU-usage-of-certain-processes/ It's a dirty hack, but it works as a workaround. Essentially it's nothing more than a script that calls cpulimit (which may or may not be in your distro's repositories; there's a link to the GitHub page in the blog post) for nepomukindexer processes. It could also be done with cgroups, though.
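For anyone who just wants the gist, here is a minimal sketch of such a watcher, assuming cpulimit is installed (the 20% cap and the 5-second poll interval are illustrative, not necessarily the values from the blog script):

```sh
#!/bin/sh
# Poll for a running nepomukindexer and cap its CPU usage with cpulimit.
while true; do
    pid=$(pgrep -o nepomukindexer)
    if [ -n "$pid" ]; then
        # -p selects the target PID, -l sets the allowed CPU percentage.
        # cpulimit blocks until the target process exits.
        cpulimit -p "$pid" -l 20
    fi
    sleep 5
done
```

The cgroups route would instead put the process into a group with a CPU quota, which throttles it without any polling loop.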
Alright, so I've pushed this into KDE/4.10:

commit 5b746fe5ac9c32bd32830995a34c2849c813716d
Author: Vishesh Handa <me@vhanda.in>
Date:   Wed Mar 6 14:59:56 2013 +0530

    PopplerExtractor: Do not extract all the plain text

    Virtuoso cannot handle all the plain text, it's best to only extract
    how much virtuoso can handle, instead of discarding the extra text
    later.

It should improve the situation, especially since we don't save the entire plain-text index, but we will need to find a better solution for the CPU and memory usage. I have an idea for fixing the memory usage, but it's a HUGE change and will probably require extensive testing. It might be ready in time for 4.12. Or maybe I'll try to contact the Virtuoso team again to see if there is any way to stream in the text.

I'm not sure what to do about the CPU usage. Could someone maybe give me a callgrind log of nepomukindexer indexing the PDF file? That way we'll know where the bottleneck is.
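(For reference, such a log can be captured with valgrind's callgrind tool; the file name below is a placeholder:)

```sh
# Run the indexer under callgrind; profile data is written to callgrind.out.<pid>
valgrind --tool=callgrind nepomukindexer file.pdf
```

The resulting callgrind.out.* file can then be opened in KCachegrind for inspection.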
Created attachment 77794 [details]
Callgrind of nepomukindexer indexing larger PDF file

I attached a callgrind log of `nepomukindexer file.pdf`. I installed debug symbols for nepomuk-core; if you need more, please tell me. The behavior when running nepomukindexer manually seems to be the same as with the automatic indexing: it brings the CPU up to about 99%. The only difference is that it finishes much quicker (valgrind slowed it down, of course, but when run directly it only takes a few seconds). But I guess that's simply because of the memory limit that is applied to the automatic indexing.
With the 4.11 release, if you kill nepomukindexer, it will not try to reindex that same file again. That's all I can do: we need to extract the plain-text content of the file, and based on the callgrind log, it seems that all of the time is spent fetching it from the huge PDF file. We do limit the amount of text to how much we can store, but I'm not sure what else we can do to improve the situation. Maybe future releases of Poppler will make the indexing of PDF files much faster. I cannot think of anything else.

I'm marking this as fixed, as the above patch improves the situation and there isn't much we can do apart from that.