Bug 316075

Summary: nepomukindexer grabs 25% of CPU when it indexes pdf files
Product: [Unmaintained] nepomuk
Component: fileindexer
Status: RESOLVED FIXED
Severity: normal
Priority: NOR
Version: 4.10.1
Target Milestone: ---
Platform: Arch Linux
OS: Linux
Reporter: Stavros Mekesis <stavros.mek>
Assignee: Nepomuk Bugs Coordination <nepomuk-bugs>
CC: kde, kde, me, nepomuk-bugs
Attachments: Callgrind of nepomukindexer indexing larger PDF file

Description Stavros Mekesis 2013-03-03 16:08:20 UTC
After opening some .pdf files with Okular, nepomukindexer grabs 25% of CPU. This happens sometimes, but not always. Moreover, killing the process doesn't guarantee that the problem is solved: the odds are that it will start again after a few minutes and grab 25% of CPU once more.

Reproducible: Sometimes

Steps to Reproduce:
1. Turn on the laptop
2. Open .pdf files with Okular
Actual Results:  
nepomukindexer grabs 25% of CPU


Kernel version: 3.7.9-2-ARCH x86_64 GNU/Linux
Comment 1 Vishesh Handa 2013-03-04 00:18:50 UTC
The PDF files are probably quite big, so it takes some time to index them. Is 25% CPU usage for some time really such a big problem?

(BTW, this is more of a Dolphin thing than an Okular thing. You select the file and that loads the metadata. It sees that there isn't any metadata yet and indexes the file, which means loading all of its plain text content. This seems to take a while.)

Does it get stuck or just consume cpu for a bit?
Comment 2 Stavros Mekesis 2013-03-04 10:12:25 UTC
Unfortunately, it gets stuck, so I have to kill the process because my laptop gets hot. The PDF files are over 5-10 MB. It's no big deal to kill it, but it starts again after a few minutes and gets stuck.
Comment 3 Vishesh Handa 2013-03-04 11:56:17 UTC
(In reply to comment #2)
> Unfortunately, it gets stuck, so I have to kill the process because my
> laptop gets hot. The PDF files are over 5-10 MB. It's no big deal to kill
> it, but it starts again after a few minutes and gets stuck.

Do you think you could run $nepomukindexer <pdfFile> and see how long it actually takes? Maybe gdb into the process and get a backtrace so that we know what it is doing?

Or maybe you could just upload the pdf file? :)
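
[Editor's note: a minimal sketch of both suggestions, assuming gdb and the nepomuk-core debug symbols are installed; the file path is illustrative:]

    # Time a single-file run
    time nepomukindexer /path/to/big.pdf

    # Attach to a running indexer, dump all thread backtraces, then detach
    gdb --batch -p "$(pidof nepomukindexer)" -ex "thread apply all bt"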
Comment 4 Janek Bevendorff 2013-03-05 13:34:33 UTC
nepomukindexer should really limit its maximum CPU usage to 50% or 20%, if not lower.

I have several PDF files here of about 10, sometimes 20 or even 100 MB. Since I completely removed my messy pre-4.10 file index, Nepomuk is currently reindexing everything. This process takes a long time, which is okay; I'm not in a hurry. But when my PC is idle for a minute or so and nepomukindexer becomes active and starts indexing those big PDF files, it runs at 100% CPU the whole time, which noticeably increases the fan speed and the power consumption.

I already had to write a shell script that watches Nepomuk and throttles it when it starts hogging the CPU again.
Comment 5 Janek Bevendorff 2013-03-05 13:38:55 UTC
Addition: it doesn't really get stuck here. I temporarily raised the memory limit for Nepomuk to speed up the indexing and let it run for a while. After some time it finished the one file and started the next.
So it *does* index the files, but it consumes far too much CPU time.
Comment 6 Stavros Mekesis 2013-03-05 22:15:36 UTC
I tried to reproduce the problem many times, only to find that nepomukindexer consumed far too much CPU time regardless of what I did. For example, I was just browsing the web when it became active and started the entire indexing process. Whether or not it gets stuck, nepomukindexer takes a long time and the CPU fan spins up very fast. Moreover, "$nepomukindexer <pdfFile>" took only a few seconds. I'll try gdb tomorrow. Actually, Janek Bevendorff gave a great explanation of the problem.

Janek, could you upload your shell script?
Comment 7 Janek Bevendorff 2013-03-05 22:25:45 UTC
Actually, I blogged the script today:
http://www.refining-linux.org/archives/64/Programmatically-limit-CPU-usage-of-certain-processes/
It's a dirty hack, but it works as a workaround. In general it's nothing more than a script that calls cpulimit (may or may not be in your distro's repositories; there's a link to the GitHub page in the blog post) for nepomukindexer processes. It could also be done with cgroups, though.
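
[Editor's note: a minimal sketch of the same idea, not the blogged script itself; it assumes cpulimit is installed and caps nepomukindexer at 20% of one core:]

    #!/bin/sh
    # Keep throttling any nepomukindexer that appears.
    # cpulimit pauses/resumes the target with SIGSTOP/SIGCONT to keep
    # its CPU usage near the -l percentage; -z makes cpulimit exit
    # when no matching process exists, so we simply poll in a loop.
    while true; do
        cpulimit -z -e nepomukindexer -l 20
        sleep 5
    done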
Comment 8 Vishesh Handa 2013-03-06 09:41:09 UTC
Alright, so I've pushed this into KDE/4.10 -

commit 5b746fe5ac9c32bd32830995a34c2849c813716d
Author: Vishesh Handa <me@vhanda.in>
Date:   Wed Mar 6 14:59:56 2013 +0530

    PopplerExtractor: Do not extract all the plain text
    
    Virtuoso cannot handle all the plain text, it's best to only extract how
    much virtuoso can handle, instead of discarding the extra text later.

It should improve the situation, especially since we no longer save the entire plain text index, but we will need to find a better solution for the CPU and memory usage. I have an idea for fixing the memory usage, but it's a HUGE change and will probably require extensive testing. It might be ready in time for 4.12. Or maybe I'll try to contact the Virtuoso team again to see if there is any way to stream in the text.

I'm not sure what to do about the CPU usage. Could someone maybe give me a callgrind log of nepomukindexer indexing the PDF file? That way we'll know where the bottleneck is.
Comment 9 Janek Bevendorff 2013-03-06 10:39:05 UTC
Created attachment 77794 [details]
Callgrind of nepomukindexer indexing larger PDF file

I attached a callgrind of `nepomukindexer file.pdf`. I installed debug symbols for nepomuk-core; if you need more, please tell me.

The behavior when running nepomukindexer manually seems to be the same as with the automatic indexing. It brings the CPU up to about 99%. The only difference is that it finishes much more quickly (valgrind slowed it down, of course, but when run directly it only takes a few seconds). But I guess that's simply because of the memory limit that is applied to the automatic indexing.
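
[Editor's note: for anyone who wants to reproduce such a log, a sketch of the commands, assuming valgrind and KCachegrind are installed; the file name is illustrative:]

    # Run the indexer under callgrind; this is much slower than a direct run
    valgrind --tool=callgrind nepomukindexer file.pdf

    # valgrind writes callgrind.out.<pid> in the current directory;
    # open it in KCachegrind to see where the cycles go
    kcachegrind callgrind.out.*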
Comment 10 Vishesh Handa 2013-06-10 15:00:24 UTC
With the 4.11 release, if you kill the nepomukindexer process, it will not try to reindex that same file again. That's all I can do - we need to extract the plain text content of the file, and based on the callgrind log, it seems that all of the time is simply spent fetching it from the huge PDF file.

We do limit the amount of text to how much we can store, but I'm not sure what else we can do to improve the situation. Maybe future releases of Poppler will make the indexing of PDF files much faster. I cannot think of anything else.

I'm marking this as fixed, as the above patch improves the situation and there isn't much more we can do.