Bug 321225

Summary: ODF indexer + huge dataset = HUGE memory usage
Product: nepomuk Reporter: Alejandro Nova <alejandronova>
Component: fileindexer Assignee: Nepomuk Bugs Coordination <nepomuk-bugs>
Status: RESOLVED FIXED    
Severity: normal CC: me, nepomuk-bugs
Priority: NOR    
Version: 4.10.80   
Target Milestone: ---   
Platform: Chakra   
OS: Linux   

Description Alejandro Nova 2013-06-16 14:55:19 UTC
I tried to index a dataset of 120,000 rows, stored in ODS format.

nepomukfileindexer tried to use 3.2 GB of RAM to index it.

Reproducible: Always

Steps to Reproduce:
1. Index huge file with nepomukfileindexer
2. Memory usage skyrockets
Actual Results:  
Memory usage skyrockets

Expected Results:  
Memory usage doesn't skyrocket. The file shouldn't be loaded into memory all at once, but in manageable chunks.

This isn't a memory leak or anything like that; virtuoso and nepomukstorage are behaving very well. It's only nepomukfileindexer.

Chakra packages + soprano from master
Comment 1 Vishesh Handa 2013-06-16 16:31:56 UTC
Are you sure it's the 'nepomukfileindexer' and not the 'nepomukindexer' ?
Comment 2 Alejandro Nova 2013-06-16 16:38:00 UTC
Typo, you are right, it's nepomukindexer
Comment 3 Vishesh Handa 2013-06-16 16:55:47 UTC
Could you possibly send me one of those ODF files which is causing this huge memory usage?

Also, each time a file is indexed, a new nepomukindexer process is spawned, so I'm not sure why/how the memory usage stays for some time.
Comment 4 Alejandro Nova 2013-06-16 21:25:02 UTC
1. It's a huge file (an address list with 130K addresses, a .ods file weighing in at 7.7 MB; the same beast in Excel format was ~30 MB). Where can I send it? Does the KDE bug attachment system support such a file?
2. Easy.
a) A nepomukindexer process is spawned to index that file.
b) After some time with that process eating 2.4 GB of RAM + 1.2 GB of swap (that's 3.6 GB in total on a 3 GB system), Nepomuk kills the process to prevent OOMing my system.
c) Rinse and repeat. nepomukindexer never fully indexes that file.
Comment 5 Vishesh Handa 2013-06-17 09:12:02 UTC
Try uploading it somewhere and then posting a link or just email it to me.
Comment 6 Vishesh Handa 2013-06-21 22:58:31 UTC
Git commit c2b902382f3ee34131d480348a9f48c9ceabfa79 by Vishesh Handa.
Committed on 21/06/2013 at 18:21.
Pushed by vhanda into branch 'master'.

Indexer: Make the plugins only extract a part of the full text

Introduce a maxPlainTextSize() which informs the plugin how much text
they should extract.

This is useful in two ways -

1. Sometimes one doesn't want any of the plain text, so one can set it
   to 0. This is used by the FileMetadataWidget to directly display the
   indexed data. Since we do not show the plain text, we do not need to
   extract it.

2. Virtuoso cannot handle queries above a certain number of bytes.
   2500243 seems to be the magic number. If you go above this limit,
   just a '0' is stored. Therefore it doesn't make sense to extract all
   of the plain text, when virtuoso can clearly not handle all of it.

   Virtuoso does not support streaming in text

M  +6    -0    services/fileindexer/indexer/epubextractor.cpp
M  +21   -0    services/fileindexer/indexer/extractorplugin.cpp
M  +13   -0    services/fileindexer/indexer/extractorplugin.h
M  +4    -1    services/fileindexer/indexer/indexer.cpp
M  +4    -0    services/fileindexer/indexer/main.cpp
M  +1    -1    services/fileindexer/indexer/mobipocket/mobiextractor.cpp
M  +5    -0    services/fileindexer/indexer/odfextractor.cpp
M  +11   -1    services/fileindexer/indexer/office2007extractor.cpp
M  +4    -8    services/fileindexer/indexer/popplerextractor.cpp

http://commits.kde.org/nepomuk-core/c2b902382f3ee34131d480348a9f48c9ceabfa79
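
For readers who want to see the idea behind the maxPlainTextSize() limit from the commit message, here is a minimal sketch of an extractor that stops collecting text once a byte budget is reached. It is not the code from the commit: BoundedTextCollector and extractPlainText are hypothetical names made up for illustration, plain std::string stands in for whatever string type the real plugins use, and a vector of strings stands in for the spreadsheet cells.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of nepomuk-core): accumulates plain text
// but never holds more than maxPlainTextSize bytes.
class BoundedTextCollector {
public:
    explicit BoundedTextCollector(std::size_t maxPlainTextSize)
        : m_max(maxPlainTextSize) {}

    // Appends as much of `chunk` as still fits in the budget.
    // Returns false once the budget is exhausted, so the caller can stop
    // parsing the rest of the document instead of buffering it.
    // Note: truncation is byte-based and may cut a multi-byte character.
    bool append(const std::string& chunk) {
        if (m_text.size() >= m_max) {
            return false;
        }
        const std::size_t room = m_max - m_text.size();
        m_text.append(chunk, 0, room);  // copies at most `room` bytes
        assert(m_text.size() <= m_max);
        return m_text.size() < m_max;
    }

    const std::string& text() const { return m_text; }

private:
    std::size_t m_max;
    std::string m_text;
};

// Hypothetical extractor loop: `cells` stands in for the cells of a large
// spreadsheet. A limit of 0 means "extract no plain text at all".
std::string extractPlainText(const std::vector<std::string>& cells,
                             std::size_t maxPlainTextSize) {
    BoundedTextCollector collector(maxPlainTextSize);
    for (const std::string& cell : cells) {
        if (!collector.append(cell + ' ')) {
            break;  // budget reached: skip the remaining cells
        }
    }
    return collector.text();
}
```

With a limit of 0 this returns an empty string, which corresponds to the FileMetadataWidget case in the commit message. With a large limit, such as the 2500243 bytes mentioned for Virtuoso, memory held for the plain text stays bounded by that figure no matter how many rows the file has. Whether the real plugins also stop parsing early, as this sketch does, is an assumption.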