Bug 321225 - ODF indexer + huge dataset = HUGE memory usage
Status: RESOLVED FIXED
Alias: None
Product: nepomuk
Classification: Miscellaneous
Component: fileindexer
Version: 4.10.80
Platform: Chakra Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Nepomuk Bugs Coordination
Reported: 2013-06-16 14:55 UTC by Alejandro Nova
Modified: 2017-08-21 15:04 UTC

Description Alejandro Nova 2013-06-16 14:55:19 UTC
I tried to index a dataset of 120,000 rows, stored in ODS format.

nepomukfileindexer tried to use 3.2 GB of RAM to index it.

Reproducible: Always

Steps to Reproduce:
1. Index huge file with nepomukfileindexer
2. Memory usage skyrockets
Actual Results:  
Memory usage skyrockets

Expected Results:  
Memory usage doesn't skyrocket. The file isn't entirely loaded in memory, but in manageable chunks.

This isn't a memory leak or something like that, virtuoso and nepomukstorage are behaving very well. It's only nepomukfileindexer.

Chakra packages + soprano from master
Comment 1 Vishesh Handa 2013-06-16 16:31:56 UTC
Are you sure it's 'nepomukfileindexer' and not 'nepomukindexer'?
Comment 2 Alejandro Nova 2013-06-16 16:38:00 UTC
Typo, you are right, it's nepomukindexer
Comment 3 Vishesh Handa 2013-06-16 16:55:47 UTC
Could you possibly send me one of those ODF files which is causing this huge memory usage?

Also, each time a file is indexed, a new nepomukindexer process is spawned, so I'm not sure why/how the memory usage stays for some time.
Comment 4 Alejandro Nova 2013-06-16 21:25:02 UTC
1. It's a huge file (an address list with 130K addresses; the .ods file weighs 7.7 MB, and the same beast in Excel format was ~30 MB). Where can I send it? Does the KDE bug attachment system support such a file?
2. Easy.
a) A nepomukindexer process is spawned to index that file.
b) After some time with that process eating 2.4 GB of RAM + 1.2 GB of swap (3.6 GB in total on a 3 GB system), Nepomuk kills the process to prevent OOMing my system.
c) Rinse and repeat. nepomukindexer never fully indexes that file.
Comment 5 Vishesh Handa 2013-06-17 09:12:02 UTC
Try uploading it somewhere and then posting a link or just email it to me.
Comment 6 Vishesh Handa 2013-06-21 22:58:31 UTC
Git commit c2b902382f3ee34131d480348a9f48c9ceabfa79 by Vishesh Handa.
Committed on 21/06/2013 at 18:21.
Pushed by vhanda into branch 'master'.

Indexer: Make the plugins only extract a part of the full text

Introduce a maxPlainTextSize() which informs the plugin how much text
they should extract.

This is useful in two ways -

1. Sometimes one doesn't want any of the plain text, so one can set it
   to 0. This is used by the FileMetadataWidget to directly display the
   indexed data. Since we do not show the plain text, we do not need to
   extract it.

2. Virtuoso cannot handle queries above a certain number of bytes.
   2500243 seems to be the magic number. If you go above this limit,
   just a '0' is stored. Therefore it doesn't make sense to extract all
   of the plain text, when virtuoso can clearly not handle all of it.

   Virtuoso does not support streaming in text

M  +6    -0    services/fileindexer/indexer/epubextractor.cpp
M  +21   -0    services/fileindexer/indexer/extractorplugin.cpp
M  +13   -0    services/fileindexer/indexer/extractorplugin.h
M  +4    -1    services/fileindexer/indexer/indexer.cpp
M  +4    -0    services/fileindexer/indexer/main.cpp
M  +1    -1    services/fileindexer/indexer/mobipocket/mobiextractor.cpp
M  +5    -0    services/fileindexer/indexer/odfextractor.cpp
M  +11   -1    services/fileindexer/indexer/office2007extractor.cpp
M  +4    -8    services/fileindexer/indexer/popplerextractor.cpp

http://commits.kde.org/nepomuk-core/c2b902382f3ee34131d480348a9f48c9ceabfa79
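
For readers who want to see the idea behind the commit in isolation, here is a minimal sketch of capping extracted plain text. It is not the nepomuk-core code: CappedTextSink, extractCapped and the chunk list are hypothetical names made up for illustration, and only the notion of a maximum plain text size comes from the commit message. The sketch counts bytes and does not try to avoid cutting a multi-byte UTF-8 character in half.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the place where an extractor plugin collects
// plain text. Not the nepomuk-core API.
class CappedTextSink {
public:
    explicit CappedTextSink(std::size_t maxPlainTextSize)
        : m_max(maxPlainTextSize) {}

    // True once no more text can be stored.
    bool full() const { return m_text.size() >= m_max; }

    // Appends as much of `chunk` as still fits under the cap (byte count).
    void append(const std::string& chunk) {
        const std::size_t room = m_max - m_text.size();
        m_text.append(chunk, 0, room);  // copies at most `room` bytes
        assert(m_text.size() <= m_max);
    }

    const std::string& text() const { return m_text; }

private:
    std::size_t m_max;
    std::string m_text;
};

// Feeds chunks (e.g. cells or paragraphs) to the sink and stops as soon as
// the cap is reached, so the rest of the document is never accumulated.
// Returns the number of chunks that were actually read.
std::size_t extractCapped(const std::vector<std::string>& chunks,
                          CappedTextSink& sink) {
    std::size_t consumed = 0;
    for (const std::string& chunk : chunks) {
        if (sink.full()) {
            break;
        }
        ++consumed;
        sink.append(chunk);
    }
    return consumed;
}
```

With a cap of 0 the loop reads nothing at all, which matches the first use case in the commit message (no plain text wanted). With a cap near the Virtuoso limit quoted there, memory held for plain text is bounded by the cap instead of growing with the size of the spreadsheet.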