I tried to index a dataset of 120,000 rows, stored in ODS format.
nepomukfileindexer tried to use 3.2 GB of RAM to index it.
Steps to Reproduce:
1. Index huge file with nepomukfileindexer
2. Memory usage skyrockets
Actual Results:
Memory usage skyrockets
Expected Results:
Memory usage doesn't skyrocket. The file isn't loaded into memory all at once, but in manageable chunks.
This isn't a memory leak or anything like that; virtuoso and nepomukstorage are behaving very well. Only nepomukfileindexer is affected.
Chakra packages + soprano from master
Are you sure it's 'nepomukfileindexer' and not 'nepomukindexer'?
Typo, you are right, it's nepomukindexer
Could you possibly send me one of the ODF files that are causing this huge memory usage?
Also, each time a file is indexed, a new nepomukindexer process is spawned, so I'm not sure why/how the memory usage stays for some time.
1. It's a huge file (it's an address list with 130K addresses, a .ods file weighing in at 7.7 MB; the same beast in Excel format was ~30 MB). Where can I send it? Does the KDE bug attachment system support such a file?
a) A nepomukindexer process is spawned to index that file.
b) After some time, with that process eating 2.4 GB of RAM + 1.2 GB of swap (3.6 GB in total on a 3 GB system), Nepomuk kills the process to prevent OOMing my system.
c) Rinse and repeat. nepomukindexer never fully indexes that file.
Try uploading it somewhere and then posting a link or just email it to me.
Git commit c2b902382f3ee34131d480348a9f48c9ceabfa79 by Vishesh Handa.
Committed on 21/06/2013 at 18:21.
Pushed by vhanda into branch 'master'.
Indexer: Make the plugins only extract a part of the full text
Introduce a maxPlainTextSize() which informs the plugin how much text
they should extract.
This is useful in two ways -
1. Sometimes one doesn't want any of the plain text, so one can set it
to 0. This is used by the FileMetadataWidget to directly display the
indexed data. Since we do not show the plain text, we do not need to
2. Virtuoso cannot handle queries above a certain number of bytes.
2500243 seems to be the magic number. If you go above this limit,
just a '0' is stored. Therefore it doesn't make sense to extract all
of the plain text, when virtuoso can clearly not handle all of it.
Virtuoso does not support streaming in text
M +6 -0 services/fileindexer/indexer/epubextractor.cpp
M +21 -0 services/fileindexer/indexer/extractorplugin.cpp
M +13 -0 services/fileindexer/indexer/extractorplugin.h
M +4 -1 services/fileindexer/indexer/indexer.cpp
M +4 -0 services/fileindexer/indexer/main.cpp
M +1 -1 services/fileindexer/indexer/mobipocket/mobiextractor.cpp
M +5 -0 services/fileindexer/indexer/odfextractor.cpp
M +11 -1 services/fileindexer/indexer/office2007extractor.cpp
M +4 -8 services/fileindexer/indexer/popplerextractor.cpp
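
For readers who want to see the idea behind the cap described in the commit message, here is a minimal sketch in standard C++. It is not the Nepomuk code: `extractPlainText` and its `chunkSize` parameter are hypothetical names chosen for illustration, and only the notion of a maximum plain text size (and the 2500243 figure) comes from the commit message. The sketch reads a stream in fixed-size chunks and stops as soon as the limit is reached, so the whole document never has to be held in memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper, not the actual Nepomuk implementation.
// Reads plain text from `in` in chunks of at most `chunkSize` bytes and
// stops once `maxPlainTextSize` bytes have been collected.
std::string extractPlainText(std::istream& in,
                             std::size_t maxPlainTextSize,
                             std::size_t chunkSize = 4096)
{
    std::string text;
    if (chunkSize == 0) {
        return text; // nothing can be read with an empty chunk
    }

    std::string chunk(chunkSize, '\0');
    while (text.size() < maxPlainTextSize && in) {
        // Never ask for more than what is still allowed by the limit.
        const std::size_t want =
            std::min(chunkSize, maxPlainTextSize - text.size());
        in.read(&chunk[0], static_cast<std::streamsize>(want));

        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) {
            break; // end of input
        }
        text.append(chunk, 0, got);
    }
    return text;
}
```

Passing a limit of 0 returns an empty string without reading anything, which matches the first use case in the commit message; passing 2500243 would correspond to the second. Note that this sketch counts bytes, so a multi-byte UTF-8 character could be cut at the limit; a real extractor would need to handle that.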