| Summary: | ODF indexer + huge dataset = HUGE memory usage | | |
|---|---|---|---|
| Product: | [Unmaintained] nepomuk | Reporter: | Alejandro Nova <alejandronova> |
| Component: | fileindexer | Assignee: | Nepomuk Bugs Coordination <nepomuk-bugs> |
| Status: | RESOLVED FIXED | | |
| Severity: | normal | CC: | me, MYCCISEMPTY, nepomuk-bugs |
| Priority: | NOR | | |
| Version: | 4.10.80 | | |
| Target Milestone: | --- | | |
| Platform: | Chakra | | |
| OS: | Linux | | |
| Latest Commit: | http://commits.kde.org/nepomuk-core/c2b902382f3ee34131d480348a9f48c9ceabfa79 | Version Fixed In: | |
| Sentry Crash Report: | | | |
Description
Alejandro Nova
2013-06-16 14:55:19 UTC
> Are you sure it's the 'nepomukfileindexer' and not the 'nepomukindexer'?

Typo, you are right, it's nepomukindexer.

> Could you possibly send me one of those ODF files which is causing this huge memory usage? Also, each time a file is indexed, a new nepomukindexer process is spawned, so I'm not sure why/how the memory usage stays for some time.

1. It's a huge file (an address list with 130K addresses, a .ods file weighing in at 7.7 MB; the same beast in Excel format was ~30 MB). Where can I send it? Does the KDE bug attachment system support such a file?

2. Easy.
a) A nepomukindexer process is spawned to index that file.
b) After some time with that process eating 2.4 GB of RAM + 1.2 GB of swap (that's 3.6 GB on a 3 GB system), Nepomuk kills the process to prevent OOMing my system.
c) Rinse and repeat. nepomukindexer never fully indexes that file.

Try uploading it somewhere and then posting a link, or just email it to me.

Git commit c2b902382f3ee34131d480348a9f48c9ceabfa79 by Vishesh Handa.
Committed on 21/06/2013 at 18:21.
Pushed by vhanda into branch 'master'.

Indexer: Make the plugins only extract a part of the full text

Introduce a maxPlainTextSize() which informs the plugin how much text they should extract.

This is useful in two ways -

1. Sometimes one doesn't want any of the plain text, so one can set it to 0. This is used by the FileMetadataWidget to directly display the indexed data. Since we do not show the plain text, we do not need to extract it.

2. Virtuoso cannot handle queries above a certain number of bytes. 2500243 seems to be the magic number. If you go above this limit, just a '0' is stored. Therefore it doesn't make sense to extract all of the plain text when Virtuoso clearly cannot handle all of it. Virtuoso does not support streaming in text.

M +6 -0 services/fileindexer/indexer/epubextractor.cpp
M +21 -0 services/fileindexer/indexer/extractorplugin.cpp
M +13 -0 services/fileindexer/indexer/extractorplugin.h
M +4 -1 services/fileindexer/indexer/indexer.cpp
M +4 -0 services/fileindexer/indexer/main.cpp
M +1 -1 services/fileindexer/indexer/mobipocket/mobiextractor.cpp
M +5 -0 services/fileindexer/indexer/odfextractor.cpp
M +11 -1 services/fileindexer/indexer/office2007extractor.cpp
M +4 -8 services/fileindexer/indexer/popplerextractor.cpp

http://commits.kde.org/nepomuk-core/c2b902382f3ee34131d480348a9f48c9ceabfa79
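
For readers who want a concrete picture of the idea in the commit message, here is a minimal, self-contained sketch of capping extracted plain text at a maximum size. It is not the code from the commit: `CappedTextBuffer` and `extractPlainText` are hypothetical names invented for this illustration, the spreadsheet is modelled as a plain list of cell strings, and the only thing taken from the commit message is the notion of a `maxPlainTextSize` limit, where 0 means "extract no plain text at all".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of nepomuk-core): accumulates text
// but never grows beyond the configured byte limit.
class CappedTextBuffer {
public:
    explicit CappedTextBuffer(std::size_t maxPlainTextSize)
        : m_max(maxPlainTextSize) {}

    // Appends as much of `chunk` as still fits.
    // Returns false once the cap is reached, so the caller can stop parsing
    // instead of holding the whole document text in memory.
    bool append(const std::string& chunk) {
        const std::size_t room = m_max - m_text.size();
        // std::string::append(str, pos, n) copies at most n characters.
        // Note: this counts bytes, so a multi-byte UTF-8 character could be
        // cut at the boundary; a real extractor would need to handle that.
        m_text.append(chunk, 0, room);
        assert(m_text.size() <= m_max);
        return m_text.size() < m_max;
    }

    const std::string& text() const { return m_text; }

private:
    std::size_t m_max;
    std::string m_text;
};

// Assumption: the document has already been split into cell strings.
// A real ODF extractor would read them incrementally from the XML.
std::string extractPlainText(const std::vector<std::string>& cells,
                             std::size_t maxPlainTextSize) {
    CappedTextBuffer buffer(maxPlainTextSize);
    for (const std::string& cell : cells) {
        if (!buffer.append(cell + ' ')) {
            break;  // limit reached: skip the rest of the document
        }
    }
    return buffer.text();
}
```

With a limit of 0 the loop stops on the first cell and nothing is extracted, which corresponds to the first use case described in the commit message; with a non-zero limit, memory used for the plain text is bounded by the limit rather than by the size of the spreadsheet.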