Bug 238138 - strigi pdf indexing is broken
Status: RESOLVED FIXED
Alias: None
Product: nepomuk
Classification: Miscellaneous
Component: general
Version: unspecified
Platform: Compiled Sources Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Sebastian Trueg
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-19 09:30 UTC by Michi
Modified: 2011-01-05 18:31 UTC
CC List: 3 users

See Also:
Latest Commit:
Version Fixed In:


Description Michi 2010-05-19 09:30:20 UTC
Version:            (using Devel)
Compiler:          gcc 4.5 
OS:                Linux
Installed from:    Compiled sources

Hi,

I think that since strigi 0.7.2 I have had a problem with indexing PDF files. Whenever the indexer hits a PDF, it hogs one CPU core and simply hangs.

When I do a manual run with strigicmd, I get log output like the following:

'' is not a UTF8 or latin1 string
Error in parsing: Keyword obj not found.

That's when the process starts hanging.
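
One way to narrow down which files trigger the hang is to run the indexer on each PDF separately with a timeout. The following is only a minimal sketch in Python: `find_hanging_files` is a hypothetical helper, and the exact strigicmd invocation is not shown in this report, so the command to run per file is left as a parameter supplied by the caller.

```python
import subprocess


def find_hanging_files(paths, command, timeout=30.0):
    """Run `command + [path]` once per path and collect the paths that time out.

    `command` is the argument list of the indexer invocation (an assumption:
    the caller fills in whatever command reproduces the hang). The file path
    is appended as the last argument.
    """
    hanging = []
    for path in paths:
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it has not finished within `timeout` seconds.
            subprocess.run(
                list(command) + [str(path)],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            hanging.append(str(path))
    return hanging
```

Calling it with a list of PDF paths and the indexer command returns the files that did not finish within the timeout; any of those could then be attached to the report as a reproducer.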
Comment 1 Jure Repinc 2010-06-06 15:35:47 UTC
I think I have the same problem here in KDE SC 4.5 compiled from trunk on 2010-06-05. When I check the process list I see "/home/kde-devel/kde/bin/nepomukservicestub nepomukstrigiservice" using all the CPU. I attached to it with GDB and got this backtrace:

(gdb) thread apply all where

Thread 2 (Thread 0x7f75cabe6710 (LWP 18485)):
#0  0x00000034492c44cd in read () from /lib/libc.so.6
#1  0x000000344926e42f in ?? () from /lib/libc.so.6
#2  0x0000003449263e09 in fread () from /lib/libc.so.6
#3  0x00007f75d0c180d5 in Strigi::SkippingFileInputStream::read(char const*&, int, int) () from /home/kde-devel/kde/lib/libstreams.so.0
#4  0x00007f75d0c01044 in Strigi::DataEventInputStream::read(char const*&, int, int) () from /home/kde-devel/kde/lib/libstreams.so.0
#5  0x00007f75d0eef6bd in PdfParser::read(int, int) () from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#6  0x00007f75d0eef7f0 in PdfParser::checkForData(int) () from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#7  0x00007f75d0eefa7d in PdfParser::skipNotFromString(char const*, int) () from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#8  0x00007f75d0ef0331 in PdfParser::parseName() () from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#9  0x00007f75d0ef05aa in PdfParser::parseDictionaryOrStream() () from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#10 0x00007f75d0ef0fcb in PdfParser::parseObjectStreamObject(int) () from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#11 0x00007f75d0ef1689 in PdfParser::parseObjectStreamObjectDef() () from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#12 0x00007f75d0ef17eb in PdfParser::parse(Strigi::StreamBase<char>*) () from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#13 0x00007f75d0f289c1 in PdfEndAnalyzer::analyze(Strigi::AnalysisResult&, Strigi::StreamBase<char>*) ()
   from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#14 0x00007f75d0efdfdc in Strigi::StreamAnalyzerPrivate::analyze(Strigi::AnalysisResult&, Strigi::StreamBase<char>*) ()
   from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#15 0x00007f75d0efdac4 in Strigi::StreamAnalyzer::analyze(Strigi::AnalysisResult&, Strigi::StreamBase<char>*) ()
   from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#16 0x00007f75d0ebfdf0 in Strigi::AnalysisResult::index(Strigi::StreamBase<char>*) () from /home/kde-devel/kde/lib/libstreamanalyzer.so.0.7
#17 0x00007f75cbced915 in Nepomuk::IndexScheduler::analyzeFile (this=0x1ecb4b0, file=..., analyzer=0x7f75cabe5dd0)
    at /home/kde-devel/kde/src/kdebase/runtime/nepomuk/services/strigi/indexscheduler.cpp:429
#18 0x00007f75cbced3b6 in Nepomuk::IndexScheduler::updateDir (this=0x1ecb4b0, dir=..., analyzer=0x7f75cabe5dd0, flags=...)
    at /home/kde-devel/kde/src/kdebase/runtime/nepomuk/services/strigi/indexscheduler.cpp:395
#19 0x00007f75cbcec9e7 in Nepomuk::IndexScheduler::run (this=0x1ecb4b0)
    at /home/kde-devel/kde/src/kdebase/runtime/nepomuk/services/strigi/indexscheduler.cpp:296
#20 0x00007f75d52d2570 in QThreadPrivate::start (arg=0x1ecb4b0) at thread/qthread_unix.cpp:266
#21 0x0000003449e068e4 in start_thread () from /lib/libpthread.so.0
#22 0x00000034492d129d in clone () from /lib/libc.so.6

Thread 1 (Thread 0x7f75d1b5b760 (LWP 18409)):
#0  0x00000034492c8573 in poll () from /lib/libc.so.6
#1  0x000000344d23e6bc in ?? () from /usr/lib/libglib-2.0.so.0
#2  0x000000344d23ea00 in g_main_context_iteration () from /usr/lib/libglib-2.0.so.0
#3  0x00007f75d544414f in QEventDispatcherGlib::processEvents (this=0x1d6d120, flags=...) at kernel/qeventdispatcher_glib.cpp:412
#4  0x00007f75d2d4f588 in QGuiEventDispatcherGlib::processEvents (this=0x1d6d120, flags=...) at kernel/qguieventdispatcher_glib.cpp:204
#5  0x00007f75d5401e8c in QEventLoop::processEvents (this=0x7fffe0dfa7f0, flags=...) at kernel/qeventloop.cpp:149
#6  0x00007f75d5401fe2 in QEventLoop::exec (this=0x7fffe0dfa7f0, flags=...) at kernel/qeventloop.cpp:201
#7  0x00007f75d5405590 in QCoreApplication::exec () at kernel/qcoreapplication.cpp:1009
#8  0x00007f75d2c60ab0 in QApplication::exec () at kernel/qapplication.cpp:3665
#9  0x0000000000404102 in main (argc=2, argv=0x7fffe0dfad58) at /home/kde-devel/kde/src/kdebase/runtime/nepomuk/servicestub/main.cpp:152
Comment 2 phreedom.stdin 2010-08-22 03:09:46 UTC
Can't fix this unless I have a copy of the file in question :(
Comment 3 Michi 2010-08-22 07:25:09 UTC
(In reply to comment #2)
> Can't fix this unless I have a copy of the file in question :(

Actually, it was almost any PDF. And indeed it _was_, because since the 4.5 release the problem has simply vanished.
Comment 4 Sebastian Trueg 2011-01-05 18:31:59 UTC
Closing as fixed, based on the last comment about 4.5.