Bug 294727

Summary:	Most (but not all) of the PDF files cannot be handled correctly in Strigi (nepomukindexer cannot index them)
Product:	[Unmaintained] nepomuk	Reporter:	Vangelis <cyberang3l>
Component:	fileindexer	Assignee:	Sebastian Trueg <sebastian>
Status:	RESOLVED FIXED
Severity:	normal	CC:	lacsilva, me, stephanolbrich
Priority:	NOR
Version First Reported In:	4.8
Target Milestone:	---
Platform:	Ubuntu
OS:	Linux
Latest Commit:		Version Fixed/Implemented In:
Sentry Crash Report:
Attachments:	One more of the PDFs fails to get indexed

Description Vangelis 2012-02-24 02:35:26 UTC

Version:           4.8 (using KDE 4.8.0) 
OS:                Linux

I have many PDF files that do not appear in the nepomuk search results.
I ran many search queries using keywords I know these PDFs contain, and I even searched directly by filename from within dolphin, krunner and nepoogle.
I removed my nepomuk database, tried to index these PDF files using the nepomukindexer command, and moved them around to different folders, but it is impossible to get them indexed.

For some of these PDFs, nepomukindexer doesn't print anything (which suggests there is no error), but for others it prints various errors. The nepomukindexer exit status is always 0, though... I guess that's probably a separate bug.

Reproducible: Always

Steps to Reproduce:
Try to index some PDF files using the nepomukindexer command.
Then use nepoogle to see if they have entered the nepomuk database.

./nepoogle url:"file name of pdf.pdf"

One PDF that definitely cannot be indexed for me (with no error returned by nepomukindexer) is the MLN Manual, which you can get from this SourceForge link:
http://mln.sourceforge.net/doc/mln-manual.pdf
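
The steps above can be repeated over a batch of files with a small script. This is only a rough sketch, not part of nepomuk: it assumes nepomukindexer is on the PATH and takes a single file path as its argument, as described in this report. The function names run_indexer and summarize are made up for illustration. Because the exit status is reported to always be 0, the script looks at whether anything was printed instead of trusting the status.

```python
import subprocess
import sys
from pathlib import Path


def run_indexer(pdf_paths, command=("nepomukindexer",)):
    """Run the indexer command once per file and collect what it printed.

    Returns a list of (path, exit_status, output) tuples. The default
    command is an assumption based on the report: `nepomukindexer <file>`.
    """
    results = []
    for path in pdf_paths:
        proc = subprocess.run(
            [*command, str(path)],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge error messages into the output
            text=True,
        )
        results.append((str(path), proc.returncode, proc.stdout.strip()))
    return results


def summarize(results):
    """Split files into 'silent' (no output) and 'noisy' (printed something).

    The exit status is ignored on purpose, since the report says it is
    always 0 even when errors are printed.
    """
    silent = [path for path, _, out in results if not out]
    noisy = [(path, out) for path, _, out in results if out]
    return silent, noisy


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python check_pdfs.py /path/to/pdf/folder
    pdfs = sorted(Path(sys.argv[1]).glob("*.pdf"))
    silent, noisy = summarize(run_indexer(pdfs))
    print("No output from indexer:")
    for path in silent:
        print("  " + path)
    print("Indexer printed messages:")
    for path, out in noisy:
        print("  " + path + ": " + out.splitlines()[0])
```

A file in the "no output" list is not necessarily indexed; as with the MLN Manual, it still has to be checked with nepoogle afterwards.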

Actual Results:  
The PDF file is not indexed, so queries related to that file return no results.

Expected Results:  
All PDF files should be indexed correctly by nepomuk.

There is a whole thread on the KDE forums with lots of information and attempted fixes, none of which worked.
http://forum.kde.org/viewtopic.php?f=154&t=99385

Comment 1 Vangelis 2012-02-24 02:39:22 UTC

Created attachment 69048 [details]
One more of the PDFs fails to get indexed

Comment 2 Stephan Olbrich 2012-04-21 10:10:42 UTC

I have the same problem.
With xmlindexer I get a lot of information about the pdfs (metadata and content), but nepomukindexer returns without printing anything.
Monitoring with
sopranocmd --dbus org.kde.NepomukStorage --model main monitor
shows nothing.

The files in question show nothing when opened in nepomukshell and show no hash in dolphin.

Comment 3 Luis Silva 2012-07-14 15:36:11 UTC

I guess that bugs #285128 and #234069 could be clustered into this one. This is a problem with the Strigi analyzer. In the repo there are two branches with alternative analyzers: "newPdfAnalyzer" and "popplerPdfAnalyzer". Although incomplete, both of these alternatives produce better results than the default PDF analyzer.
Could any of the developers involved take a stab at making one of these alternatives the default?

Comment 4 Vishesh Handa 2012-12-03 07:36:47 UTC

In KDE 4.10, we have moved away from Strigi and are using our own indexer based on poppler. I'm not marking this bug as fixed, as the indexer has not been thoroughly tested. It could still use some polish.

I'll mark this as fixed when I have tested it adequately.

Comment 5 Vishesh Handa 2012-12-27 08:37:32 UTC

This new PDF analyzer works quite well :)