294727 – Most (but not all) of the PDF files cannot be handled correctly in Strigi (nepomukindexer cannot index them)


Status:	RESOLVED FIXED

Alias:	None

Product:	nepomuk
Classification:	Unmaintained
Component:	fileindexer
Version:	4.8
Platform:	Ubuntu Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Sebastian Trueg

URL:
Keywords:

Depends on:
Blocks:

Reported:	2012-02-24 02:35 UTC by Vangelis
Modified:	2012-12-27 08:37 UTC
CC List:	3 users

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Attachments
One more of the PDFs fails to get indexed (365.56 KB, application/pdf) 2012-02-24 02:39 UTC, Vangelis

Description Vangelis 2012-02-24 02:35:26 UTC

Version:           4.8 (using KDE 4.8.0) 
OS:                Linux

I have many PDF files that do not appear in the nepomuk search results.
I ran many search queries using keywords I know these PDFs contain, and I even searched directly by filename from within dolphin, krunner and nepoogle.
I removed my nepomuk database, tried to index these PDF files using the nepomukindexer command, and moved them around to different folders, but it is impossible to get them indexed.

For some of these PDFs, nepomukindexer doesn't print anything (which suggests there is no error), but for others various errors are returned. The nepomukindexer exit status is always 0, though... I guess that's probably a separate bug in itself.

Reproducible: Always

Steps to Reproduce:
Try to index some PDF files using the nepomukindexer command.
Then use nepoogle to check whether they have been added to the nepomuk database.

./nepoogle url:"file name of pdf.pdf"

One PDF that definitely cannot be indexed for me (with no error returned by nepomukindexer) is the MLN Manual, which you can get from this SourceForge link:
http://mln.sourceforge.net/doc/mln-manual.pdf
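
The first step above can be scripted to record, for each file, the exit status and whether anything was printed. This is only a minimal sketch: the function name check_pdfs is hypothetical, and it assumes the indexer accepts a single file path as its argument, as nepomukindexer is used in this report. The indexer command is passed in as the first argument so the same loop can be pointed at another tool.

```shell
# Hypothetical helper: run an indexer command on each file and report
# its exit status plus the first line of whatever it printed.
# Usage (assumed invocation): check_pdfs nepomukindexer ~/Documents/*.pdf
check_pdfs() {
    cp_indexer=$1
    shift
    for cp_file in "$@"; do
        # Capture stdout and stderr together; $? is the indexer's exit status.
        cp_out=$("$cp_indexer" "$cp_file" 2>&1)
        cp_status=$?
        # Keep only the first line so each file stays on one report line.
        cp_first=$(printf '%s\n' "$cp_out" | head -n 1)
        if [ -z "$cp_first" ]; then
            cp_first="(no output)"
        fi
        printf '%s\texit=%s\t%s\n' "$cp_file" "$cp_status" "$cp_first"
    done
}
```

A file reported with exit=0 and "(no output)" that still cannot be found with nepoogle matches the silent failure described here.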

Actual Results:  
The PDF file is not indexed, so queries related to that file return no results.

Expected Results:  
All PDF files should be indexed correctly by nepomuk.

There is a whole thread in the KDE forums with lots of information and attempted fixes, none of which solved the problem.
http://forum.kde.org/viewtopic.php?f=154&t=99385

Comment 1 Vangelis 2012-02-24 02:39:22 UTC

Created attachment 69048
One more of the PDFs fails to get indexed

Comment 2 Stephan Olbrich 2012-04-21 10:10:42 UTC

I have the same problem.
With xmlindexer I get a lot of information about the PDFs (metadata and content), but nepomukindexer returns without printing anything.
Monitoring with
sopranocmd --dbus org.kde.NepomukStorage --model main monitor
shows nothing.

The files in question show nothing when opened in nepomukshell and show no hash in dolphin.

Comment 3 Luis Silva 2012-07-14 15:36:11 UTC

I guess that bugs #285128 and #234069 could be clustered into this one. This is a problem with the Strigi analyzer. In the repo there are two branches with alternative analyzers: "newPdfAnalyzer" and "popplerPdfAnalyzer". Although incomplete, both of these alternatives produce better results than the default PDF analyzer.
Could any of the developers involved take a stab at making one of these alternatives the default?

Comment 4 Vishesh Handa 2012-12-03 07:36:47 UTC

In KDE 4.10, we have moved away from Strigi and are using our own indexer based on poppler. I'm not marking this bug as fixed yet, as the indexer has not been thoroughly tested. It could still use some polish.

I'll mark this as fixed when I have tested it adequately.

Comment 5 Vishesh Handa 2012-12-27 08:37:32 UTC

This new PDF analyzer works quite well :)