Bug 203786 - nepomuk won't index the contents of pdf files
Summary: nepomuk won't index the contents of pdf files
Status: RESOLVED FIXED
Alias: None
Product: nepomuk
Classification: Miscellaneous
Component: general
Version: unspecified
Platform: Gentoo Packages Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Sebastian Trueg
 
Reported: 2009-08-14 05:51 UTC by bonne
Modified: 2010-01-27 13:48 UTC
CC List: 6 users

Attachments
xmlindexer.output (19.00 KB, text/plain)
2009-12-11 15:10 UTC, Roberto

Description bonne 2009-08-14 05:51:19 UTC
Version:           4.3.0 (using KDE 4.3.0)
Compiler:          gcc 4.3.2 Gentoo 4.3.2-r3 p1.6, pie-10.1.5
OS:                Linux
Installed from:    Gentoo Packages

Nepomuk/Strigi has indexed my home directory. 
When I search using either the nepomuksearch:/ protocol or krunner I do not get all the results I expect.
There are never any pdf files which include the phrase being searched. 
I do get pdfs that contain the word in their filename, and I also get word and text files that contain the word inside. 

I believe that Nepomuk/Strigi is not indexing the contents of pdf files.
Comment 1 Kornel Jahn 2009-09-09 15:33:05 UTC
Confirmed on Kubuntu 9.10 alpha5.
Comment 2 mutlu inek 2009-10-07 21:29:33 UTC
I can confirm this on Arch Linux with KDE 4.3.2, too.

Searches with nepomuksearch:/ return PDF files only according to their file name, while text files, MS Word documents, ODFs and HTML pages are also found if their content matches.
Comment 3 Sebastian Trueg 2009-10-08 12:17:07 UTC
See http://sourceforge.net/tracker/index.php?func=detail&aid=1585381&group_id=171000&atid=856302. You need to make sure that pdftotext is installed on your system. And maybe Arch Linux and Kubuntu should make it a runtime dependency of Strigi?

But then there is also: http://sourceforge.net/tracker/index.php?func=detail&aid=1677725&group_id=171000&atid=856302

I would not know how to solve this. :(
Comment 4 bonne 2009-10-08 14:06:47 UTC
I definitely have pdftotext installed on my system, but it seems like the second issue could be the problem. Most of the PDFs I read contain maths characters, so that could stuff up the conversion to UTF-8 as mentioned in that second link.

There could be a workaround which just extracts the usable text from pdftotext and discards any unrecognised symbols, rather than just failing.
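A minimal sketch of that workaround idea, assuming the bytes coming out of pdftotext are meant to be UTF-8. The helper name `salvage_text` is hypothetical and is not part of Strigi or pdftotext; it only illustrates keeping the usable text and silently dropping whatever cannot be decoded, instead of failing on the whole file.

```python
import unicodedata


def salvage_text(raw: bytes) -> str:
    """Keep whatever decodes as UTF-8 and drop everything else.

    Hypothetical helper for illustration only.
    """
    # errors="ignore" skips invalid byte sequences instead of raising
    # UnicodeDecodeError, so one bad symbol does not lose the whole file.
    text = raw.decode("utf-8", errors="ignore")

    kept = []
    for ch in text:
        if ch in "\n\t":
            # Keep line breaks and tabs so words stay separated.
            kept.append(ch)
        elif unicodedata.category(ch) in {"Cc", "Cs", "Co", "Cn"}:
            # Drop control, surrogate, private-use and unassigned code points.
            continue
        else:
            kept.append(ch)
    return "".join(kept)
```

Calling `salvage_text()` on the raw extractor output would give an indexer the readable words even when a PDF contains symbols that do not convert cleanly. Valid maths characters such as U+2211 are kept; only undecodable bytes and non-text code points are discarded.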
Comment 5 mutlu inek 2009-10-08 23:20:08 UTC
I have to admit that I have many PDFs with nonstandard characters as well. However, this bug applies to all PDFs. I created a number of very simple ASCII-only PDFs and waited for Strigi to index them. The problem persists: I can find them by their filename, not their content.

Also, on my system pdftotext is provided by poppler (ver. 10.7). Tracker also uses poppler's pdftotext and it successfully indexes my PDF files.

Moreover, if I recall correctly, indexing PDFs with Nepomuk/Strigi used to work (before 4.3?).

Thus, it cannot be pdftotext's fault.
Comment 6 mutlu inek 2009-10-09 01:27:37 UTC
I was suspecting that Strigi was maybe built in an environment without poppler and that there could be linking or building issues, since the package does not have a dependency on it. So just to make sure, I have rebuilt Strigi from the sources with poppler installed. However, the problem persists.

And another finding, although this may be a different issue: Nepomuk/Strigi does not seem to index .odt files, either. I have literally thousands of them and cannot get a search result containing even one .odt file without the search term being part of the filename.

Anyone else experiencing this?
Comment 7 Thomas Kamps 2009-11-20 10:44:34 UTC
I can confirm this behavior too.
Nepomuk does not search the contents of PDFs.
It does search MS Office 97/XP/2007 .doc files, but not .odt files.
Comment 8 Roberto 2009-12-11 15:03:55 UTC
Confirmed on Kubuntu Karmic, KDE 4.3.4.
Comment 9 Roberto 2009-12-11 15:10:02 UTC
Created attachment 38991
xmlindexer.output
Comment 10 mutlu inek 2009-12-11 23:01:28 UTC
Since there is confirmation now, I have filed a separate bug report for Nepomuk not indexing the contents of files in OpenDocument format: https://bugs.kde.org/show_bug.cgi?id=218335
Comment 11 Kornel Jahn 2009-12-12 09:56:52 UTC
(In reply to comment #10)
> Since there is confirmation now, I have filed a separate bug report for Nepomuk
> not indexing the contents of files in OpenDocument format:
> https://bugs.kde.org/show_bug.cgi?id=218335

And what about PDF? It has also been confirmed.
Comment 12 mutlu inek 2009-12-13 01:59:17 UTC
(In reply to comment #11)
> And what about PDF? It has also been confirmed.
You must have misunderstood what I was writing. This bug report is about PDFs. All I have done is open a separate report for OpenDocument files. That this bug report is marked as "unconfirmed" has nothing to do with it and probably means nothing. This bug is well-known to the Nepomuk developers.
Comment 13 Kornel Jahn 2009-12-13 08:49:27 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > And what about PDF? It has also been confirmed.
> You must have misunderstood what I was writing. This bug report is about PDFs.
> All I have done is open a separate report for OpenDocument files. That this bug
> report is marked as "unconfirmed" has nothing to do with it and probably means
> nothing. This bug is well-known to the Nepomuk developers.

Sorry, I was dumb. I can also confirm that it is not the fault of pdftotext; it successfully converted every PDF I tried.
Comment 14 Sebastian Trueg 2009-12-14 10:17:37 UTC
The next version of Strigi will improve the pdf analysis. Is there any chance you could test the svn trunk of Strigi?
Comment 15 Thomas Kamps 2009-12-14 10:40:08 UTC
I can test the latest openSUSE builds of KDE SC 4.4.
Comment 16 mutlu inek 2009-12-21 04:12:23 UTC
I tested with the latest svn trunk of strigi and the kdelibs and kdebase of KDE SC beta 2.

PDF files are still not indexed.

To rule out a config file issue, I created a new user with a lot of example files. (Moreover, the indexing of .odt files works now, as I noted in that bug report. Clearly, Strigi does work for everything other than PDFs.)
Comment 17 mutlu inek 2010-01-23 07:53:30 UTC
I just retested with the RC2 packages provided by Arch Linux. I can happily announce that PDF indexing works now. I believe the bug report can be closed.

Can anyone confirm this?
Comment 18 Sebastian Trueg 2010-01-27 13:48:31 UTC
Text is extracted from pdfs here. Closing.