Version: 4.3.0 (using KDE 4.3.0)
Compiler: gcc 4.3.2 Gentoo 4.3.2-r3 p1.6, pie-10.1.5
OS: Linux
Installed from: Gentoo Packages
Nepomuk/Strigi has indexed my home directory. When I search using either the nepomuksearch:/ protocol or krunner I do not get all the results I expect. There are never any pdf files which include the phrase being searched. I do get pdfs that contain the word in their filename, and I also get word and text files that contain the word inside. I believe that Nepomuk/Strigi is not indexing the contents of pdf files.
confirmed on Kubuntu 9.10 alpha5.
I can confirm this on Arch Linux with KDE 4.3.2, too. Searches with nepomuksearch:/ return PDF files only when their file name matches, while text files, MS Word documents, ODFs and HTML pages are also found if their content matches.
See http://sourceforge.net/tracker/index.php?func=detail&aid=1585381&group_id=171000&atid=856302. You need to make sure that pdftotext is installed on your system. And maybe Arch Linux and Kubuntu should make it a runtime dependency of Strigi? But then there is also: http://sourceforge.net/tracker/index.php?func=detail&aid=1677725&group_id=171000&atid=856302 I would not know how to solve this. :(
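For anyone who wants to rule out the first cause quickly, here is a minimal sketch that checks whether a pdftotext binary is reachable on the current PATH. The function name `find_pdftotext` is made up for illustration, and how Strigi itself locates the helper is an assumption that this check does not verify.

```python
import shutil


def find_pdftotext(name: str = "pdftotext"):
    """Return the full path of the helper if it is on PATH, else None.

    Hypothetical helper: it only inspects the PATH of the current
    process, which may differ from the environment Strigi runs in.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_pdftotext()
    if path is None:
        print("pdftotext not found on PATH")
    else:
        print(f"pdftotext found at {path}")
```

A result of None here would point to the missing-dependency explanation; a valid path means the cause has to be looked for elsewhere.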
I definitely have pdftotext installed on my system, but it seems like the second issue could be the problem. Most of the pdfs I read contain maths characters and so that could stuff up the conversion to UTF8 as mentioned in that second link. There could be a workaround which just extracts the usable text from pdftotext and discards any unrecognised symbols, rather than just failing.
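The workaround suggested above could look roughly like the following sketch: decode whatever pdftotext produced, drop the bytes that are not valid UTF-8, and keep the rest instead of failing on the whole file. This is not Strigi's code; `lenient_utf8` is a hypothetical name, and the choice to also drop control characters is my own assumption.

```python
def lenient_utf8(raw: bytes) -> str:
    """Keep the usable text from a byte string, discarding what cannot be decoded."""
    # errors="ignore" silently drops byte sequences that are not valid UTF-8
    # instead of raising UnicodeDecodeError for the whole input.
    text = raw.decode("utf-8", errors="ignore")
    # Assumption: control characters (for example the form feeds pdftotext
    # puts between pages) are of no use to an index, so only printable
    # characters, newlines and tabs are kept.
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


if __name__ == "__main__":
    sample = "x² + y".encode("utf-8") + b"\xff broken byte"
    print(lenient_utf8(sample))
```

The trade-off is that some symbols are lost from the index, but the surrounding plain text of a maths-heavy PDF would still be searchable.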
I have to admit that I have many PDFs with nonstandard characters as well. However, this bug applies to all PDFs. I created a number of very simple ASCII-only PDFs and waited for Strigi to index them. The problem persists: I can find them by their filename, not their content. Also, on my system pdftotext is provided by poppler (ver. 10.7). Tracker also uses poppler's pdftotext and it successfully indexes my PDF files. Moreover, if I recall correctly, indexing PDFs with Nepomuk/Strigi used to work (before 4.3?). Thus, it cannot be pdftotext's fault.
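To check independently of Nepomuk that pdftotext itself extracts the text, something like the sketch below can be run over a directory of test PDFs. It calls `pdftotext -enc UTF-8 <file> -` (the trailing `-` sends the text to stdout) and lists the files whose extracted text contains the search term. The function names and the directory layout are assumptions for illustration only.

```python
import subprocess
from pathlib import Path


def extract_text(pdf: Path) -> str:
    """Run pdftotext on one file and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf), "-"],
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext reports failure
    )
    return result.stdout.decode("utf-8", errors="replace")


def pdfs_containing(directory: Path, term: str, extract=extract_text):
    """Return the PDFs in `directory` whose extracted text contains `term`.

    `extract` is injectable so the search logic can be exercised
    without pdftotext being installed.
    """
    hits = []
    for pdf in sorted(directory.glob("*.pdf")):
        try:
            text = extract(pdf)
        except (OSError, subprocess.CalledProcessError):
            continue  # skip files that cannot be converted
        if term.lower() in text.lower():
            hits.append(pdf)
    return hits
```

If this finds the term in files that nepomuksearch:/ only returns by filename, the extraction step works and the problem lies further along in the indexing chain.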
I was suspecting that Strigi was maybe built in an environment without poppler and that there could be linking or building issues, since the package does not have a dependency on it. So just to make sure, I have rebuilt Strigi from the sources with poppler installed. However, the problem persists. And another finding, although this may be a different issue: Nepomuk/Strigi does not seem to index .odt files, either. I have literally thousands of them and cannot get a search result containing even one .odt file without the search term being part of the filename. Anyone else experiencing this?
I can confirm this behavior too. Nepomuk does not search the contents of PDF files. It does search the contents of MS Office 97/XP/2007 .doc files, but not .odt files.
confirmed on kubuntu karmic, kde 4.3.4
Created attachment 38991 xmlindexer.output
Since there is confirmation now, I have filed a separate bug report for Nepomuk not indexing the contents of files in OpenDocument format: https://bugs.kde.org/show_bug.cgi?id=218335
(In reply to comment #10)
> Since there is confirmation now, I have filed a separate bug report for Nepomuk
> not indexing the contents of files in OpenDocument format:
> https://bugs.kde.org/show_bug.cgi?id=218335
And what about PDF? It has also been confirmed.
(In reply to comment #11)
> And what about PDF? It has also been confirmed.
You must have misunderstood what I was writing. This bug report is about PDFs. All I have done is open a separate report for OpenDocument files. That this bug report is marked as "unconfirmed" has nothing to do with it and probably means nothing. This bug is well-known to the Nepomuk developers.
(In reply to comment #12)
> (In reply to comment #11)
> > And what about PDF? It has also been confirmed.
> You must have misunderstood what I was writing. This bug report is about PDFs.
> All I have done is open a separate report for OpenDocument files. That this bug
> report is marked as "unconfirmed" has nothing to do with it and probably means
> nothing. This bug is well-known to the Nepomuk developers.
Sorry, I was dumb. I can also confirm that it is not the fault of pdftotext, it successfully converted every pdf I tried.
The next version of Strigi will improve the pdf analysis. Is there any chance you could test the svn trunk of Strigi?
I can test the latest openSUSE builds of KDE SC 4.4.
I tested with the latest svn trunk of Strigi and the kdelibs and kdebase of KDE SC beta 2. PDF files are still not indexed. To rule out a config file issue, I created a new user with a lot of example files. (Moreover, the indexing of .odt files works now, as I noted in that bug report. Clearly, Strigi does work for everything other than PDFs.)
I just retested with the RC2 packages provided by Arch Linux. I can happily announce that PDF indexing works now. I believe the bug report can be closed. Can anyone confirm this?
Text is extracted from pdfs here. Closing.