Version: (using KDE 4.4.1)
OS: Linux
Installed from: openSUSE RPMs
There is a weird bug concerning full-text search in PDF files using the nepomuk-strigi services: PDF documents generated by the export function of OpenOffice.org Writer are not indexed by nepomuk-strigi, so their contents cannot be found. Text can be extracted from the affected OOo-generated PDF files with pdftotext. Other indexing and search tools such as Beagle or Recoll handle these files correctly: they index the full contents and return these PDF files when a full-text search matches the search term. PDFs generated by, e.g., Acrobat Distiller are not affected. The problem seems to be that nepomuk-strigi cannot cope with particular PDF features that occur in OOo-generated files.
The reason for this is that the current PDF analyzer in Strigi is very simple: it assumes that the document's encoding matches ASCII in the ASCII range (i.e. that it is a superset of the ASCII charset). This assumption fails for documents that do not use any of the standard encodings, and the OpenOffice exporter produces such documents. I wrote a more advanced PDF analyzer before Christmas that handles all documents, but it seems none of the Strigi developers have had time to evaluate the patch. Since this may not be fixed soon, I thought I could at least tell you what is behind this issue.
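To illustrate the failure mode described above, here is a minimal Python sketch. It is not Strigi's code. It assumes a hypothetical subset font whose character codes are handed out in order of first use; the `char_to_code` table stands in for the mapping information a real PDF carries alongside the font. All function names are made up for the example.

```python
def build_subset_encoding(text):
    """Assign character codes 1, 2, 3, ... in order of first use (assumed scheme)."""
    char_to_code = {}
    for ch in text:
        if ch not in char_to_code:
            char_to_code[ch] = len(char_to_code) + 1
    return char_to_code


def encode(text, char_to_code):
    """Encode text as the byte string a content stream would contain."""
    return bytes(char_to_code[ch] for ch in text)


def naive_ascii_decode(raw):
    """Treat every byte as ASCII and keep only printable characters."""
    return "".join(chr(b) for b in raw if 32 <= b < 127)


def decode_with_map(raw, char_to_code):
    """Decode by inverting the font-specific mapping."""
    code_to_char = {code: ch for ch, code in char_to_code.items()}
    return "".join(code_to_char[b] for b in raw)


text = "full text search"
encoding = build_subset_encoding(text)
raw = encode(text, encoding)

print(repr(naive_ascii_decode(raw)))   # nothing printable survives
print(decode_with_map(raw, encoding))  # the original text is recovered
```

With such an encoding the bytes in the content stream bear no relation to ASCII, so an analyzer that reads them as ASCII finds nothing to index, while a tool that consults the mapping extracts the text without trouble.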
Thank you very much, Tuukka, for the technical background on this nasty bug! From my point of view, a correctly working full-text search across all kinds of PDFs is essential. It's a great pity that your patch for Strigi and all your efforts have not yet received the recognition they deserve. I hope the devs will give more attention to this problem and to your contribution towards a solution in the future. Thanks a lot! At least it's good to have an idea of the reason for this fault.
*** This bug has been confirmed by popular vote. ***
Indexing PDFs in general tends to take a lot of time and causes extremely high system load. I have a folder with 870 PDFs (books), and the system couldn't index it overnight. This is from uptime, after a reboot and with indexing running again:
[root@lappy bin]# uptime
08:43:00 up 7:29, 0 users, load average: 2.02, 1.93, 1.67
KDE 4.7.0 x86_64
*** Bug 274895 has been marked as a duplicate of this bug. ***
With latest trunk my libstreamanalyzer crashes on certain pdfs - every pdf created by LibreOffice crashes it every time. I don't have usable bt yet as I don't have debug symbols installed. Will do later today.
Git commit db8f01aaa990dfd625ab705f3f84bef4c6f85896 by Sebastian Trueg.
Committed on 26/10/2011 at 13:00.
Pushed by trueg into branch 'KDE/4.7'.
Use pdftotext to create nie:plainTextContent for PDF files.
Strigi's pdf handling is still a bit problematic. There is a lot of pdf files out there which it cannot handle. Thus, we now employ a hack and call pdftotext for each pdf file to extract the plain text. That way searching by content should work nicely now.
BUG: 231936
FIXED-IN: 4.7.3
M +33 -0 nepomuk/services/strigi/indexer/nepomukindexwriter.cpp
M +3 -0 nepomuk/services/strigi/indexer/nepomukindexwriter.h
http://commits.kde.org/kde-runtime/db8f01aaa990dfd625ab705f3f84bef4c6f85896
Git commit e1553328daafc03395077993f25ea025963bf3e4 by Sebastian Trueg.
Committed on 26/10/2011 at 13:00.
Pushed by trueg into branch 'master'.
Use pdftotext to create nie:plainTextContent for PDF files.
Strigi's pdf handling is still a bit problematic. There is a lot of pdf files out there which it cannot handle. Thus, we now employ a hack and call pdftotext for each pdf file to extract the plain text. That way searching by content should work nicely now.
BUG: 231936
FIXED-IN: 4.7.3
M +33 -0 nepomuk/services/fileindexer/indexer/nepomukindexwriter.cpp
M +3 -0 nepomuk/services/fileindexer/indexer/nepomukindexwriter.h
http://commits.kde.org/kde-runtime/e1553328daafc03395077993f25ea025963bf3e4
Git commit 888d234c1b59a7cdd2679a9f371d2d8a09860f90 by Sebastian Trueg.
Committed on 26/10/2011 at 12:57.
Pushed by trueg into branch 'master'.
Use pdftotext to create nie:plainTextContent for PDF files.
Strigi's pdf handling is still a bit problematic. There is a lot of pdf files out there which it cannot handle. Thus, we now employ a hack and call pdftotext for each pdf file to extract the plain text. That way searching by content should work nicely now.
BUG: 231936
M +33 -0 services/fileindexer/indexer/nepomukindexwriter.cpp
M +3 -0 services/fileindexer/indexer/nepomukindexwriter.h
http://commits.kde.org/nepomuk-core/888d234c1b59a7cdd2679a9f371d2d8a09860f90
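The workaround in these commits, calling pdftotext once per PDF and using its output as the plain-text content, can be sketched roughly as follows. This is an illustrative Python sketch, not the C++ code from nepomukindexwriter.cpp. The function names, the `-enc UTF-8` option choice and the timeout are assumptions of the example and are not taken from the commit.

```python
import subprocess


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Build the argument list; "-" as output file sends the text to stdout."""
    return [executable, "-enc", "UTF-8", pdf_path, "-"]


def extract_plain_text(pdf_path, executable="pdftotext", timeout_seconds=60):
    """Return the extracted text, or None if pdftotext is missing, fails or hangs."""
    try:
        result = subprocess.run(
            build_pdftotext_command(pdf_path, executable),
            capture_output=True,
            timeout=timeout_seconds,  # assumption: give up on very slow files
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

A caller would store the returned string as the document's plain-text content and skip the file when the function returns None.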
After upgrading my system to KDE 4.7.3, indexing of PDF files now works really well as far as I can see after some tests. Thank you *very* much for your work! Unfortunately, another problem has arisen: searching for content in *.odt files (LibreOffice/OpenOffice text documents) no longer works. This still works in KDE 4.6.5 but is broken in KDE 4.7.3.
Steps to reproduce:
Create a LibreOffice text document with some content and save it in a folder that is indexed by Strigi.
Save this doc as *.odt
Save this doc as *.pdf
Save this doc as *.doc
Save this doc as *.txt
After waiting some time, do a full-text search via Dolphin for a characteristic word from the text documents.
Actual results: the *.pdf, *.doc and *.txt files are found.
Expected results: the *.odt file should be found too.
Maybe the odt problem is related to bug 285834.
Some more observations in connection with bug 285834 (https://bugs.kde.org/show_bug.cgi?id=285834). I did a fresh install of openSUSE 12.1 (KDE 4.7.3) in a VM, so nothing is carried over from an old home directory. I tested two scenarios:
1.) Test text documents created by LibreOffice, as described above in comment #10. The results are the same as before: *.odt files are not found, whereas *.pdf, *.txt and *.rtf files are indexed and found. The file properties dialog (Information tab, via Dolphin) shows no content for the *.odt file, so there is no text that could be searched for. Maybe ODTs (which are a kind of zip archive) are not being extracted properly by Strigi?
2.) I fetched the "buggy-test-file" (ODT) mentioned in bug 285834, which was created by Calligra and originally contains just an image in a frame, no text. I added some text and saved this document in the same formats as before. Mysteriously, the results differ from scenario 1: *this* ODT is indexed, the file properties (Information tab) show content (words), and a full-text search succeeds.
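One way to check, independently of Strigi, whether an ODT actually contains extractable text is to open it as a zip archive and read the text nodes of its content.xml. The following Python sketch does this with the standard library only; the function name is made up for the example, and it deliberately ignores paragraph structure and styles.

```python
import zipfile
import xml.etree.ElementTree as ET


def odt_plain_text(odt_file):
    """Return the text in content.xml of an ODT (given as a path or file object)."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # itertext() walks all text nodes; paragraph boundaries are not preserved.
    pieces = (piece.strip() for piece in root.itertext())
    return " ".join(piece for piece in pieces if piece)
```

If this returns the expected words for the LibreOffice-created file, the document itself is fine and the missing content points at the indexer's handling of the file.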
Created attachment 71061: File to test the patch.
Comment on attachment 71061: File to test the patch. I use Fedora 16, upgraded to KDE 4.7.4. Since the first installation, with KDE 4.7.2, I have not been able to index most PDF files. After the upgrade to 4.7.4 nothing has changed; most PDF files still are not indexed. I attach a sample PDF file. pdftotext works with it. Does the patch really work?
This bug is also reported in the RH Bugzilla for Fedora 16: https://bugzilla.redhat.com/show_bug.cgi?id=821213
Upgraded strigi to version 0.7.7 and now most pdf files are indexed.
Because of hiccups in my old Nepomuk database I started with a clean one in KDE 4.9. pdftotext kept my fan very busy; the process ran for many hours and did not terminate until I killed it. Before killing it I looked up the command line of the process via 'cat /proc/15716/cmdline'. The file that kept pdftotext busy was a big correlation graph I once created with R. You can download the file from https://82.161.138.100/owncloud/public.php?service=files&token=9d0efbe6cc841c79bcf516eb41dc7e0ee240e6e5&file=/testsync/correlations.pdf (self-signed https, so ignore the warning; the file is ~40 MB). Side note: the whole indexing run took a very long time (days on a ThinkPad 201). It finished shortly after I killed the pdftotext process. It might be a coincidence, but I wonder whether one stray process can keep the indexer busy.
Virtuoso version: 6.1.4+dfsg1-0ubuntu1
Kubuntu 12.04 with the kubuntu backports PPA https://launchpad.net/~kubuntu-ppa/+archive/backports?field.series_filter=precise
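Regarding the 'cat /proc/<pid>/cmdline' lookup mentioned above: the arguments in that file are separated by NUL bytes, so plain cat prints them run together. A small Python sketch (Linux only, helper names made up for the example) that splits them into a readable list:

```python
def parse_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminator after the last argument
    if not raw:
        return []
    return [arg.decode("utf-8", errors="replace") for arg in raw.split(b"\0")]


def read_cmdline(pid):
    """Read and parse the command line of a running process (Linux /proc)."""
    with open(f"/proc/{pid}/cmdline", "rb") as handle:
        return parse_cmdline(handle.read())
```

For a stuck pdftotext process, the PDF path shows up as one of the returned arguments, which identifies the file the indexer is waiting on.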