Bug 231936

Summary: Desktop search (nepomuk, strigi) and PDF-files
Product: [Unmaintained] nepomuk
Reporter: Christine Bona <Usenetmail.Christine>
Component: general
Assignee: Sebastian Trueg <sebastian>
Status: RESOLVED FIXED
Severity: normal
CC: addammo, CisBug, dmage, dns_hmpf, mbriza, me, mklapetek, octavsly, ojrajala, trueg, tuukka.verho, Usenetmail.Christine, woskimi
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: openSUSE   
OS: Linux   
Latest Commit:
Version Fixed In: 4.7.3
Sentry Crash Report:
Attachments: File to test the patch.

Description Christine Bona 2010-03-23 20:17:43 UTC
Version:            (using KDE 4.4.1)
OS:                Linux
Installed from:    openSUSE RPMs

There is a weird bug concerning full text search in *.PDF files using nepomuk-strigi-services:

None of the PDF documents generated by the export function of OpenOffice.org Writer are indexed by nepomuk-strigi, so their contents cannot be found!

The affected OOo-generated PDF files can be extracted by pdftotext. Other indexing and searching tools like beagle or recoll handle these files correctly, which means they index all of the contents and list these PDF files when a full text search matches the search term.

PDFs generated by other tools, e.g. Acrobat Distiller, are not affected by this bug.
It seems that nepomuk-strigi only has trouble with particular PDF features as they occur in OOo-generated files.
Comment 1 Tuukka Verho 2011-05-07 21:32:29 UTC
The reason for this is that the current PDF analyzer in Strigi is very simple: it assumes that the encoding of the document coincides with ASCII in the ASCII range (i.e. that it is an extension of the ASCII charset). This assumption fails for documents that do not use any of the standard encodings, and the OpenOffice exporter produces such documents.

I wrote a more advanced PDF analyzer before Christmas that handles all documents, but it seems none of the Strigi developers have had time to evaluate the patch. Since this may not be fixed soon, I thought I could at least explain what is behind this issue.
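
To illustrate the failure mode described above, here is a minimal sketch in Python. It is not Strigi's code: the function names, the byte codes and the code-to-character table are all made up for illustration. It only shows why treating the byte codes of a text string as ASCII works for a font with a standard-like encoding but yields nothing searchable when the font uses its own code assignment, in which case an explicit mapping is needed.

```python
from typing import Mapping


def naive_decode(codes: bytes) -> str:
    """Treat every code as its ASCII character (the simple analyzer's assumption)."""
    return "".join(chr(c) if 32 <= c < 127 else "?" for c in codes)


def mapped_decode(codes: bytes, code_to_char: Mapping[int, str]) -> str:
    """Translate codes through an explicit code-to-character table."""
    return "".join(code_to_char.get(c, "\ufffd") for c in codes)


# Case 1: codes coincide with ASCII, so the naive approach works.
standard_codes = b"Hello"

# Case 2 (hypothetical): the font numbers its glyphs in an arbitrary order,
# so the codes carry no ASCII meaning without the accompanying table.
custom_map = {1: "H", 2: "e", 3: "l", 4: "o"}
custom_codes = bytes([1, 2, 3, 3, 4])

print(naive_decode(standard_codes))             # Hello
print(naive_decode(custom_codes))               # ?????
print(mapped_decode(custom_codes, custom_map))  # Hello
```

In the second case the naive decoder produces only placeholder characters, which matches the symptom reported here: the file is processed but no searchable words end up in the index.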
Comment 2 Christine Bona 2011-07-25 00:06:23 UTC
Thank you very much, Tuukka, for your information about the technical background of this nasty bug! From my point of view, a correctly working full text search in all kinds of PDFs is really essential.
It's a great pity that your patch for strigi and all your efforts have not yet received the recognition they deserve!
I hope the devs will pay more attention to this problem and to your contribution towards a solution in the future.
Thanks a lot! At least it's good to have an idea of the reason for this fault.
Comment 3 Hans-Dieter Schulze 2011-08-01 14:56:19 UTC
*** This bug has been confirmed by popular vote. ***
Comment 4 ojrajala 2011-08-22 19:46:24 UTC
Indexing PDFs in general tends to take a lot of time and puts a very high load on the system. I have a folder with 870 PDFs (books), and the system couldn't index it overnight. This is from uptime, after a reboot and with indexing resumed:

[root@lappy bin]# uptime
08:43:00 up  7:29,  0 users,  load average: 2.02, 1.93, 1.67

KDE 4.7.0 x86_64
Comment 5 Sebastian Trueg 2011-10-06 07:09:19 UTC
*** Bug 274895 has been marked as a duplicate of this bug. ***
Comment 6 Martin Klapetek 2011-10-06 07:13:35 UTC
With the latest trunk, my libstreamanalyzer crashes on certain PDFs: every PDF created by LibreOffice crashes it every time. I don't have a usable backtrace yet as I don't have debug symbols installed. I will get one later today.
Comment 7 Sebastian Trueg 2011-10-26 11:01:19 UTC
Git commit db8f01aaa990dfd625ab705f3f84bef4c6f85896 by Sebastian Trueg.
Committed on 26/10/2011 at 13:00.
Pushed by trueg into branch 'KDE/4.7'.

Use pdftotext to create nie:plainTextContent for PDF files.

Strigi's pdf handling is still a bit problematic. There are a lot of pdf
files out there which it cannot handle. Thus, we now employ a hack and
call pdftotext for each pdf file to extract the plain text. That way
searching by content should work nicely now.

BUG: 231936

FIXED-IN: 4.7.3

M  +33   -0    nepomuk/services/strigi/indexer/nepomukindexwriter.cpp
M  +3    -0    nepomuk/services/strigi/indexer/nepomukindexwriter.h

http://commits.kde.org/kde-runtime/db8f01aaa990dfd625ab705f3f84bef4c6f85896
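
The approach described in the commit message, calling pdftotext once per PDF file and using its output as the plain text content, can be sketched as follows. This is a Python illustration only and not the code added to nepomukindexwriter.cpp; the function name and its `command` parameter are assumptions made for the example. It relies on pdftotext writing to standard output when the output file is given as `-`, and on its `-enc` option to request UTF-8.

```python
import subprocess
from typing import Optional, Sequence


def extract_plain_text(
    pdf_path: str,
    command: Sequence[str] = ("pdftotext", "-enc", "UTF-8"),
) -> Optional[str]:
    """Run an external extractor on pdf_path and return its output as text.

    Returns None if the extractor cannot be started or exits with an error,
    so the caller can simply skip the plain text for that file.
    """
    try:
        result = subprocess.run(
            [*command, pdf_path, "-"],  # "-" asks pdftotext to write to stdout
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The extractor is not installed or not executable.
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

A caller would use something like `text = extract_plain_text("/path/to/file.pdf")` and store the result only when it is not None. Keeping the command configurable is a design choice made for this sketch so that the function can be exercised without pdftotext installed.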
Comment 8 Sebastian Trueg 2011-10-26 11:01:44 UTC
Git commit e1553328daafc03395077993f25ea025963bf3e4 by Sebastian Trueg.
Committed on 26/10/2011 at 13:00.
Pushed by trueg into branch 'master'.

Use pdftotext to create nie:plainTextContent for PDF files.

Strigi's pdf handling is still a bit problematic. There are a lot of pdf
files out there which it cannot handle. Thus, we now employ a hack and
call pdftotext for each pdf file to extract the plain text. That way
searching by content should work nicely now.

BUG: 231936

FIXED-IN: 4.7.3

M  +33   -0    nepomuk/services/fileindexer/indexer/nepomukindexwriter.cpp
M  +3    -0    nepomuk/services/fileindexer/indexer/nepomukindexwriter.h

http://commits.kde.org/kde-runtime/e1553328daafc03395077993f25ea025963bf3e4
Comment 9 Sebastian Trueg 2011-10-26 12:37:26 UTC
Git commit 888d234c1b59a7cdd2679a9f371d2d8a09860f90 by Sebastian Trueg.
Committed on 26/10/2011 at 12:57.
Pushed by trueg into branch 'master'.

Use pdftotext to create nie:plainTextContent for PDF files.

Strigi's pdf handling is still a bit problematic. There are a lot of pdf
files out there which it cannot handle. Thus, we now employ a hack and
call pdftotext for each pdf file to extract the plain text. That way
searching by content should work nicely now.

BUG: 231936

M  +33   -0    services/fileindexer/indexer/nepomukindexwriter.cpp
M  +3    -0    services/fileindexer/indexer/nepomukindexwriter.h

http://commits.kde.org/nepomuk-core/888d234c1b59a7cdd2679a9f371d2d8a09860f90
Comment 10 Christine Bona 2011-11-05 16:36:53 UTC
After upgrading my system to KDE 4.7.3, indexing of PDF files now works really well, as far as I can see after doing some tests.
Thank you *very* much for your work!

But unfortunately another problem has arisen:
Searching for content in *.odt files (LibreOffice/OpenOffice text documents) doesn't work anymore.
This still works in KDE 4.6.5, but is broken in KDE 4.7.3.

Steps to reproduce:

Create a LibreOffice text document with some content and save it in a folder that is indexed by strigi.
Save this doc as *.odt
Save this doc as *.pdf
Save this doc as *.doc
Save this doc as *.txt

After waiting some time, do a full text search via dolphin for a characteristic word from the text documents.

Actual results:
*.pdf, *.doc and *.txt files are found

Expected results:
*.odt files should be found too.
Comment 11 Sebastian Trueg 2011-11-16 09:28:24 UTC
Maybe the odt problem is related to bug 285834.
Comment 12 Christine Bona 2011-11-19 20:35:10 UTC
Some more findings related to bug https://bugs.kde.org/show_bug.cgi?id=285834

I have done a fresh install of openSUSE 12.1 (KDE 4.7.3) in a VM, so there is no leftover state from an old home directory that could interfere.

I have tested two scenarios:

1.) Test text documents as mentioned above in comment #10, created by LibreOffice.

Results in this case are the same as before: *.odt files are not found, whereas *.pdf, *.txt and *.rtf files are indexed and found.

Looking at the file properties dialog (information tab, via dolphin) reveals that no content was extracted from the *.odt file. So there is nothing (no text) that could be searched for.
Maybe ODTs (which are a kind of zip format) are not being extracted properly by strigi?


2.) I fetched the "buggy-test-file" (ODT) mentioned in bug 285834, which was created by calligra and which originally contains just an image in a frame, no text.
I added some text and saved this document in different formats as before.

Mysteriously, the results differ from scenario no. 1:

*This* ODT is indexed, the file properties (info tab) show content (words) and a full text search succeeds.
Comment 13 DE LEO Francesco 2012-05-13 07:21:15 UTC
Created attachment 71061 [details]
File to test the patch.
Comment 14 DE LEO Francesco 2012-05-13 07:22:21 UTC
Comment on attachment 71061 [details]
File to test the patch.

I use Fedora 16 upgraded to KDE 4.7.4. Since the first installation, with KDE 4.7.2, I have not been able to index most PDF files. After the upgrade to 4.7.4 nothing has changed: most PDF files still are not indexed.
I attach a sample PDF file. The program pdftotext works with it.

Does the patch really work?
Comment 15 Martin Bříza 2012-05-14 11:01:29 UTC
This bug is also reported in the RH Bugzilla for Fedora 16: https://bugzilla.redhat.com/show_bug.cgi?id=821213
Comment 16 DE LEO Francesco 2012-05-25 09:59:04 UTC
Upgraded strigi to version 0.7.7 and now most pdf files are indexed.
Comment 17 Dirk Sarpe 2012-08-06 08:45:01 UTC
Because of hiccups in my old Nepomuk database I started with a clean one in KDE 4.9. pdftotext kept my fan very active; the process kept running for many hours and did not terminate until I killed it. Before killing it I looked up the command line of the process via 'cat /proc/15716/cmdline'. The file that kept pdftotext busy was a big correlation graph I once created with R.

You can download the file from https://82.161.138.100/owncloud/public.php?service=files&token=9d0efbe6cc841c79bcf516eb41dc7e0ee240e6e5&file=/testsync/correlations.pdf (self-signed HTTPS certificate, so ignore the warning; the file is ~ 40 MB).

Side note: the whole indexing took a very long time (days on a thinkpad 201). It finished shortly after I killed the pdftotext process. It might be a coincidence, but I am wondering whether one stray process can keep the indexer busy?

Virtuoso version: 6.1.4+dfsg1-0ubuntu1
Kubuntu 12.04 with kubuntu backports-ppa https://launchpad.net/~kubuntu-ppa/+archive/backports?field.series_filter=precise
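
One way an indexer could keep a single pathological file from occupying it for hours is to put a time limit on the external extractor. The following Python sketch shows the idea under stated assumptions: the function name and the timeout value are hypothetical, and nothing in this report says that the Nepomuk indexer does or will do this. It uses `subprocess.run` with its `timeout` argument, which kills the child process when the limit is exceeded.

```python
import subprocess
from typing import Optional, Sequence


def run_with_timeout(argv: Sequence[str], timeout_seconds: float) -> Optional[bytes]:
    """Run an external command and return its stdout, or None on failure.

    If the command is still running after timeout_seconds, subprocess.run
    kills it and raises TimeoutExpired, which is reported here as None.
    """
    try:
        result = subprocess.run(
            list(argv),
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # The extractor took too long on this file; give up on it.
        return None
    except OSError:
        # The command could not be started at all.
        return None
    return result.stdout if result.returncode == 0 else None
```

For example, `run_with_timeout(["pdftotext", "correlations.pdf", "-"], 60.0)` would give up on the file after one minute instead of running until it is killed by hand. A suitable limit depends on the machine and the files, so the value of 60 seconds is only a placeholder.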