Bug 320384

Summary:	Indexer loops on MSOFFICE files
Product:	[Unmaintained] nepomuk	Reporter:	Hrvoje Senjan <hrvoje.senjan>
Component:	fileindexer	Assignee:	Nepomuk Bugs Coordination <nepomuk-bugs>
Status:	RESOLVED FIXED
Severity:	normal	CC:	me, nepomuk-bugs, stephan.diestelhorst
Priority:	NOR
Version First Reported In:	git master
Target Milestone:	---
Platform:	Compiled Sources
OS:	Linux
Latest Commit:	http://commits.kde.org/nepomuk-core/fcb3df91a5824b25741042ef0837cf28665deeb3	Version Fixed/Implemented In:
Attachments:	Fixes pageCount and wordCount problem

Description Hrvoje Senjan 2013-05-28 11:39:52 UTC

Happens with PowerPoint presentations, Word documents, and spreadsheets.

nepomukfileindexer(25541)/nepomuk (strigi service) Nepomuk2::FileIndexingJob::start: Running "/usr/bin/nepomukindexer" "/home/hrvoje/Pictures/fotić/Vilijev doček 5.6.2010/New Microsoft Office PowerPoint Presentation.pptx"
nepomukindexer(3730)/nepomuk (library) Nepomuk2::ResourceManagerPrivate::_k_storageServiceInitialized: Nepomuk Storage service up and initialized.
nepomukindexer(3730)/nepomuk (strigi service) Nepomuk2::Indexer::indexFile:  QUrl( "nepomuk:/res/260dce6b-c1d8-4cb5-ab2a-80129f54306a" )  "application/vnd.openxmlformats-officedocument.presentationml.presentation"
nepomukstorage(25513)/nepomuk (storage service) Nepomuk2::DataManagementModel::storeResources:  MERGING FAILED!
nepomukstorage(25513)/nepomuk (storage service) Nepomuk2::DataManagementModel::storeResources: Setting error! "Invalid argument (1)": "<http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#wordCount> has a rdfs:domain of <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#TextDocument>. <nepomuk:/res/260dce6b-c1d8-4cb5-ab2a-80129f54306a> only has the following types <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#FileDataObject>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#DataObject>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#Document>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>"

Reproducible: Always

Steps to Reproduce:
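Try to index a file of one of the types mentioned above.
Actual Results:  
The indexer loops on the file, and the storage service rejects the data with the error shown in the log above.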

Expected Results:  
Files get indexed without looping.
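
For illustration, the storage error in the log above can be reproduced with a minimal Python sketch of an rdfs:domain check. This is not Nepomuk's actual validation code. The `PROPERTY_DOMAINS` table and the `domain_violations` helper are hypothetical names, and the set of resource types is assumed to already include all superclasses, as it does in the logged message.

```python
NFO = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"

# Hypothetical, trimmed-down domain table. The two entries mirror the
# rdfs:domain statements quoted in the error messages of this report.
PROPERTY_DOMAINS = {
    NFO + "wordCount": NFO + "TextDocument",
    NFO + "pageCount": NFO + "PaginatedTextDocument",
}


def domain_violations(resource_types, properties):
    """Return (property, required_domain) pairs the resource does not satisfy.

    resource_types is assumed to be the full set of types of the resource,
    superclasses included.
    """
    violations = []
    for prop in properties:
        domain = PROPERTY_DOMAINS.get(prop)
        if domain is not None and domain not in resource_types:
            violations.append((prop, domain))
    return violations


if __name__ == "__main__":
    # A presentation typed only as nfo:Document, as in the log above.
    pptx_types = {NFO + "FileDataObject", NFO + "Document"}
    for prop, domain in domain_violations(pptx_types, [NFO + "wordCount"]):
        print(prop, "requires type", domain)
```

A resource typed only as nfo#Document fails the check for nfo#wordCount, which matches the "Invalid argument" error reported by the storage service.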

Comment 1 Vishesh Handa 2013-05-28 11:41:22 UTC

Confirmed.

I know about this. I've brought up the issue on the nepomuk mailing list. We can either change the ontologies or fix the indexer.

I should also add some code to make sure that faulty indexers do not cause the indexer to loop forever.

Comment 2 Hrvoje Senjan 2013-05-28 11:47:02 UTC

(In reply to comment #1)
> Confirmed.
> 
> I know about this. I've brought up the issue on the nepomuk mailing list. We
> can either change the ontologies or fix the indexer.
OK, I guess the ontologies can be changed, as we already need a new release for 4.11.

> I should also add some code to make sure that faulty indexers do not cause
> the indexer to loop forever.
Idea (not sure if it is possible or how hard it is to implement): if the indexer fails to index an unchanged file x times, stop trying to index it.

Comment 3 Vishesh Handa 2013-05-28 11:49:20 UTC

> > I should also add some code to make sure that faulty indexers do not cause
> > the indexer to loop forever.
> Idea(not sure if possible/how hard to implement): if indexer fails to index
> unchanged file for x times, stop trying to index it...

It's not too hard. I have been meaning to implement it for some time now. Maybe I'll do it this week.
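
A minimal Python sketch of the retry limit discussed above, assuming failures are counted per (path, modification time) pair so that a changed file gets a fresh set of attempts. The `FailureTracker` class and its `max_failures` parameter are hypothetical and do not reflect how the file indexer was eventually changed.

```python
class FailureTracker:
    """Counts failed indexing attempts per unchanged file (hypothetical helper)."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self._failures = {}  # (path, mtime) -> number of failed attempts

    def should_index(self, path, mtime):
        # An unchanged file keeps its (path, mtime) key; once the limit is
        # reached it is skipped until its modification time changes.
        return self._failures.get((path, mtime), 0) < self.max_failures

    def record_failure(self, path, mtime):
        key = (path, mtime)
        self._failures[key] = self._failures.get(key, 0) + 1


if __name__ == "__main__":
    tracker = FailureTracker(max_failures=2)
    path, mtime = "/tmp/broken.pptx", 1369741192
    attempts = 0
    while tracker.should_index(path, mtime):
        attempts += 1
        tracker.record_failure(path, mtime)  # pretend the indexer failed
    print("gave up after", attempts, "attempts")
```

Keying on the modification time means a file that was repaired or edited is retried automatically, while a file that stays broken stops being retried after the limit.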

Comment 4 Hrvoje Senjan 2013-05-28 12:07:51 UTC

(In reply to comment #3)
> > > I should also add some code to make sure that faulty indexers do not cause
> > > the indexer to loop forever.
> > Idea(not sure if possible/how hard to implement): if indexer fails to index
> > unchanged file for x times, stop trying to index it...
> 
> It's not too hard. I have been meaning to implement it for some time now.
> Maybe I'll do it this week.
Great! :-)

Also, at least for docx, nfo#pageCount needs adjusting:
<http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#pageCount> has a rdfs:domain of <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument>. <nepomuk:/res/9b4a50a1-7c0b-4fe9-9146-1abf4588c09c> only has the following types <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#FileDataObject>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#DataObject>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#Document>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>

Comment 5 Vishesh Handa 2013-05-28 12:18:50 UTC

Created attachment 80119
Fixes pageCount and wordCount problem

Could you please test? I don't seem to have many msoffice documents.

Comment 6 Hrvoje Senjan 2013-05-28 12:28:09 UTC

Sure, I'll test it ;-)

Comment 7 Hrvoje Senjan 2013-05-28 12:43:12 UTC

Yup, works :-) Only the page count always seems to be 1.

Comment 8 Vishesh Handa 2013-05-28 13:34:44 UTC

You'll need to file a separate bug for that and maybe attach a document I can use to test it out.

Committing the patch above.

Comment 9 Vishesh Handa 2013-05-28 13:58:12 UTC

Git commit fcb3df91a5824b25741042ef0837cf28665deeb3 by Vishesh Handa.
Committed on 28/05/2013 at 15:29.
Pushed by vhanda into branch 'master'.

Office2007Extractor: Only add pageCount and wordCount for documents

The ontologies do not support it for presentations and spreadsheets

M  +17   -14   services/fileindexer/indexer/office2007extractor.cpp

http://commits.kde.org/nepomuk-core/fcb3df91a5824b25741042ef0837cf28665deeb3
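
The idea behind the commit message can be sketched in Python as follows. This is not the actual change to office2007extractor.cpp, whose contents are not shown in this report. It only assumes that pageCount and wordCount are kept for word-processing documents and dropped for everything else. The `properties_to_store` helper and the plain dictionary of extracted values are hypothetical.

```python
OOXML = "application/vnd.openxmlformats-officedocument."
WORD_MIME = OOXML + "wordprocessingml.document"
PRESENTATION_MIME = OOXML + "presentationml.presentation"
SPREADSHEET_MIME = OOXML + "spreadsheetml.sheet"

# Properties assumed to be valid only for word-processing documents.
DOCUMENT_ONLY = {"pageCount", "wordCount"}


def properties_to_store(mime_type, extracted):
    """Return the extracted properties that are safe to store for mime_type."""
    if mime_type == WORD_MIME:
        return dict(extracted)
    # Presentations and spreadsheets: drop the document-only properties.
    return {k: v for k, v in extracted.items() if k not in DOCUMENT_ONLY}


if __name__ == "__main__":
    extracted = {"title": "Slides", "wordCount": 120, "pageCount": 4}
    print(properties_to_store(PRESENTATION_MIME, extracted))
    print(properties_to_store(WORD_MIME, extracted))
```

Filtering on the extractor side leaves the ontologies untouched, which is the second of the two options mentioned in comment 1.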

Comment 10 Stephan Diestelhorst 2013-08-22 09:19:00 UTC

This still seems to happen with the latest KDE 4.11. Is the patch part of the release? I think skipping a file that the indexer has choked on earlier should be implemented, too.

Comment 11 Stephan Diestelhorst 2013-08-22 10:46:52 UTC

(In reply to comment #10)
> This still seems to happen with the latest KDE 4.11, is the patch part of
> the release?  I think the functionality of skipping a file when choking on
> it earlier should be implemented, too.

I just inspected the broken files (one PPTX and one ODT). Both are corrupted, which running the file command on them detects properly. I tried adding their names to the ignore list in Nepomuksettings, but that does not prevent them from being indexed. Renaming the files to foo.odt.BROKEN and then adding *.BROKEN to the ignore list does not help either.
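
For reference, this is the behaviour one would expect from a glob-style ignore list, sketched in Python with the standard fnmatch module. It is not how Nepomuk evaluates its exclude filters. The `is_ignored` helper is hypothetical and assumes patterns are matched against the file name only.

```python
import fnmatch
import os


def is_ignored(path, patterns):
    """Return True if the file name of path matches any glob pattern."""
    name = os.path.basename(path)
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)


if __name__ == "__main__":
    patterns = ["*.BROKEN"]
    for path in ("/home/user/foo.odt.BROKEN", "/home/user/foo.odt"):
        print(path, "ignored" if is_ignored(path, patterns) else "indexed")
```

Under these assumptions both a literal file name and the *.BROKEN pattern would exclude the renamed file, so the behaviour described above points to the ignore list not being consulted for these files.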

Comment 12 Christoph Feck 2013-08-29 23:48:30 UTC

Stephan, could you please report a new bug and attach the documents needed to reproduce it there?