Bug 320384 - Indexer loops on MSOFFICE files
Summary: Indexer loops on MSOFFICE files
Status: RESOLVED FIXED
Alias: None
Product: nepomuk
Classification: Miscellaneous
Component: fileindexer
Version: git master
Platform: Compiled Sources Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Nepomuk Bugs Coordination
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-05-28 11:39 UTC by Hrvoje Senjan
Modified: 2013-08-29 23:48 UTC
CC List: 3 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
Fixes pageCount and wordCount problem (3.19 KB, application/octet-stream)
2013-05-28 12:18 UTC, Vishesh Handa

Description Hrvoje Senjan 2013-05-28 11:39:52 UTC
Happens with PowerPoint presentations, Word documents, and also spreadsheets.

nepomukfileindexer(25541)/nepomuk (strigi service) Nepomuk2::FileIndexingJob::start: Running "/usr/bin/nepomukindexer" "/home/hrvoje/Pictures/fotić/Vilijev doček 5.6.2010/New Microsoft Office PowerPoint Presentation.pptx"
nepomukindexer(3730)/nepomuk (library) Nepomuk2::ResourceManagerPrivate::_k_storageServiceInitialized: Nepomuk Storage service up and initialized.
nepomukindexer(3730)/nepomuk (strigi service) Nepomuk2::Indexer::indexFile:  QUrl( "nepomuk:/res/260dce6b-c1d8-4cb5-ab2a-80129f54306a" )  "application/vnd.openxmlformats-officedocument.presentationml.presentation"
nepomukstorage(25513)/nepomuk (storage service) Nepomuk2::DataManagementModel::storeResources:  MERGING FAILED!
nepomukstorage(25513)/nepomuk (storage service) Nepomuk2::DataManagementModel::storeResources: Setting error! "Invalid argument (1)": "<http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#wordCount> has a rdfs:domain of <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#TextDocument>. <nepomuk:/res/260dce6b-c1d8-4cb5-ab2a-80129f54306a> only has the following types <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#FileDataObject>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#DataObject>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#Document>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>"
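The error above is a domain check: nfo#wordCount may only be set on a resource that has the type nfo#TextDocument, and the PPTX resource is only an nfo#Document. The following is a minimal Python sketch of such a check, not the storage service's actual code. The `DOMAINS` table and the `domain_violations` function are hypothetical names, and the sketch assumes the type list already includes all supertypes, as the list in the log does.

```python
NFO = "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#"

# Hypothetical table: property -> class required by its rdfs:domain.
# Only the domain quoted in the log above is listed.
DOMAINS = {
    NFO + "wordCount": NFO + "TextDocument",
}


def domain_violations(resource_types, properties):
    """Return the properties whose rdfs:domain is not among the resource's types."""
    types = set(resource_types)
    return [p for p in properties if p in DOMAINS and DOMAINS[p] not in types]


# Two of the types the log reports for the PPTX resource.
pptx_types = [NFO + "FileDataObject", NFO + "Document"]

# wordCount is reported as a violation because nfo#TextDocument is missing.
print(domain_violations(pptx_types, [NFO + "wordCount"]))
```

Under this reading there are two ways out, matching the options discussed in the comments: widen the domain in the ontology, or stop the indexer from emitting the property for resources that lack the required type.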

Reproducible: Always

Steps to Reproduce:
Try to index the file types mentioned above
Actual Results:  
As noted

Expected Results:  
Files get indexed without looping
Comment 1 Vishesh Handa 2013-05-28 11:41:22 UTC
Confirmed.

I know about this. I've brought up the issue on the nepomuk mailing list. We can either change the ontologies or fix the indexer.

I should also add some code to make sure that faulty indexers do not cause the indexer to loop forever.
Comment 2 Hrvoje Senjan 2013-05-28 11:47:02 UTC
(In reply to comment #1)
> Confirmed.
> 
> I know about this. I've brought up the issue on the nepomuk mailing list. We
> can either change the ontologies or fix the indexer.
OK, I guess the ontologies can be changed, as we already need a new release for 4.11.

> I should also add some code to make sure that faulty indexers do not cause
> the indexer to loop forever.
Idea (not sure if it is possible or how hard it is to implement): if the indexer fails to index an unchanged file x times, stop trying to index it...
Comment 3 Vishesh Handa 2013-05-28 11:49:20 UTC
> > I should also add some code to make sure that faulty indexers do not cause
> > the indexer to loop forever.
> Idea(not sure if possible/how hard to implement): if indexer fails to index
> unchanged file for x times, stop trying to index it...

It's not too hard. I have been meaning to implement it for some time now. Maybe I'll do it this week.
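The idea from comment 2 could be sketched roughly as follows. This is only an illustration in Python, not the fileindexer's implementation: the `FailureTracker` class, its method names, and the default limit of 3 are all assumptions, and the file's modification time stands in for "unchanged".

```python
class FailureTracker:
    """Hypothetical helper: remembers how often indexing a file has failed."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self._failures = {}  # path -> (mtime at last failure, failure count)

    def record_failure(self, path, mtime):
        last_mtime, count = self._failures.get(path, (mtime, 0))
        if last_mtime != mtime:
            count = 0  # the file changed since the last failure, start over
        self._failures[path] = (mtime, count + 1)

    def should_skip(self, path, mtime):
        entry = self._failures.get(path)
        if entry is None:
            return False
        last_mtime, count = entry
        # Skip only while the file is unchanged and the limit has been reached.
        return last_mtime == mtime and count >= self.max_failures
```

A scheduler would call `should_skip` before starting an indexing job and `record_failure` whenever the job fails. Once the file is modified, its recorded mtime no longer matches and it gets another chance.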
Comment 4 Hrvoje Senjan 2013-05-28 12:07:51 UTC
(In reply to comment #3)
> > > I should also add some code to make sure that faulty indexers do not cause
> > > the indexer to loop forever.
> > Idea(not sure if possible/how hard to implement): if indexer fails to index
> > unchanged file for x times, stop trying to index it...
> 
> It's not too hard. I have been meaning to implement it for some time now.
> Maybe I'll do it this week.
Great! :-)

Also, at least for docx, nfo#pageCount needs adjusting:
<http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#pageCount> has a rdfs:domain of <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument>. <nepomuk:/res/9b4a50a1-7c0b-4fe9-9146-1abf4588c09c> only has the following types <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#FileDataObject>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#DataObject>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#Document>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>
Comment 5 Vishesh Handa 2013-05-28 12:18:50 UTC
Created attachment 80119
Fixes pageCount and wordCount problem

Could you please test? I don't seem to have many msoffice documents.
Comment 6 Hrvoje Senjan 2013-05-28 12:28:09 UTC
Sure, I'll test it ;-)
Comment 7 Hrvoje Senjan 2013-05-28 12:43:12 UTC
Yup, works :-) Just the page count always seems to be 1.
Comment 8 Vishesh Handa 2013-05-28 13:34:44 UTC
You'll need to file a separate bug for that and maybe attach a document I can use to test it out.

Committing the patch above.
Comment 9 Vishesh Handa 2013-05-28 13:58:12 UTC
Git commit fcb3df91a5824b25741042ef0837cf28665deeb3 by Vishesh Handa.
Committed on 28/05/2013 at 15:29.
Pushed by vhanda into branch 'master'.

Office2007Extractor: Only add pageCount and wordCount for documents

The ontologies do not support it for presentations and spreadsheets

M  +17   -14   services/fileindexer/indexer/office2007extractor.cpp

http://commits.kde.org/nepomuk-core/fcb3df91a5824b25741042ef0837cf28665deeb3
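The commit message describes emitting pageCount and wordCount only for documents. The patch itself is not shown in this report, so the following Python sketch only illustrates that kind of guard. The function name `extract_counts` and the property keys are hypothetical, and the real extractor is C++ code in office2007extractor.cpp.

```python
# MIME type of .docx word-processing documents.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
# MIME type of .pptx presentations, as it appears in the log in the description.
PPTX_MIME = "application/vnd.openxmlformats-officedocument.presentationml.presentation"


def extract_counts(mime_type, page_count, word_count):
    """Return count properties only for word-processing documents."""
    if mime_type != DOCX_MIME:
        # Presentations and spreadsheets get no counts, because the ontology
        # does not allow these properties on them.
        return {}
    return {"nfo:pageCount": page_count, "nfo:wordCount": word_count}
```

With such a guard a presentation yields no count properties at all, so the storage service never sees the statement that triggered the domain error.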
Comment 10 Stephan Diestelhorst 2013-08-22 09:19:00 UTC
This still seems to happen with the latest KDE 4.11. Is the patch part of the release? I think skipping a file that the indexer choked on earlier should be implemented, too.
Comment 11 Stephan Diestelhorst 2013-08-22 10:46:52 UTC
(In reply to comment #10)
> This still seems to happen with the latest KDE 4.11, is the patch part of
> the release?  I think the functionality of skipping a file when choking on
> it earlier should be implemented, too.

Just inspected the broken files (one PPTX and one ODT). Both are corrupted, which running the file command on them properly detects. I have tried adding their names to the ignore list in Nepomuksettings, but that does not prevent them from being indexed. Renaming the files to foo.odt.BROKEN and then adding *.BROKEN to the ignore list does not help either.
Comment 12 Christoph Feck 2013-08-29 23:48:30 UTC
Stephan, could you please report a new bug and attach the documents needed to reproduce it there?