Happens with PowerPoint presentations, Word documents, and spreadsheets.

nepomukfileindexer(25541)/nepomuk (strigi service) Nepomuk2::FileIndexingJob::start: Running "/usr/bin/nepomukindexer" "/home/hrvoje/Pictures/fotić/Vilijev doček 5.6.2010/New Microsoft Office PowerPoint Presentation.pptx"
nepomukindexer(3730)/nepomuk (library) Nepomuk2::ResourceManagerPrivate::_k_storageServiceInitialized: Nepomuk Storage service up and initialized.
nepomukindexer(3730)/nepomuk (strigi service) Nepomuk2::Indexer::indexFile: QUrl( "nepomuk:/res/260dce6b-c1d8-4cb5-ab2a-80129f54306a" ) "application/vnd.openxmlformats-officedocument.presentationml.presentation"
nepomukstorage(25513)/nepomuk (storage service) Nepomuk2::DataManagementModel::storeResources: MERGING FAILED!
nepomukstorage(25513)/nepomuk (storage service) Nepomuk2::DataManagementModel::storeResources: Setting error! "Invalid argument (1)": "<http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#wordCount> has a rdfs:domain of <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#TextDocument>. <nepomuk:/res/260dce6b-c1d8-4cb5-ab2a-80129f54306a> only has the following types <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#FileDataObject>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#DataObject>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#Document>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>"

Reproducible: Always

Steps to Reproduce:
Try to index the mentioned file types.

Actual Results:
As noted above.

Expected Results:
Files get indexed, and no looping.
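For readers unfamiliar with the error in the log above: the storage service rejects nfo#wordCount because the resource is not typed as the property's rdfs:domain (nfo#TextDocument). The following is only a minimal sketch of that kind of check, not Nepomuk's actual code. The function name satisfiesDomain and the shortened type names are assumptions made for illustration.

```cpp
#include <set>
#include <string>

// Hypothetical helper, NOT Nepomuk's API: a property may be set on a resource
// only if the resource's types include the property's rdfs:domain.
// Type names are abbreviated stand-ins for the full ontology URIs in the log.
bool satisfiesDomain(const std::set<std::string>& resourceTypes,
                     const std::string& propertyDomain) {
    return resourceTypes.count(propertyDomain) > 0;
}

// Types reported for the .pptx resource in the log (duplicates collapsed).
std::set<std::string> pptxTypesFromLog() {
    return {"nie#InformationElement", "rdfs#Resource", "nfo#FileDataObject",
            "nie#DataObject", "nfo#Document"};
}
```

With the types from the log, a check against "nfo#TextDocument" fails while a check against "nfo#Document" succeeds, which matches the "Invalid argument" error reported for wordCount.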
Confirmed. I know about this. I've brought up the issue on the nepomuk mailing list. We can either change the ontologies or fix the indexer. I should also add some code to make sure that faulty indexers do not cause the indexer to loop forever.
(In reply to comment #1)
> Confirmed.
>
> I know about this. I've brought up the issue on the nepomuk mailing list. We
> can either change the ontologies or fix the indexer.

OK, I guess the ontologies can be changed, as we already need a new release for 4.11.

> I should also add some code to make sure that faulty indexers do not cause
> the indexer to loop forever.

Idea (not sure if it is possible or how hard it is to implement): if the indexer fails to index an unchanged file x times, stop trying to index it...
> > I should also add some code to make sure that faulty indexers do not cause
> > the indexer to loop forever.
> Idea(not sure if possible/how hard to implement): if indexer fails to index
> unchanged file for x times, stop trying to index it...

It's not too hard. I have been meaning to implement it for some time now. Maybe I'll do it this week.
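The retry-limit idea quoted above could be sketched roughly as follows. This is an illustrative assumption, not the code that was later written for Nepomuk: the class name FailureTracker, its methods, and the use of the modification time to decide whether a file is "unchanged" are all hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper, NOT part of Nepomuk: remembers how often indexing
// failed for a file in its current state (identified by modification time).
class FailureTracker {
public:
    explicit FailureTracker(int maxFailures) : m_maxFailures(maxFailures) {}

    // Record one failed indexing attempt for the file as it looked at `mtime`.
    void recordFailure(const std::string& path, std::int64_t mtime) {
        Entry& e = m_entries[path];
        if (e.mtime != mtime) {  // file changed since the last failure: start over
            e.mtime = mtime;
            e.failures = 0;
        }
        ++e.failures;
    }

    // True if the unchanged file has already failed maxFailures times.
    bool shouldSkip(const std::string& path, std::int64_t mtime) const {
        auto it = m_entries.find(path);
        return it != m_entries.end() && it->second.mtime == mtime &&
               it->second.failures >= m_maxFailures;
    }

private:
    struct Entry {
        std::int64_t mtime = -1;  // -1 means "no failure recorded yet"
        int failures = 0;
    };
    int m_maxFailures;
    std::map<std::string, Entry> m_entries;
};
```

An indexing loop would call shouldSkip() before launching the indexer and recordFailure() when it exits with an error. Because the counter resets when the modification time changes, a repaired file is picked up again.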
(In reply to comment #3)
> > > I should also add some code to make sure that faulty indexers do not cause
> > > the indexer to loop forever.
> > Idea(not sure if possible/how hard to implement): if indexer fails to index
> > unchanged file for x times, stop trying to index it...
>
> It's not too hard. I have been meaning to implement it for some time now.
> Maybe I'll do it this week.

Great! :-)

Also, at least for docx, nfo#pageCount needs adjusting:

<http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#pageCount> has a rdfs:domain of <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#PaginatedTextDocument>. <nepomuk:/res/9b4a50a1-7c0b-4fe9-9146-1abf4588c09c> only has the following types <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#FileDataObject>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#DataObject>, <http://www.w3.org/2000/01/rdf-schema#Resource>, <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#Document>, <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#InformationElement>, <http://www.w3.org/2000/01/rdf-schema#Resource>
Created attachment 80119
Fixes pageCount and wordCount problem

Could you please test? I don't seem to have many MS Office documents.
Sure, I'll test it ;-)
Yup, works :-) The only issue is that the page count always seems to be 1.
You'll need to file a separate bug for that and maybe attach a document I can use to test it out. Committing the patch above.
Git commit fcb3df91a5824b25741042ef0837cf28665deeb3 by Vishesh Handa.
Committed on 28/05/2013 at 15:29.
Pushed by vhanda into branch 'master'.

Office2007Extractor: Only add pageCount and wordCount for documents

The ontologies do not support it for presentations and spreadsheets

M +17 -14 services/fileindexer/indexer/office2007extractor.cpp

http://commits.kde.org/nepomuk-core/fcb3df91a5824b25741042ef0837cf28665deeb3
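The approach described in the commit message, adding the count properties only for word-processing documents, might look something like the sketch below. It is not the actual diff to office2007extractor.cpp: the function name countPropertiesFor and the property labels are hypothetical, and only the MIME type strings are the standard Office Open XML ones.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch, NOT the committed code: decide which count properties
// an extractor may emit for a given Office 2007 MIME type.
std::vector<std::string> countPropertiesFor(const std::string& mimeType) {
    static const std::string docx =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // Only word-processing documents get page and word counts; presentations
    // and spreadsheets get none, because the ontologies do not allow these
    // properties on them.
    if (mimeType == docx) {
        return {"nfo:pageCount", "nfo:wordCount"};
    }
    return {};
}
```

For the .pptx MIME type from the original report, this returns an empty list, so no wordCount statement is produced and the domain error cannot be triggered by it.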
This still seems to happen with the latest KDE 4.11. Is the patch part of the release? I think the functionality of skipping a file that the indexer has previously choked on should be implemented, too.
(In reply to comment #10)
> This still seems to happen with the latest KDE 4.11, is the patch part of
> the release? I think the functionality of skipping a file when choking on
> it earlier should be implemented, too.

I just inspected the broken files (one PPTX and one ODT). Both are corrupted, which running the file command on them correctly detects. I tried adding their names to the ignore list in Nepomuksettings, but that does not prevent them from being indexed. Renaming the files to foo.odt.BROKEN and then adding *.BROKEN to the ignore list does not help either.
Stefan, could you please report a new bug and attach the documents to reproduce there?