Bug 353512

Summary: Baloo considers docx, xlsx, pptx as application/zip and doesn't index them
Product: [Frameworks and Libraries] frameworks-baloo
Reporter: Guido <guido.iodice>
Component: general
Assignee: Vishesh Handa <me>
Status: RESOLVED FIXED
Severity: normal
CC: aspotashev, jpdraw, pinak.ahuja, rdieter
Priority: NOR
Version: 5.19.0
Target Milestone: ---
Platform: Arch Linux
OS: Linux
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description Guido 2015-10-03 23:41:30 UTC
[guido@guido-hp Documenti]$ balooshow -x *.docx
68695012602284036 2052 15994304 /home/guido/Documenti/cancella2.docx

Internal Info
Terms: Mapplication Mzip T1 
File Name Terms: Fcancella2 Fdocx cancella2 docx 
XAttr Terms: 

68706239646795780 2052 15996918 /home/guido/Documenti/cancella3.docx

Internal Info
Terms: Mapplication Mzip T1 
File Name Terms: Fcancella3 Fdocx cancella3 docx 
XAttr Terms: 

68682853549869060 2052 15991473 /home/guido/Documenti/demo.docx

Internal Info
Terms: Mapplication Mzip T1 
File Name Terms: Fdemo Fdocx demo docx 
XAttr Terms: 


Reproducible: Always
Comment 1 Vishesh Handa 2015-10-05 08:37:18 UTC
Confirmed :(
Comment 2 Guido 2015-10-11 14:33:45 UTC
Baloo 5.15 doesn't solve the bug.

I can add this:

1. baloo_file properly recognizes them as openxml (or application/wps office if I install WPS Office) files, but baloo_file_extractor recognizes them as zip (it doesn't matter whether the docx was created by WPS, LibreOffice or MS Word).
2. baloo_file_extractor recognizes .doc files created by WPS Office as MS OLE files, while it properly recognizes .doc files created by LibreOffice as Word documents.
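
For context on point 1: a .docx file is a ZIP container, so any check that looks only at the leading bytes will see a ZIP signature. The following is a minimal Python sketch of that fact, not Baloo code; `make_docx_like_bytes` and `looks_like_zip` are hypothetical helpers written for this illustration, and the archive built here only imitates the layout of a real document.

```python
import io
import zipfile

# Signature that starts a ZIP local file header.
ZIP_MAGIC = b"PK\x03\x04"


def make_docx_like_bytes() -> bytes:
    """Build an in-memory ZIP whose entries imitate a docx layout (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


def looks_like_zip(data: bytes) -> bool:
    """Content-only check: true when the data starts with the ZIP signature."""
    return data.startswith(ZIP_MAGIC)


if __name__ == "__main__":
    data = make_docx_like_bytes()
    # The leading bytes carry no hint that this is a word-processing document.
    print(data[:4], looks_like_zip(data))
```

A detector that stops at this check has no way to tell a docx from any other ZIP archive, which matches the `Mapplication Mzip` terms in the balooshow output above.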
Comment 3 Guido 2015-11-17 21:29:43 UTC
Baloo 5.16 doesn't solve the problem.
Comment 4 Guido 2015-12-21 00:26:37 UTC
Baloo 5.17 doesn't solve the bug.
Comment 5 Guido 2016-01-10 23:18:34 UTC
Baloo 5.18 doesn't solve the bug.
Comment 6 Guido 2016-02-16 23:27:00 UTC
Baloo 5.19 doesn't solve the bug.
Comment 7 jpdraw 2016-03-07 07:54:34 UTC
Same here, although I'm not sure the file is really not indexed. When I add a new docx file, balooctl status shows that the new file was indexed, but a later search does not find it.

Txt files and Doc files (after adding catdoc) work well.

Please fix this. It is extremely important to be able to search among docx files (75% of the files I receive are in this format).

Thanks

Jose
Comment 8 Pinak Ahuja 2016-03-13 09:56:48 UTC
Git commit f7a045919c091925314b7ab3125c575884792048 by Pinak Ahuja.
Committed on 13/03/2016 at 09:53.
Pushed by pinakahuja into branch 'master'.

Check both, filename and filecontent to determine mimetype

Checking only filecontent is not enough for proper mimetypes
and can lead to strange mimetypes which mess up our content
indexing.
Reviewed by: Bhushan Shah <bshah@kde.org>, Boudhayan Gupta <bgupta@kde.org>

M  +1    -1    src/file/extractor/app.cpp
M  +1    -1    src/tools/balooctl/indexer.cpp
M  +1    -1    tests/file/indexerconfigtest.cpp

http://commits.kde.org/baloo/f7a045919c091925314b7ab3125c575884792048
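
The idea in the commit message, using both the file name and the file content, can be sketched roughly as below. This is not the code from the commit (the diff is not reproduced in this report) and it does not claim to mirror how Baloo or Qt resolve MIME types. `detect_mime`, the small lookup tables, and the rule "trust the name when its type is a known specialisation of the sniffed container type" are all assumptions made for illustration.

```python
import os

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
ZIP = "application/zip"
PDF = "application/pdf"
UNKNOWN = "application/octet-stream"

# Hypothetical, deliberately tiny tables; a real MIME database is far larger.
BY_EXTENSION = {".docx": DOCX, ".xlsx": XLSX, ".pptx": PPTX, ".zip": ZIP, ".pdf": PDF}
BY_MAGIC = {b"PK\x03\x04": ZIP, b"%PDF-": PDF}
PARENT = {DOCX: ZIP, XLSX: ZIP, PPTX: ZIP}  # types that are ZIP containers


def by_content(data: bytes) -> str:
    """Guess a type from the leading bytes only."""
    for magic, mime in BY_MAGIC.items():
        if data.startswith(magic):
            return mime
    return UNKNOWN


def by_name(name: str) -> str:
    """Guess a type from the file extension only."""
    ext = os.path.splitext(name)[1].lower()
    return BY_EXTENSION.get(ext, UNKNOWN)


def detect_mime(name: str, data: bytes) -> str:
    """Combine both guesses (assumed rule, see lead-in)."""
    sniffed = by_content(data)
    named = by_name(name)
    if named == UNKNOWN:
        return sniffed
    # The name wins when the content agrees or only identifies the container.
    if sniffed in (UNKNOWN, named) or PARENT.get(named) == sniffed:
        return named
    # Name and content contradict each other: fall back to the content.
    return sniffed


if __name__ == "__main__":
    print(detect_mime("demo.docx", b"PK\x03\x04..."))
    print(detect_mime("demo.docx", b"%PDF-1.4"))
```

With a rule like this, the three files from the description would be classified by their `.docx` extension instead of ending up as plain `application/zip`, while a file whose content clearly contradicts its name is still typed by content.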