Bug 427453 - DOCX content indexing not working
Summary: DOCX content indexing not working
Status: RESOLVED NOT A BUG
Alias: None
Product: frameworks-kfilemetadata
Classification: Frameworks and Libraries
Component: general
Version: 5.74.0
Platform: Arch Linux Linux
Importance: NOR minor
Target Milestone: ---
Assignee: Stefan Brüns
Reported: 2020-10-08 15:20 UTC by Buovjaga
Modified: 2020-10-12 09:15 UTC

Attachments
Example DOCX file (4.17 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2020-10-11 15:04 UTC, Buovjaga

Description Buovjaga 2020-10-08 15:20:06 UTC
SUMMARY
Verifiably indexed DOCX files do not yield content search results

STEPS TO REPRODUCE
1. Create a DOCX file in some location that is indexed by Baloo
2. Write some unique string in the DOCX file
3. Index the file:
balooctl index /path/to/file.docx
4. Open KFind and in the content tab, search for the unique string

Observed on two different Arch Linux systems. There is also a forum topic about this from last year: https://forum.kde.org/viewtopic.php?f=154&t=161505

KDE Plasma Version: 5.19.5
KDE Frameworks Version: 5.74
Qt Version: 5.15.1
Comment 1 Wolfgang Bauer 2020-10-11 07:16:02 UTC
That's not really a bug I think.

KFind doesn't use baloo's search index (it predates baloo by far).

AFAIK, it doesn't have special support for certain file formats either, it basically does the same as the "grep" tool, i.e. search for the text verbatim in the file.
And as a DOCX file is compressed as ZIP, no text can be found of course.

So maybe this can be seen as an enhancement request to support content search via baloo. I have no idea if that would fit into kfind's design though.
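
To illustrate the point about verbatim search, here is a minimal Python sketch. It is not KFind's code, and the XML is a made-up stand-in for a real DOCX body. It packs some text into a deflate-compressed ZIP member and compares a grep-style scan of the stored bytes with a scan of the decompressed member.

```python
import io
import zipfile

NEEDLE = b"superduperuniquestring"

# Hypothetical stand-in for a DOCX body: a few XML paragraphs, one holding the needle.
xml = b"<w:document><w:body>" + b"".join(
    b"<w:p><w:r><w:t>" + text + b"</w:t></w:r></w:p>"
    for text in [b"first paragraph", NEEDLE, b"last paragraph"]
) + b"</w:body></w:document>"

# Pack it the way a DOCX stores its body: as a deflate-compressed ZIP member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("word/document.xml", xml)
raw = buf.getvalue()

# A grep-style scan only sees the bytes as they are stored on disk.
found_raw = NEEDLE in raw

# A format-aware reader decompresses the member before searching.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    found_unpacked = NEEDLE in zf.read("word/document.xml")

print("raw bytes:", found_raw, "| decompressed member:", found_unpacked)
```

The string is only visible once the member has been decompressed, which is why a plain byte search over the file comes up empty.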
Comment 2 Buovjaga 2020-10-11 07:25:38 UTC
(In reply to Wolfgang Bauer from comment #1)
> That's not really a bug I think.
> 
> KFind doesn't use baloo's search index (it predates baloo by far).
> 
> AFAIK, it doesn't have special support for certain file formats either, it
> basically does the same as the "grep" tool, i.e. search for the text
> verbatim in the file.
> And as a DOCX file is compressed as ZIP, no text can be found of course.

Oh, that is surprising to hear. It does find text in ODF documents, which are compressed as ZIP as well.

Does Dolphin search use Baloo's index? It doesn't work either.
Comment 3 Buovjaga 2020-10-11 07:54:49 UTC
(In reply to Buovjaga from comment #2)
> Does Dolphin search use Baloo's index? It doesn't work either.

Yes, it uses Baloo: https://userbase.kde.org/Dolphin

I would rather change this to be about Baloo, sorry for the noise.
Comment 4 Buovjaga 2020-10-11 08:30:35 UTC
On second thought, I am closing this. I opened this to help someone else, but it seems Dolphin's content search is only broken on my system. Apparently the only problem on the original reporter's system was KFind, which, as we have now learned, should not even work with zipped files (although for some reason it does work with ODT on the reporter's system).
Comment 5 Stefan Brüns 2020-10-11 13:00:11 UTC
Dolphin uses baloo, baloo uses kfilemetadata, and kfilemetadata supports ODF and DOCX files.

Zip compression is supported when it is part of the file format itself, as is the case for the OpenDocument and Microsoft Office formats. Other archives (zip or e.g. any tar.*) are not extracted.

As the generic structure of both is very similar (zip file + some XML), it is strange that one works and the other does not.

Please provide one of the files which does not work, if possible.
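
For reference, the "zip file + some XML" structure can be sketched in a few lines of Python. This is only an illustration and not how kfilemetadata implements its extractor. The helper names extract_docx_text and make_sample are hypothetical, and the sample holds only the word/document.xml member, so it is not a complete DOCX.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(source):
    """Return the text of a DOCX body, one line per paragraph (hypothetical helper)."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph is split into runs; join the text of all <w:t> elements.
        runs = [t.text or "" for t in p.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample():
    """Build an in-memory ZIP with just word/document.xml (not a full DOCX)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>superduperuniquestring</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", body)
    buf.seek(0)
    return buf


print(extract_docx_text(make_sample()))
```

The same function accepts a path, e.g. extract_docx_text("/path/to/file.docx"), which gives a quick way to confirm that a given file contains extractable text at all.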
Comment 6 Stefan Brüns 2020-10-11 14:22:45 UTC
An example DOCX file is required to reproduce the issue.
Comment 7 Buovjaga 2020-10-11 15:04:19 UTC
Created attachment 132276
Example DOCX file

Here it is. Any ideas on how I could check why it does not work on my system but works on the other person's system?
Comment 8 Stefan Brüns 2020-10-11 22:10:00 UTC
KFM has no problem with the file, and baloo on my system has no problem finding it.

1. Check if any data can be extracted from the file:
  a) dolphin, information panel (F11) should show "words" and "pages"
  b) dolphin -> properties -> details

2. Check if baloo has stored the file information:
$> balooshow -x path/to/file
Comment 9 Buovjaga 2020-10-12 05:32:30 UTC
(In reply to Stefan Brüns from comment #8)
> KFM has no problem with the file, and baloo on my system has no problem
> finding it.
> 
> 1. Check if any data can be extracted from the file:
>   a) dolphin, information panel (F11) should show "words" and "pages"
>   b) dolphin -> properties -> details
> 
> 2. Check if baloo has stored the file information:
> $> balooshow -x path/to/file

Dolphin's info is showing the word and page count properly.

balooshow gives this:

Internal Info
Terms: Mapplication Mdocument Mofficedocument Mopenxmlformats Mvnd Mwordprocessingml T5 
File Name Terms: Fbalooindextest Fdocx 
XAttr Terms: 

Should the 'superduperuniquestring' appear there?
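
A rough way to automate that check, as a Python sketch: it assumes that words from the document body would show up on the "Terms:" line once content indexing has run, and that the "Label: terms" layout pasted above is stable. Neither is confirmed here, and both function names are hypothetical.

```python
def parse_balooshow_terms(output):
    """Map each 'Label: term term ...' line of balooshow -x output to its terms."""
    sections = {}
    for line in output.splitlines():
        label, sep, rest = line.partition(":")
        if sep:
            sections[label.strip()] = rest.split()
    return sections


def has_content_term(output, needle):
    """True if the needle occurs in any entry of the 'Terms' line (assumed layout)."""
    terms = parse_balooshow_terms(output).get("Terms", [])
    return any(needle.lower() in term.lower() for term in terms)


# Output pasted from the comment above.
SAMPLE = """Internal Info
Terms: Mapplication Mdocument Mofficedocument Mopenxmlformats Mvnd Mwordprocessingml T5
File Name Terms: Fbalooindextest Fdocx
XAttr Terms:
"""

print(has_content_term(SAMPLE, "superduperuniquestring"))
```

For the output above the check finds no match for the unique string.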
Comment 10 Stefan Brüns 2020-10-12 09:07:25 UTC
That's just basic indexing information. Seems like the content indexer never ran. What's the output of:
$> balooctl status <path/to/file>
Comment 11 Buovjaga 2020-10-12 09:15:03 UTC
(In reply to Stefan Brüns from comment #10)
> That's just basic indexing information. Seems like the content indexer never
> ran. What's the output of:
> $> balooctl status <path/to/file>

It was indexed. Now I tried it again in a directory with fewer files and Dolphin was able to find it. Maybe it was just taking too long to run :( Sorry for the noise again.