Bug 438455 - KFileMetadata does not support some Microsoft Office .doc file versions
Status: RESOLVED UPSTREAM
Alias: None
Product: frameworks-kfilemetadata
Classification: Frameworks and Libraries
Component: general
Version: 5.82.0
Platform: Fedora RPMs Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Pinak Ahuja
 
Reported: 2021-06-11 09:01 UTC by skierpage
Modified: 2023-11-11 02:03 UTC



Attachments
baloo test .doc (Libreoffice) (12.00 KB, application/wps-office.doc)
2022-11-21 10:09 UTC, Guido
baloo test .doc WPS office (13.00 KB, application/wps-office.doc)
2022-11-21 10:10 UTC, Guido
powerpoint by WPS (253.00 KB, application/wps-office.ppt)
2022-11-21 10:13 UTC, Guido
powerpoint by libreoffice (449.50 KB, application/wps-office.ppt)
2022-11-21 10:14 UTC, Guido
xls by libreoffice (5.50 KB, application/wps-office.xls)
2022-11-21 10:20 UTC, Guido
xls by WPS office (15.00 KB, application/wps-office.xls)
2022-11-21 10:20 UTC, Guido
Override.xml file to sidestep the .xls and .ppt baloo indexing issues. (7.18 KB, text/xml)
2022-11-21 22:46 UTC, tagwerk19

Description skierpage 2021-06-11 09:01:23 UTC
SUMMARY
`baloosearch` couldn't locate a word processing file with a term in it. It was a .doc file, not .docx or .odt.

STEPS TO REPRODUCE
1. In LibreOffice Writer, create a document containing just "baloopleaseindexme"
2. File > Save As in Word 97-2003 format as baloo_indexing_test.doc in some directory that Baloo indexes.
3. In a terminal, run `baloosearch baloopleaseindexme`
4. In a terminal, run `balooshow -x /path/to/baloo_indexing_test.doc`

OBSERVED RESULT
The document contents aren't indexed, so baloosearch for the content fails.
balooshow doesn't list any words in the document, just
  Terms: Mapplication Mmsword T5 X19-0 X20-0


EXPECTED RESULT
baloo should index these files as it does .odt and .docx files.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: 
KDE Plasma Version: 5.21.5
KDE Frameworks Version: 5.82.0
Qt Version: 5.15.2 on Wayland

ADDITIONAL INFORMATION
There are tools to extract text from MSOffice files, e.g.
  % flatpak run org.libreoffice.LibreOffice --invisible --convert-to txt --outdir /tmp/ /path/to/baloo_indexing_test.doc
will convert a .doc file to .txt. The TDF/Document Liberation Project also offers introspection tools such as mso-dumper's doc-dump, which dumps the file in an unusual XML format.

In the interim this limitation should be mentioned somewhere, but I can't see where Baloo describes the file types whose content it does index.

I don't know if Baloo indexes the contents of other MS Office 1990-2000 formats. Again, I shouldn't have to create test files to find out; known limitations should be documented.
Comment 1 skierpage 2021-06-11 09:30:40 UTC
Does Baloo use KFileMetaData extractors?

https://invent.kde.org/frameworks/kfilemetadata/-/blob/master/src/extractors/officeextractor.cpp#L20 suggests that KFileMetaData relies on the external programs catdoc for application/msword, xls2csv for application/vnd.ms-excel, and catppt for application/vnd.ms-powerpoint. I have catdoc (and the others) installed, yet these .doc files didn't get indexed.

Maybe if the programs baloo_file_extractor and baloo_filemetadata_temp_extractor were documented, I could run them by hand and figure out what's going on.
Comment 2 skierpage 2021-06-12 09:06:47 UTC
So it turns out Baloo did and can index contents of other .doc files, e.g. external .doc files I received in 2016 and earlier, and `catdoc` displays their contents; but catdoc doesn't display anything for the contents of the recent .doc file I received or the .doc file generated by LibreOffice 7.1.3.2 that Baloo doesn't index. I couldn't find any Linux utility that identifies the version of the Word file format that a .doc file uses, or whether it's been saved with Word's "Fast Save" feature. The two failing documents contain the string "Microsoft Word-Dokument" near the front, whereas the working ones contain "Microsoft Word 9.0" or "Microsoft Word 97-2004 Document" near the end.

So the problem here seems to be with KFileMetaData and its use of catdoc. I couldn't find a bug that catdoc doesn't support some Word file formats; its maintainer's CVStrac is dead, the most active bug list seems to be Debian's bug tracker.
Comment 3 Guido 2022-11-18 18:43:24 UTC
The bug is still present in Frameworks 5.100.
Comment 4 Guido 2022-11-20 11:28:11 UTC
If the problem is catdoc, antiword is a good alternative.
Comment 5 Stefan Brüns 2022-11-21 01:44:50 UTC
Baloo uses kfilemetadata, and it clearly states it.

Without an example file, this is not reproducible, and nothing can be done to enhance the file type support.
Comment 6 Stefan Brüns 2022-11-21 01:48:06 UTC
Pinak has been inactive for years. Default assignee is broken.
Comment 7 tagwerk19 2022-11-21 08:37:15 UTC
(In reply to skierpage from comment #0)
> STEPS TO REPRODUCE
> 1. In LibreOffice Writer, create a document containing just
> "baloopleaseindexme"
> 2. File > Save As in Word 97-2003 format as baloo_indexing_test.doc in some
> directory that Baloo indexes.
> 3. In a terminal, run `baloosearch baloopleaseindexme`
> 4. In a terminal, run `balooshow -x /path/to/baloo_indexing_test.doc
Maybe LibreOffice Writer has been fixed, I've just followed the steps with

    Version: 7.3.7.2 / LibreOffice Community

on Neon testing, and I get:

    $ balooshow -x baloo_indexing_test.doc
    1437d40000fc01 64513 1325012 baloo_indexing_test.doc [/home/test/testfiles/baloo_indexing_test.doc]
            Mtime: 1669018797 2022-11-21T09:19:57
            Ctime: 1669018974 2022-11-21T09:22:54
            Cached properties:
                    Word Count: 1
                    Line Count: 1

    Internal Info
    Terms: Mapplication Mmsword T5 X19-1 X20-1 baloopleaseindexme
    File Name Terms: Fbaloo Fdoc Findexing Ftest
    XAttr Terms:
    lineCount: 1
    wordCount: 1

    $ baloosearch baloopleaseindexme
    /home/test/testfiles/baloo_indexing_test.doc
    Elapsed: 0.25022 msecs

I can probably look back at earlier releases and see if the behaviour has changed. Likely to be somewhat hit or miss though :-/
Comment 8 Guido 2022-11-21 10:02:59 UTC
(In reply to tagwerk19 from comment #7)
> (In reply to skierpage from comment #0)
> > STEPS TO REPRODUCE
> > 1. In LibreOffice Writer, create a document containing just
> > "baloopleaseindexme"
> > 2. File > Save As in Word 97-2003 format as baloo_indexing_test.doc in some
> > directory that Baloo indexes.
> > 3. In a terminal, run `baloosearch baloopleaseindexme`
> > 4. In a terminal, run `balooshow -x /path/to/baloo_indexing_test.doc
> Maybe LibreOffice Writer has been fixed, I've just followed the steps with
> 
>     Version: 7.3.7.2 / LibreOffice Community
> 
> on Neon testing, and I get:
> 
>     $ balooshow -x baloo_indexing_test.doc
>     1437d40000fc01 64513 1325012 baloo_indexing_test.doc
> [/home/test/testfiles/baloo_indexing_test.doc]
>             Mtime: 1669018797 2022-11-21T09:19:57
>             Ctime: 1669018974 2022-11-21T09:22:54
>             Cached properties:
>                     Word Count: 1
>                     Line Count: 1
> 
>     Internal Info
>     Terms: Mapplication Mmsword T5 X19-1 X20-1 baloopleaseindexme
>     File Name Terms: Fbaloo Fdoc Findexing Ftest
>     XAttr Terms:
>     lineCount: 1
>     wordCount: 1
> 
>     $ baloosearch baloopleaseindexme
>     /home/test/testfiles/baloo_indexing_test.doc
>     Elapsed: 0.25022 msecs
> 
> I can probably look back at earlier releases and see if the behaviour has
> changed. Likely to be somewhat hit or miss though :-/

No, it should also show the indexed content (keywords).
Comment 9 Guido 2022-11-21 10:09:29 UTC
Created attachment 153915 [details]
baloo test .doc (Libreoffice)
Comment 10 Guido 2022-11-21 10:10:05 UTC
Created attachment 153916 [details]
baloo test .doc WPS office
Comment 11 Guido 2022-11-21 10:13:47 UTC
Created attachment 153917 [details]
powerpoint by WPS
Comment 12 Guido 2022-11-21 10:14:10 UTC
Created attachment 153918 [details]
powerpoint by libreoffice
Comment 13 Guido 2022-11-21 10:19:36 UTC
I attached some files (doc, ppt, xls), from both LibreOffice and WPS.
Their contents are not indexed by Baloo.
Comment 14 Guido 2022-11-21 10:20:13 UTC
Created attachment 153919 [details]
xls by libreoffice
Comment 15 Guido 2022-11-21 10:20:30 UTC
Created attachment 153920 [details]
xls by WPS office
Comment 16 tagwerk19 2022-11-21 17:12:24 UTC
(In reply to Guido from comment #8)
> ... it should show also the content (keywords) indexed.
What I see with "balooshow -x" is:
>     Terms: Mapplication Mmsword T5 X19-1 X20-1 baloopleaseindexme
Where the "baloopleaseindexme" is the content.

... I think things are working here
Comment 17 Guido 2022-11-21 17:15:21 UTC
(In reply to tagwerk19 from comment #16)
> (In reply to Guido from comment #8)
> > ... it should show also the content (keywords) indexed.
> What I see with "balooshow -x" is:
> >     Terms: Mapplication Mmsword T5 X19-1 X20-1 baloopleaseindexme
> Where the "baloopleaseindexme" is the content.
> 
> ... I think things are working here

Can you upload your file? I would like to test it.
Comment 18 tagwerk19 2022-11-21 17:36:34 UTC
(In reply to Guido from comment #9)
> Created attachment 153915 [details]
> baloo test .doc (Libreoffice)
This is the 
    baloo_test_Libreoffice_7.4.2.3.doc
file and...

(In reply to Guido from comment #10)
> Created attachment 153916 [details]
> baloo test .doc WPS office
This is the 
    baloo_test_WPS_Office.doc
file

Start with checking mime types...

    $ kmimetypefinder baloo_test_Libreoffice_7.4.2.3.doc
    application/msword
    $ kmimetypefinder baloo_test_WPS_Office.doc
    application/msword

Both are "thought of" as MS word files....

If I set up debugging and move the two files to an indexed folder, I see:

    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.baloo: Folder cache: std::vector("/home/test/testfiles/": included)
    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.baloo: Indexing 5660354579332097 "/home/test/testfiles/baloo_test_Libreoffice_7.4.2.3.doc" "application/msword"
    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.filemetadata: Fetching extractors for "application/msword"
    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.baloo: Indexing 5674107064613889 "/home/test/testfiles/baloo_test_WPS_Office.doc" "application/msword"
    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.filemetadata: Fetching extractors for "application/msword"

and "balooshow -x" for each gives me:

    $ balooshow -x baloo_test_Libreoffice_7.4.2.3.doc
    141c100000fc01 64513 1317904 baloo_test_Libreoffice_7.4.2.3.doc [/home/test/testfiles/baloo_test_Libreoffice_7.4.2.3.doc]
            Mtime: 1669049328 2022-11-21T17:48:48
            Ctime: 1669049328 2022-11-21T17:48:48
            Cached properties:
                    Word Count: 84
                    Line Count: 4

    Internal Info
    Terms: 14 2022 5 5.0 5.100.0 83 Mapplication Mmsword T5 X19-84 X20-4 a addon an and announcement announcements announces are available commonly developers for frameworks friendly functionality https hyperlink improvements in introduction is kde libraries licensing making manner mature monday monthly needed november of org part peer planned predictable provide qt quick release releases reviewed see series terms tested the this to today variety well which wide with �
    File Name Terms: F7.4.2.3 Fbaloo Fdoc Flibreoffice Ftest
    XAttr Terms:
    wordCount: 84
    lineCount: 4

    $ balooshow -x baloo_test_WPS_Office.doc
    1428920000fc01 64513 1321106 baloo_test_WPS_Office.doc [/home/test/testfiles/baloo_test_WPS_Office.doc]
            Mtime: 1669049328 2022-11-21T17:48:48
            Ctime: 1669049328 2022-11-21T17:48:48
            Cached properties:
                    Word Count: 85
                    Line Count: 4

    Internal Info
    Terms: 14 2022 5 5.0 5.100.0 83 Mapplication Mmsword T5 X19-85 X20-4 a addon an and announcement announcements announces are available commonly developers for frameworks friendly functionality h https hyperlink improvements in introduction is kde libraries licensing making manner mature monday monthly needed november of org part peer planned predictable provide qt quick release releases reviewed see series terms tested the this to today variety well which wide with �
    File Name Terms: Fbaloo Fdoc Foffice Ftest Fwps
    XAttr Terms:
    wordCount: 85
    lineCount: 4

Again, it seems that this is OK.

I checked on a Neon Testing system with LibreOffice installed, presumably the LibreOffice from 22.04.

... That's the good news.
Comment 19 Guido 2022-11-21 17:45:05 UTC
Interestingly enough, on my system kmimetypefinder sees all of these files as WPS Office files.
I will try removing the WPS mimetypes, or WPS itself.
Comment 20 tagwerk19 2022-11-21 17:50:54 UTC
(In reply to Guido from comment #11)
> Created attachment 153917 [details]
> powerpoint by WPS
That's the
    baloo_test_WPS.ppt
file....

(In reply to Guido from comment #12)
> Created attachment 153918 [details]
> powerpoint by libreoffice
... and the
    baloo_test_libreoffice.ppt

Again, try the mime types...

    $ kmimetypefinder baloo_test_WPS.ppt
    application/vnd.ms-powerpoint
    $ kmimetypefinder baloo_test_libreoffice.ppt
    application/vnd.ms-powerpoint

which look OK to an untutored eye. However for some reason baloo picks a more generic mimetype...

    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.baloo: Indexing 5664426208328705 "/home/test/testfiles/baloo_test_WPS.ppt" "application/x-ole-storage"
    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.filemetadata: No extractor for "application/x-ole-storage"
    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.baloo: Indexing 5691149494844417 "/home/test/testfiles/baloo_test_libreoffice.ppt" "application/x-ole-storage"
    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.filemetadata: No extractor for "application/x-ole-storage"

... and balooshow shows the "application/x-ole-storage" mimetype, not the content

    $ balooshow -x baloo_test_WPS.ppt
    141fc40000fc01 64513 1318852 baloo_test_WPS.ppt [/home/test/testfiles/baloo_test_WPS.ppt]
            Mtime: 1669049328 2022-11-21T17:48:48
            Ctime: 1669049328 2022-11-21T17:48:48

    Internal Info
    Terms: Mapplication Mole Mstorage Mx
    File Name Terms: Fbaloo Fppt Ftest Fwps
    XAttr Terms:

    $ balooshow -x baloo_test_libreoffice.ppt
    1438120000fc01 64513 1325074 baloo_test_libreoffice.ppt [/home/test/testfiles/baloo_test_libreoffice.ppt]
            Mtime: 1669049328 2022-11-21T17:48:48
            Ctime: 1669049328 2022-11-21T17:48:48

    Internal Info
    Terms: Mapplication Mole Mstorage Mx
    File Name Terms: Fbaloo Flibreoffice Fppt Ftest
    XAttr Terms:
Comment 21 Guido 2022-11-21 18:06:49 UTC
ok, I removed the WPS mimetypes and now


> kmimetypefinder '/run/media/guido/nvme1/baloo test/baloo_test_Libreoffice_7.4.2.3.doc'
application/msword

nevertheless baloo doesn't index it:

balooshow -x baloo_test_Libreoffice_7.4.2.3.doc
6d59800010305 66309 447896 baloo_test_Libreoffice_7.4.2.3.doc [/run/media/guido/nvme1/baloo test/baloo_test_Libreoffice_7.4.2.3.doc]
        Mtime: 1669025062 2022-11-21T11:04:22
        Ctime: 1669053488 2022-11-21T18:58:08
        Cached properties:
                Word Count: 0
                Line Count: 0

Internal Info
Terms: Mapplication Mmsword T5 X19-0 X20-0 
File Name Terms: F7.4.2.3 Fbaloo Fdoc Flibreoffice Ftest 
XAttr Terms: 
lineCount: 0
wordCount: 0
Comment 22 tagwerk19 2022-11-21 18:12:55 UTC
(In reply to Guido from comment #14)
> Created attachment 153919 [details]
> xls by libreoffice
This is the
    baloo_test_libreoffice.xls
file...

(In reply to Guido from comment #15)
> Created attachment 153920 [details]
> xls by WPS office
... and the
    baloo_test_wps.xls

The mimetypes are...

    $ kmimetypefinder baloo_test_libreoffice.xls
    application/vnd.ms-excel
    $ kmimetypefinder baloo_test_wps.xls
    application/vnd.ms-excel

but, as with the .ppt files above, baloo treats the files as "application/x-ole-storage" and does not find an extractor for them:

    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.baloo: Indexing 5691454437522433 "/home/test/testfiles/baloo_test_libreoffice.xls" "application/x-ole-storage"
    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.filemetadata: No extractor for "application/x-ole-storage"
    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.baloo: Indexing 5691463027457025 "/home/test/testfiles/baloo_test_wps.xls" "application/x-ole-storage"
    Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.filemetadata: No extractor for "application/x-ole-storage"
                                                                                                                         
With the "balooshow -x" results....

    $ balooshow -x baloo_test_libreoffice.xls
    1438590000fc01 64513 1325145 baloo_test_libreoffice.xls [/home/test/testfiles/baloo_test_libreoffice.xls]
            Mtime: 1669049328 2022-11-21T17:48:48
            Ctime: 1669049328 2022-11-21T17:48:48

    Internal Info
    Terms: Mapplication Mole Mstorage Mx
    File Name Terms: Fbaloo Flibreoffice Ftest Fxls
    XAttr Terms:

    $ balooshow -x baloo_test_wps.xls
    14385b0000fc01 64513 1325147 baloo_test_wps.xls [/home/test/testfiles/baloo_test_wps.xls]
            Mtime: 1669049991 2022-11-21T17:59:51
            Ctime: 1669049991 2022-11-21T17:59:51

    Internal Info
    Terms: Mapplication Mole Mstorage Mx
    File Name Terms: Fbaloo Ftest Fwps Fxls
    XAttr Terms:

It's possible to get kmimetypefinder to consider "just" the filename or "just" the content:

    $ kmimetypefinder -f baloo_test_libreoffice.xls
    application/vnd.ms-excel
    $ kmimetypefinder -c baloo_test_libreoffice.xls
    application/x-ole-storage

which suggests some confusion with priorities and "magic" in the mimetype database.
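
To see why the content-only lookup lands on "application/x-ole-storage", it may help to check the file's first bytes directly. The sketch below tests for the 8-byte OLE2 compound-file signature (D0 CF 11 E0 A1 B1 1A E1), which legacy .doc, .xls and .ppt files all share. The function name is_ole_storage is made up for this illustration and is not part of any KDE tool.

```sh
# Succeeds (exit status 0) if the file named in $1 starts with the 8-byte
# OLE2 compound-file signature D0 CF 11 E0 A1 B1 1A E1.
# "is_ole_storage" is a hypothetical helper name, not an existing command.
is_ole_storage() {
    # Dump the first 8 bytes as hex and strip the spacing od adds.
    sig=$(od -An -tx1 -N8 "$1" | tr -d ' \n')
    [ "$sig" = "d0cf11e0a1b11ae1" ]
}
```

For example, `is_ole_storage baloo_test_libreoffice.xls && echo "OLE container"` should print the message for a real .xls file, while a plain text file renamed to .xls should not match.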
Comment 23 Stefan Brüns 2022-11-21 18:20:46 UTC
(In reply to tagwerk19 from comment #20)

>     $ kmimetypefinder baloo_test_libreoffice.ppt
>     application/vnd.ms-powerpoint

This only checks the filename:

$> echo "This is not a powerpoint document" > /tmp/foo.ppt
$> kmimetypefinder /tmp/foo.ppt 
application/vnd.ms-powerpoint
 
> which look OK to an untutored eye. However for some reason baloo picks a
> more generic mimetype...
> 
>     Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.baloo: Indexing
> 5664426208328705 "/home/test/testfiles/baloo_test_WPS.ppt"
> "application/x-ole-storage"
>     Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.filemetadata: No
> extractor for "application/x-ole-storage"
>     Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.baloo: Indexing
> 5691149494844417 "/home/test/testfiles/baloo_test_libreoffice.ppt"
> "application/x-ole-storage"
>     Nov 21 17:59:51 testmc baloo_file_extractor[3120]: kf.filemetadata: No
> extractor for "application/x-ole-storage"
> 
> ... and balooshow shows the "application/x-ole-storage" mimetype, not the
> content

Bug in shared mime info, https://gitlab.freedesktop.org/xdg/shared-mime-info/

/usr/share/mime/packages/freedesktop.org.xml has :
  <mime-type type="application/msword">
    <sub-class-of type="application/x-ole-storage"/>

but application/vnd.ms-powerpoint has no sub-class-of. Ditto for e.g. Access and Excel documents.
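
A rough way to check a given package file for this is sketched below. It scans the `<mime-type>` block for one type and reports whether it declares application/x-ole-storage as a parent. The helper name has_ole_parent is hypothetical, and the line-based matching assumes the opening tag, the sub-class-of element and the closing tag each sit on their own line, as in the excerpt above; it is not a real XML parser.

```sh
# Print "yes" if the <mime-type> block for type $2 in package file $1 has
# <sub-class-of type="application/x-ole-storage"/>, otherwise print "no".
# "has_ole_parent" is a hypothetical helper; matching is line-based only.
has_ole_parent() {
    awk -v mt="$2" '
        index($0, "<mime-type type=\"" mt "\"") { inblock = 1 }
        inblock && /<sub-class-of type="application\/x-ole-storage"\/>/ { found = 1 }
        inblock && /<\/mime-type>/ { inblock = 0 }
        END { print (found ? "yes" : "no") }
    ' "$1"
}
```

Usage would look like `has_ole_parent /usr/share/mime/packages/freedesktop.org.xml application/vnd.ms-powerpoint`.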
Comment 24 tagwerk19 2022-11-21 18:21:52 UTC
(In reply to Guido from comment #21)
> ok, I removed the WPS mimetypes and now ...
> ...
> lineCount: 0
> wordCount: 0
That doesn't look right somehow...

I have enabled debugging by creating a file
    ~/.config/QtProject/qtlogging.ini
containing
    [Rules]
    kf.filemetadata=true
    kf.baloo=true

and checked with journalctl for debug output, maybe you see something there...
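
The setup above could be scripted roughly as follows. This is only a sketch: the function name write_baloo_logging_rules is made up, and it overwrites whatever file it is pointed at, so back up an existing qtlogging.ini first.

```sh
# Write the three logging rules from this comment to the file named in $1,
# creating the parent directory if needed.
# Caution: this overwrites the target file. The function name is hypothetical.
write_baloo_logging_rules() {
    mkdir -p "$(dirname "$1")" &&
    cat > "$1" <<'EOF'
[Rules]
kf.filemetadata=true
kf.baloo=true
EOF
}
```

It would be called as `write_baloo_logging_rules "$HOME/.config/QtProject/qtlogging.ini"`, after which the journal can be followed with `journalctl -f` while files are indexed.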
Comment 25 Guido 2022-11-21 18:32:27 UTC
(In reply to tagwerk19 from comment #24)
> (In reply to Guido from comment #21)
> > ok, I removed the WPS mimetypes and now ...
> > ...
> > lineCount: 0
> > wordCount: 0
> That doesn't look right somehow...
> 
> I have enabled debugging by creating a file
>     ~/.config/QtProject/qtlogging.ini
> containing
>     [Rules]
>     kf.filemetadata=true
>     kf.baloo=true
> 
> and checked with journalctl for debug output, maybe you see something
> there...

I tried your suggestion: rebooted, stopped baloo, purged, and re-enabled, but I see nothing in journald about indexing, only the message that baloo is starting.
Comment 26 tagwerk19 2022-11-21 18:49:24 UTC
(In reply to Stefan Brüns from comment #23)
> Bug in shared mime info, https://gitlab.freedesktop.org/xdg/shared-mime-info/
> 
> /usr/share/mime/packages/freedesktop.org.xml has :
>   <mime-type type="application/msword">
>     <sub-class-of type="application/x-ole-storage"/>
> 
> but application/vnd.ms-powerpoint has no sub-class-of. Dito for e.g. Access
> and Excel documents.
Oh dear ...

... I'm guessing that means an Override.xml file 8-/
Comment 27 tagwerk19 2022-11-21 22:35:05 UTC
(In reply to Guido from comment #25)
> (In reply to tagwerk19 from comment #24)
> I tried your suggestion, rebooted, stopped baloo, purged, reenabled but I
> have nothing in journald about indexing, only the message tha baloo is
> starting
I'll admit I've not fully understood how to get baloo to output debug messages.

My experience so far is that, having set up the qtlogging.ini file, if I do a 'balooctl purge' on a console, I see the warning/debug messages streamed to that console. I have recently found that if I redirect stderr to /dev/null, with 'balooctl purge 2> /dev/null', I see the messages in the journal instead.

I would love to know how properly to control this (Bug 460390)
Comment 28 tagwerk19 2022-11-21 22:46:31 UTC
Created attachment 153934 [details]
Override.xml file to sidestep the .xls and .ppt baloo indexing issues.

Attached an Override.xml file that adds the:
    <sub-class-of type="application/x-ole-storage"/>
lines for the "application/vnd.ms-excel" and "application/vnd.ms-powerpoint" entries.

This would be copied, as root, to the
   /usr/share/mime/packages
folder (the one that contains the freedesktop.org.xml) and the mimetype database rebuilt:
   # update-mime-database -V /usr/share/mime

That worked for me.
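
For readers who cannot open the attachment, here is a minimal sketch of what such an override might look like. It is not the attached file (which is much larger), and it assumes that shared-mime-info merges these bare entries with the existing definitions of the two types; that assumption is untested here. The function name write_ole_override is made up.

```sh
# Write a minimal override package to "$1/Override.xml".
# Hypothetical content, NOT the attached file: it only adds the missing
# sub-class-of lines and assumes they merge with the existing definitions.
write_ole_override() {
    mkdir -p "$1" &&
    cat > "$1/Override.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <mime-type type="application/vnd.ms-excel">
    <sub-class-of type="application/x-ole-storage"/>
  </mime-type>
  <mime-type type="application/vnd.ms-powerpoint">
    <sub-class-of type="application/x-ole-storage"/>
  </mime-type>
</mime-info>
EOF
}
```

A per-user variant that avoids root would be `write_ole_override "$HOME/.local/share/mime/packages"` followed by `update-mime-database "$HOME/.local/share/mime"`.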
Comment 29 tagwerk19 2022-11-21 22:52:26 UTC
(In reply to tagwerk19 from comment #28)
> That worked for me.
That should of course be...
    That worked for me, Thank you Stefan!
Comment 30 skierpage 2022-11-23 12:08:58 UTC
This bug report has gotten very hard to follow. But:
1. If I follow my own steps (with LibreOffice Writer 7.4.2.3), baloo doesn't index.
2. If I download Guido's attachment 153915 [details] baloo_test_Libreoffice_7.4.2.3.doc, baloo doesn't index.
3. If I download Guido's attachment 153916 [details] baloo_test_WPS_Office.doc, baloo does index.
4. I have old MS Office docs that baloo does index.

In all cases, the output of `catdoc FILENAME` matches baloo's indexing behavior: the files baloo doesn't index are the ones for which catdoc's output is empty and its exit code is 69.
@tagwerk19, what are your results with attachment 153915 [details] ?

I wrote
> I couldn't find any Linux utility that identifies the version of the Word file format that a .doc file uses
`file FILENAME` gives a lot of info; the non-indexed LibreOffice documents have Code page -535. I don't know if this is significant. I stepped through catdoc with gdb and for my file it didn't find an oleEntry matching WordDocument and exited with error code 69.

It is unhelpful that kfilemetadata's officeextractor.cpp doesn't log when `catdoc` fails to index anything!

kmimetypefinder identifies all of these .doc files as application/msword
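
The catdoc check described in this comment can be repeated with a small helper along these lines. It is a sketch only: check_catdoc and the CATDOC override variable are names invented here, and the helper calls catdoc with just the file name rather than the options kfilemetadata passes.

```sh
# Run catdoc (or the command named in $CATDOC, a hypothetical override used
# for testing) on the file in $1 and report whether any text came out,
# together with the exit status.
check_catdoc() {
    text=$("${CATDOC:-catdoc}" "$1" 2>/dev/null)
    status=$?
    if [ -n "$text" ]; then
        echo "text extracted (exit status $status)"
    else
        echo "no text extracted (exit status $status)"
    fi
}
```

Running `check_catdoc baloo_test_Libreoffice_7.4.2.3.doc` on an affected system should then report no text and the non-zero status, matching the files that baloo fails to index.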
Comment 31 skierpage 2022-11-24 10:29:43 UTC
@tagwerk19, it looks like in comment #18 you did try @Guido's file baloo_test_Libreoffice_7.4.2.3.doc, and according to `balooshow -x` it did index its terms. I thought maybe it's because you have a different `catdoc`, but Debian and Fedora use basically the same 0.95 version. So I'm confused. What does `catdoc baloo_test_Libreoffice_7.4.2.3.doc` output for you and what's its exit status?

(In reply to Guido from comment #4)
> if the problem is catdoc,  antiword is a good alternative
I wrote a hacky script that strips the "-s cp1252 -d utf8 -w" arguments that kfilemetadata passes to catdoc and then execs `antiword` with the remaining arguments (I think just the path to the file to index). If I put that in /usr/local/bin/catdoc (so kfilemetadata finds it first), then baloo does index baloo_test_Libreoffice_7.4.2.3.doc, yay! However, antiword can't handle a small .doc file like my one-word "baloopleaseindexme"; run from the command line, it prints "I'm afraid the text stream of this file is too small to handle."
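
The original script wasn't posted; a reconstruction of the argument-stripping part might look like the sketch below. The function name strip_catdoc_args is invented, and it assumes the only catdoc options passed are the "-s", "-d" and "-w" ones quoted above.

```sh
# Drop the catdoc-specific options "-s <charset>", "-d <charset>" and "-w",
# and print every remaining argument on its own line.
# Hypothetical reconstruction, not the script actually used in this comment.
strip_catdoc_args() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -s|-d) [ "$#" -gt 1 ] && shift ;;  # option takes a value: skip it too
            -w) ;;                             # flag without a value
            *) printf '%s\n' "$1" ;;
        esac
        shift
    done
}

# A wrapper installed as /usr/local/bin/catdoc could then end with:
#   file=$(strip_catdoc_args "$@" | tail -n 1)
#   exec antiword "$file"
```

Printing one argument per line breaks for paths containing newlines, which seemed an acceptable limitation for a stopgap like this.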
Comment 32 tagwerk19 2022-12-01 17:58:57 UTC
(In reply to Stefan Brüns from comment #23)
> ... Bug in shared mime info, https://gitlab.freedesktop.org/xdg/shared-mime-info/
It looks like there is also .ppt and .xls mimetype info in /usr/share/mime/packages/libreoffice.xml. These are also without the:

    <sub-class-of type="application/x-ole-storage"/>

I don't know what happens when there are multiple, distinct, entries for a mime type - but Override.xml, https://bugs.kde.org/attachment.cgi?id=153934, seems to override both.
Comment 33 tagwerk19 2022-12-01 18:11:38 UTC
(In reply to skierpage from comment #31)
> ... What does `catdoc baloo_test_Libreoffice_7.4.2.3.doc` output for you ...
I've not tried catdoc as a command before, but as they say, every day a learning day :-)

On Neon Testing (rebased on Ubuntu 22.04) and catdoc 0.95

    $ catdoc baloo_test_Libreoffice_7.4.2.3.doc
    $ catdoc baloo_test_WPS_Office.doc

both worked and gave me the "KDE today announces..." text.

However on Fedora 37 and Manjaro, also with catdoc 0.95:

    $ catdoc baloo_test_WPS_Office.doc

worked but:

    $ catdoc baloo_test_Libreoffice_7.4.2.3.doc

gave nothing and I see the same as skierpage:
> ... the output of `catdoc FILENAME` matches baloo's indexing behavior

Where catdoc fails, I get the same:
> lineCount: 0
> wordCount: 0
as Guido (in Comment 21)
Comment 34 tagwerk19 2022-12-01 18:26:11 UTC
(In reply to skierpage from comment #0)
> ADDITIONAL INFORMATION
> There are tools to extract text from MSOffice files...
That is a good lead, thanks!

Looks like you can convert a doc to text with:

    $ libreoffice --headless --convert-to "txt:Text (encoded):UTF8" document.doc

or stream the text to stdout, minimally with:

    $ libreoffice --cat document.doc

but this can give some "extraneous" warning messages. I'm trying out:

    $ libreoffice --headless --safe-mode --cat document.doc

and:

    $ libreoffice --headless "-env:UserInstallation=file:///tmp/Baloo_Conversion_${USER}" --cat document.doc

It seems that this conversion ought to work more generally, but I get failures with .xls or .ppt files; maybe watch:

    https://bugs.documentfoundation.org/show_bug.cgi?id=150846
Comment 35 tagwerk19 2022-12-01 18:33:11 UTC
Finally, I certainly had issues with the mime type database. Following Stefan's suggestion in comment 23 fixed it for me. Neon Testing, Fedora 37 and Manjaro all have the same issue; they all need the Override.xml.

The mime type fix is necessary but not sufficient.
Comment 36 tagwerk19 2022-12-01 18:56:05 UTC
Confirming...
Comment 37 skierpage 2022-12-02 07:22:40 UTC
> On Neon Testing (rebased on Ubuntu 22.04) and catdoc 0.95
> 
>     $ catdoc baloo_test_Libreoffice_7.4.2.3.doc
>     $ catdoc baloo_test_WPS_Office.doc
> 
> both worked and gave me the "KDE today announces..." text.
Thanks! I think I figured it out. Even though every distro and upstream are all at version 0.95, Debian has a patch to catdoc that fixes this bug https://bugs.debian.org/874048 (and carries some other catdoc patches), but upstream lacks it and so Fedora lacks it too. I filed https://bugzilla.redhat.com/2150140.

So the problem with LibreOffice .doc files on Fedora can be RESOLVED > UPSTREAM. This should be two bug reports, one for LibreOffice .doc files and another for the .ppt and .xls mimeinfo bug; the current bug title doesn't match either problem.
Comment 38 tagwerk19 2022-12-02 10:01:07 UTC
(In reply to Stefan Brüns from comment #23)
> Bug in shared mime info, https://gitlab.freedesktop.org/xdg/shared-mime-info/
> 
> /usr/share/mime/packages/freedesktop.org.xml has :
>   <mime-type type="application/msword">
>     <sub-class-of type="application/x-ole-storage"/>
> 
> but application/vnd.ms-powerpoint has no sub-class-of. Dito for e.g. Access
> and Excel documents.
Reported upstream:
    https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues/190
Comment 40 Stefan Brüns 2023-11-11 02:03:18 UTC
These are bugs in several upstream projects (catdoc, shared-mime-info), which should contain the fixes by now.