410680 – KFileMetaData plain text extractor sometimes fails for non-UTF text files

Summary: KFileMetaData plain text extractor sometimes fails for non-UTF text files

Status:	CONFIRMED

Alias:	None

Product:	frameworks-kfilemetadata
Classification:	Frameworks and Libraries
Component:	general
Version:	5.115.0
Platform:	Fedora RPMs Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Stefan Brüns

URL:
Keywords:

Duplicates (1):	440537
Depends on:
Blocks:

Reported:	2019-08-07 04:09 UTC by skierpage
Modified:	2024-03-18 14:13 UTC
CC List:	2 users

See Also:
Latest Commit:
Version Fixed In:

Description skierpage 2019-08-07 04:09:47 UTC

SUMMARY
I realized `baloosearch TERM` wasn't returning a 750 kB HTML document that I knew contained TERM starting at byte offset 72,814, although it does return the file if TERM is nearer the start. I reproduced this with a 100 kB file: baloosearch doesn't return the file if TERM is more than about 61,600 bytes from the start. I also reproduced it with a big HTML file off the web.

STEPS TO REPRODUCE
1. Find a big HTML file (over 100 kB), look for a word that only appears near the end, or just insert <p>NOSUCHWORD</p> somewhere near the end of the file. (I found https://demo.borland.com/testsite/stadyn_largepagewithimages.html but got inconsistent results.)
2. Run `balooctl monitor` in a terminal
3. Copy the HTML file to a location that Baloo indexes, e.g. your home directory
4. After `balooctl monitor` reports it's Indexing: file, then Idle, enter `baloosearch NOSUCHWORD`. For example, I found (using `rg --byte-offset`) that "SSLv3" first appears 85,249 bytes into that test file, and baloosearch doesn't return the file.
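
The file-preparation part of these steps can be sketched as a small script. This is only an illustration: the function names, the filler text, and the paragraph count are made up for the example, and GNU-style `grep -abo` is assumed to be available for byte offsets (the report itself used `rg --byte-offset`).

```shell
#!/bin/sh
# make_test_html FILE PARAGRAPHS MARKER
# Write an HTML file of filler paragraphs with MARKER in the last paragraph.
make_test_html() {
    file=$1
    paragraphs=$2
    marker=$3
    {
        echo '<html><body>'
        i=1
        while [ "$i" -le "$paragraphs" ]; do
            echo "<p>filler paragraph number $i</p>"
            i=$((i + 1))
        done
        echo "<p>$marker</p>"
        echo '</body></html>'
    } > "$file"
}

# marker_offset FILE MARKER
# Print the zero-based byte offset of the first occurrence of MARKER in FILE.
marker_offset() {
    grep -abo -- "$2" "$1" | head -n 1 | cut -d: -f1
}

tmpdir=$(mktemp -d)
make_test_html "$tmpdir/baloo-offset-test.html" 3000 NOSUCHWORD
echo "file:   $tmpdir/baloo-offset-test.html"
echo "size:   $(wc -c < "$tmpdir/baloo-offset-test.html") bytes"
echo "offset: $(marker_offset "$tmpdir/baloo-offset-test.html" NOSUCHWORD)"
# Next, copy the file into an indexed folder, wait for the indexer to go
# idle, and run:  baloosearch NOSUCHWORD
```

Raising or lowering the paragraph count moves the marker further from or closer to the start, which makes it easier to narrow down the offset at which the search stops finding the file.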

OBSERVED RESULT
Baloo doesn't index words beyond "a certain point" in an HTML file.

EXPECTED RESULT
Baloo should index the entire file... except when it intentionally doesn't.

I found a five-year-old plasma-devel thread https://plasma-devel.kde.narkive.com/TJAmjxUb/baloo-not-indexing-everything-by-default in which someone suggested "Just index the first say 100 KiB or so of a file". I don't know if that was implemented. If it was, there *MUST* be good documentation of this, plus logging and warnings when Baloo intentionally doesn't index part or all of a file. E.g. `balooshow path/to/file` could say "Large file, only the first 64 KiB of text in it was indexed."

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: 
(available in About System)
KDE Plasma Version: 5.15.5
KDE Frameworks Version: 5.59.0
Qt Version: 5.12.4, xcb

ADDITIONAL INFORMATION
There doesn't seem to be any way to run baloo_file_indexer yourself to find out what it gets from a file. Nor could I figure out what baloo-widgets' baloo_filemetadata_temp_extractor does, or how to get useful logging of text extraction. This all makes debugging painful.

The Baloo source README.md says "Baloo relies on [KFileMetaData](https://api.kde.org/frameworks/kfilemetadata/html/index.html) to extract content from the files", so maybe the problem lies in that library. There's no specific extractor in either project for HTML files.

Comment 1 tagwerk19 2021-07-09 21:53:02 UTC

Looks like it's fixed along the way...

I can run a bash script
    for i in {1..1000000}; do echo "$i" >> largefile.txt; done
    echo "9999999" >> largefile.txt
wait for a few seconds and then
    baloosearch 9999999
which finds "largefile.txt"

For an html, the equiv:
    echo "<HTML><BODY>" > largefile.html
    for i in {1..1000000}; do echo "<P>$i</P>" >> largefile.html; done
    echo "<P>9999999</P>" >> largefile.html
    echo "</BODY></HTML>" >> largefile.html
and try
    baloosearch 9999999
again - shows largefile.html

It's possible that committing the changes from memory to disc takes time after `balooctl monitor` says the file has been indexed. For the million-line file, the baloo index is about 14 MB.

Checked on
    Neon Unstable
    Plasma: 5.22.80
    Frameworks: 5.84.0
    Qt: 5.15.3

Comment 2 skierpage 2021-07-14 09:56:41 UTC

(In reply to tagwerk19 from comment #1)
> Looks like it's fixed along the way...

It works for your nifty test files, but my steps to reproduce still fail.

I wrote
> There doesn't seem to be any way to run baloo_file_indexer yourself to
> find out what it gets from a file. Nor could I figure out what the 
> baloo_filemetadata_temp_extractor does, or how to get 
> useful logging of text extraction. This all makes debugging painful.

You can use `balooshow -x path/to/file` to see what terms baloo_file indexed. For stadyn_largepagewithimages.html, it is very few words. It even misses words like "Design" and "Principles", which are in the first 2500 bytes!

I ran strace on baloo_file and its children. One of them opens and reads changed files, but it only read 16,384 bytes from this test file. However, that incomplete read should have included those words.

I guess I'll have to look at the file extractors' source code or somehow step through it in gdb. Do you know how to run the binaries by hand?

Comment 3 skierpage 2021-07-14 10:03:16 UTC

(In reply to tagwerk19 from comment #1)
> Looks like it's fixed along the way...
> ...
>     Frameworks: 5.84.0

I'm only on 5.83.0, so maybe this is fixed in a newer KDE Frameworks. But it doesn't look like there were significant changes in frameworks/baloo or frameworks/kfilemetadata projects.

Comment 4 tagwerk19 2021-07-14 17:01:59 UTC

(In reply to skierpage from comment #2)
> ... stadyn_largepagewithimages.html ...
I think I'd point a finger of doubt at a "Copyright Symbol" in the line:

    ? 1997 CheckFree Corp.

There's a plain A9 hex there, maybe a bit "old school". Try converting the file to unicode...

    iconv -f ISO-8859-1 -t utf-8 stadyn_largepagewithimages.html > test.html
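
A quick way to check whether a file needs this conversion at all is to ask iconv to decode it as UTF-8 and look at the exit status. The sketch below assumes a GNU-style `iconv` that exits non-zero on invalid input; `is_valid_utf8` is a made-up helper name, and ISO-8859-1 is only a guess at the source encoding, as in the command above.

```shell
#!/bin/sh
# is_valid_utf8 FILE
# Succeeds only if FILE decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Sample file containing a bare 0xA9 byte (octal 251), which is the
# copyright sign in ISO-8859-1 but is not valid UTF-8 on its own.
sample=$(mktemp)
printf '\251 1997 CheckFree Corp.\n' > "$sample"

if is_valid_utf8 "$sample"; then
    echo "valid UTF-8, no conversion needed"
else
    echo "not valid UTF-8, converting from ISO-8859-1 (assumed source encoding)"
    iconv -f ISO-8859-1 -t UTF-8 "$sample" > "$sample.utf8"
    is_valid_utf8 "$sample.utf8" && echo "converted copy is valid UTF-8"
fi
```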

Comment 5 skierpage 2021-07-14 23:02:51 UTC

(In reply to tagwerk19 from comment #4)
> (In reply to skierpage from comment #2)
> > ... stadyn_largepagewithimages.html ...
> ... There's a plain A9 hex there, maybe a bit "old school". Try converting the
> file to unicode...
> 
>     iconv -f ISO-8859-1 -t utf-8 stadyn_largepagewithimages.html > test.html

And the terms indexed according to `balooshow -x` jumped from 129 words to 2671! You win teh InterWebz. Now "Design" and "Principles" are indexed 🎉 ... but still not words later on like "SSLv3" and "CANPENDING". However, laboriously running `strace --follow-forks` on baloo_file suggests that some child process (baloo_file_extractor?) does read the entire UTF-8 file's contents. I'll try to research that problem more.

I strace'd baloo_file on the original non-UTF-8 files, and some child process does one 4096-byte read of the start of the file, then packs it in! That's why baloo indexed so few terms in the original files; I filed bug 439857.

Comment 6 tagwerk19 2021-07-15 06:33:43 UTC

(In reply to skierpage from comment #5)
> ... Now "Design" and "Principles" are indexed 🎉 ...  but
> still not words later on like "SSLv3" and "CANPENDING" ...
You don't get anything from:
    baloosearch SSLv3
maybe you are stumbling over "case issues"?

For me, if I add some unique text in at the end of the test-utf8.html file, baloo finds it...

> I strace'd baloo_file of the original non-utf-8 files, and some child
> process does one 4096-byte read of the start of the file, then packs it in!
> That's why baloo indexed so few terms in the original files; I filed bug
> 439857.
Yes, I'd say the indexer met a "non-valid character" and stopped. Best consider this file as messed up in terms of encoding 8-]

It has "charset" information in an HTTP-EQUIV header line - but commented out.

   <!--<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">  -->

However even if I take out the commenting to test whether the indexing recognises the HTTP-EQUIV, the indexing still fails.

I feel I don't always see all of baloo's error messages but the trick of running "balooctl purge" means that they start appearing on screen. I then get:

    Invalid encoding. Ignoring "/home/test/stadyn_largepagewithimages.html"

Ideally this file would be flagged "failed to index"

Comment 7 tagwerk19 2021-08-01 22:13:36 UTC

Revisiting, with the aim of seeing if there are any file size limits.

No problems up to about 10MB.
Above that, and it seems something of a "rough" number, the files are not indexed.

Tested for .txt and .html.

Seems it is not "terms are indexed up to the 10MB mark" but "if more than 10MB, don't index"
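
A size test like this can be set up with a few generated files that bracket the limit. The sketch below is an illustration only: the helper name, the filler text, the marker words, and the choice of decimal megabytes (1,000,000 bytes) are assumptions, and it relies on `head -c`, which is widely available but not POSIX.

```shell
#!/bin/sh
# make_sized_text FILE BYTES MARKER
# Write BYTES bytes of filler lines to FILE, then append MARKER on its own
# line, so the marker always sits at the very end of the file.
make_sized_text() {
    yes 'filler line for the size limit test' | head -c "$2" > "$1"
    printf '\n%s\n' "$3" >> "$1"
}

# Output directory: first argument, or a fresh temporary directory.
outdir=${1:-$(mktemp -d)}
for mb in 9 10 11; do
    make_sized_text "$outdir/size-test-$mb.txt" \
        $((mb * 1000 * 1000)) "sizelimitmarker$mb"
done
ls -l "$outdir"
# After copying the files into an indexed folder and letting the indexer
# go idle, search for each marker, e.g.:  baloosearch sizelimitmarker10
```

If a file above the limit is skipped entirely rather than partly indexed, then neither its marker nor any of its filler words should return it.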

Comment 8 skierpage 2021-08-03 04:10:39 UTC

(In reply to tagwerk19 from comment #6)

> I feel I don't always see all of baloo's error messages but the trick of
> running "balooctl purge" means that they start appearing on screen. I then
> get:
> 
>     Invalid encoding. Ignoring "/home/test/stadyn_largepagewithimages.html"

I think that's related to whether you see `qDebug()` output, not restarting balooctl.

I filed bug 440537 that KFileMetaData's plaintext extractor should handle other character encodings.

> Ideally this file would be flagged "failed to index"
Great idea; please file a bug. What actually happens is that the contents of all the lines of the text file up to the one with the invalid character are indexed, so it's more "incompletely indexed" (which is even more frustrating!).
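
To estimate where indexing would stop under that reading, one can look for the first line that is not valid UTF-8. The sketch below is not KFileMetaData's own logic, just a rough line-by-line check using `iconv` (assumed to exit non-zero on invalid input); the function name is made up, and running one `iconv` per line is slow on large files.

```shell
#!/bin/sh
# first_invalid_utf8_line FILE
# Print the 1-based number of the first line that is not valid UTF-8 and
# return 0; print nothing and return 1 if every line decodes cleanly.
first_invalid_utf8_line() {
    n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        if ! printf '%s' "$line" | iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1; then
            echo "$n"
            return 0
        fi
    done < "$1"
    return 1
}

# Sample: the third line holds a bare 0xA9 byte (octal 251).
sample=$(mktemp)
printf 'Design\nPrinciples\n\251 1997 CheckFree Corp.\nSSLv3\n' > "$sample"
if bad=$(first_invalid_utf8_line "$sample"); then
    echo "first invalid line: $bad"
else
    echo "all lines are valid UTF-8"
fi
```

Comparing that line number with the terms listed by `balooshow -x` gives a rough idea of whether the words that made it into the index all come from before the offending line.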

(In reply to tagwerk19 from comment #7)
> No problems up to about 10MB.
> Above that, and it seems something of a "rough" number, the files are not
> indexed.

Yup, that is an undocumented limit for text files in Baloo's file processing. I added a new section https://community.kde.org/Baloo#Indexing_limitations with what I've learned. I still haven't figured out what goes wrong indexing terms far down in large-but-not-10MB UTF-8 files.

Comment 9 Stefan Brüns 2023-11-14 02:13:09 UTC

Can not be reproduced, provided link (https://demo.borland.com/testsite/stadyn_largepagewithimages.html) is dead.

Comment 10 tagwerk19 2023-11-14 08:16:03 UTC

(In reply to Stefan Brüns from comment #9)
> Can not be reproduced, provided link
> (https://demo.borland.com/testsite/stadyn_largepagewithimages.html) is dead.
Maybe on Archive.org
    https://web.archive.org/web/20131225011444/https://demo.borland.com/testsite/stadyn_largepagewithimages.html
Not bad, archived on Christmas :-)

However, if I trust my troubleshooting in Comment 4, the issue was a non-UTF Copyright symbol. It's possible that these "strange characters" are now caught by
    https://invent.kde.org/frameworks/baloo/-/merge_requests/87

Comment 11 Stefan Brüns 2024-03-05 15:41:24 UTC

If you want the original content (without Wayback inserts/link mangling), you have to use:

https://web.archive.org/web/20131225011444im_/https://demo.borland.com/testsite/stadyn_largepagewithimages.html

Comment 12 Stefan Brüns 2024-03-18 14:12:18 UTC

*** Bug 440537 has been marked as a duplicate of this bug. ***