Bug 439857 - baloo only indexes first 4096 bytes of non-UTF-8 text and html files
Status: RESOLVED NOT A BUG
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: Baloo File Daemon
Version: 5.83.0
Platform: Fedora RPMs Linux
Importance: NOR major
Target Milestone: ---
Assignee: baloo-bugs-null
Reported: 2021-07-14 22:57 UTC by skierpage
Modified: 2024-03-05 15:28 UTC

Description skierpage 2021-07-14 22:57:28 UTC
SUMMARY
While investigating bug 410680, @tagwerk19 figured out that a problematic file had an ISO-8859 copyright symbol at the start. By laboriously running `strace --follow-forks` on baloo_file, I determined that some child process (baloo_file_extractor?) reads the first 4096 bytes of the file and then gives up. Sure enough, baloo_file only indexes terms that appear in the first 4096 bytes of the file.

This is terrible behavior for anyone relying on Baloo. Your file appears to be indexed with no errors, but baloo will only return it in certain search results. Until this is fixed (the bug may lie in frameworks/kfilemetadata) there _has_ to be some warning to this effect, both in documentation and in the operation of baloo_file.

STEPS TO REPRODUCE
0. Run `balooctl monitor`
1. Save an HTML file with a non-UTF-8 character near the start to a location that Baloo indexes. I used https://demo.borland.com/testsite/stadyn_largepagewithimages.html 
2. balooctl monitor should report "Indexing: /path/to/file.html"
3. Run `balooshow -x /path/to/file.html`
4. To prove that only the first 4096 bytes are indexed, save them to a new file, file_start.html (use vim's :goto 4096 to go to byte offset 4096).
5. Run `balooshow -x /path/to/file_start.html`

6. Repeat these steps with a text file. I saved the demo file as text in my browser.
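
Step 4 can also be done without an editor. A minimal sketch, assuming Python 3 is available; `write_prefix` is a hypothetical helper name and is not part of Baloo:

```python
from pathlib import Path


def write_prefix(src: str, dst: str, size: int = 4096) -> int:
    """Copy the first `size` bytes of src to dst; return the number of bytes written."""
    with open(src, "rb") as f:
        head = f.read(size)  # raw bytes, so no character decoding is involved
    Path(dst).write_bytes(head)
    return len(head)
```

For example, `write_prefix("/path/to/file.html", "/path/to/file_start.html")` produces the shorter file used in step 5.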

OBSERVED RESULT
baloo only indexes terms found in the first 4096 bytes of the HTML and text files.
The output of `balooshow -x` on the shorter file includes exactly the same Terms: line.

EXPECTED RESULT
Baloo should index all text files and HTML files.
While this bug exists, better warnings and logging from `baloo_file` daemon and `balooctl monitor` are essential.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: Fedora 34 KDE spin
KDE Plasma Version: 5.22.3
KDE Frameworks Version: 5.83.0
Qt Version: 5.12.2 on Wayland

ADDITIONAL INFORMATION
Text and HTML files in other encodings that contain bytes which are invalid as UTF-8 probably trigger this bug as well.
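
To check whether a given file contains bytes that are invalid as UTF-8, and where the first one is, something like the following can be used. This is a sketch only; `first_invalid_utf8_offset` is a hypothetical helper, not a Baloo or KFileMetaData function:

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None if data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start  # offset where strict decoding failed
    return None
```

An ISO-8859-1 copyright symbol is the single byte 0xA9, which is not valid as the start of a UTF-8 sequence, so a file beginning with it reports offset 0.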

Detecting a file's character encoding is hard, but browsers do it pretty well and have open-source implementations. Simply continuing to read and index the file despite any character encoding issues would be better.
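
The "keep reading despite encoding issues" idea can be illustrated as follows. This is a hedged sketch of the suggestion, not how Baloo or KFileMetaData tokenize text; `extract_terms` is a hypothetical name, and the word-splitting rule is an assumption chosen for brevity:

```python
import re
from typing import Set


def extract_terms(data: bytes) -> Set[str]:
    """Decode leniently and return lowercase word terms from the whole buffer."""
    # Undecodable bytes become U+FFFD instead of aborting the decode.
    text = data.decode("utf-8", errors="replace")
    # U+FFFD is not a word character, so it simply acts as a separator.
    return {word.lower() for word in re.findall(r"\w+", text)}
```

With this approach a stray ISO-8859-1 byte at the start of a file costs at most the term it belongs to, and terms beyond byte 4096 are still found.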

It is very difficult to trace what's going on because baloo_file_extractor, baloo_filemetadata_temp_extractor, the kfileextractors, and the file indexing process in general are largely undocumented. This is mentioned in bug 398101 but it's a more extensive problem.

As a workaround, you can convert files to UTF-8; @tagwerk19 suggests `iconv -f ISO-8859-1 -t utf-8 /path/to/file.extension > /path/to/file_utf8.ext`. There seem to be other bugs in indexing large files; see later comments in bug 410680.
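
Where `iconv` is not available, the same conversion can be sketched in Python. `transcode_to_utf8` is a hypothetical helper, and ISO-8859-1 as the source encoding is an assumption that must match the actual file:

```python
def transcode_to_utf8(src: str, dst: str, source_encoding: str = "iso-8859-1") -> None:
    """Rewrite src as UTF-8 in dst, assuming src is encoded in source_encoding."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src, "r", encoding=source_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)
```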
Comment 1 Stefan Brüns 2023-11-14 02:10:06 UTC
You are mistaken. It reads the first 4096 bytes to get the correct mimetype based on the contents.

See https://doc.qt.io/qt-6/qmimedatabase.html#mimeTypeForData
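
To illustrate why a 4096-byte read shows up in strace even though it is unrelated to text extraction, here is a toy content sniffer. It is an assumption-laden sketch and not Qt's detection algorithm: `looks_like_html` is a hypothetical function, and the two markers it checks are chosen only for illustration:

```python
def looks_like_html(path: str, sniff_size: int = 4096) -> bool:
    """Guess whether a file is HTML by inspecting only its first sniff_size bytes."""
    with open(path, "rb") as f:
        head = f.read(sniff_size)  # only the header is read for type detection
    lowered = head.lstrip().lower()
    return lowered.startswith((b"<!doctype html", b"<html"))
```

The type check needs only the header; any later extraction step would open and read the file separately.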
Comment 2 Stefan Brüns 2024-03-05 15:28:50 UTC
Not a baloo bug. The text is not extracted by KFileMetaData's plaintextextractor, but a bug report already exists for that.