Bug 440537 - KFileMetaData plain text extractor sometimes fails for non-UTF text files
Status: RESOLVED DUPLICATE of bug 410680
Alias: None
Product: frameworks-kfilemetadata
Classification: Frameworks and Libraries
Component: general
Version: 5.84.0
Platform: Fedora RPMs Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Pinak Ahuja
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-08-03 04:06 UTC by skierpage
Modified: 2024-03-18 14:12 UTC
CC List: 1 user

See Also:
Latest Commit:
Version Fixed In:


Attachments

Description skierpage 2021-08-03 04:06:36 UTC
SUMMARY
KFileMetaData's plain text extractor stops part-way through extracting text from non-UTF-8 text files, including HTML files, leading to unexpected indexing behavior from Baloo (see bug 410680 comments).

STEPS TO REPRODUCE
1. Enable qDebug() output, for example by setting the QT_LOGGING_RULES environment variable to "*.debug=true".
2. Create a text file with ISO-8859-1 encoding that contains a non-ASCII character such as ©, e.g. using KWrite's File > Save As with Encoding.
2b. Save an HTML file with a non-UTF-8 encoding.
3. Either run KFileMetaData's "dump" test utility with `~/kde/build/frameworks/kfilemetadata/bin/dump -f path/to/myiso8859file.txt`, or watch the output of `balooctl monitor`.
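
Step 2 can also be scripted. The following is a minimal Python sketch, not part of KFileMetaData; the file location and the helper `is_valid_utf8` are hypothetical names chosen for illustration. It writes an ISO-8859-1 file containing © and checks that the resulting bytes are not valid UTF-8.

```python
import os
import tempfile

# Hypothetical location, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "myiso8859file.txt")

lines = ["first line, plain ASCII", "copyright \u00a9 2021", "last line, plain ASCII"]
with open(path, "w", encoding="iso-8859-1", newline="\n") as f:
    f.write("\n".join(lines) + "\n")

with open(path, "rb") as f:
    raw = f.read()


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# In ISO-8859-1 the copyright sign is the single byte 0xA9,
# which is not a valid UTF-8 sequence on its own.
print(is_valid_utf8(raw))  # False
```

The printed path can then be passed to the dump utility from step 3.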


OBSERVED RESULT
You may see the debug output
KFileMetaData::PlainTextExtractor::extract: Invalid encoding. Ignoring "/path/to/myiso8859file.txt"

If Baloo indexes the location where you saved the file, only the words before the line containing the non-ASCII character will be indexed.

EXPECTED RESULT

It's tricky, but it's often possible to detect the encoding of a text file and correctly index its contents instead of bailing. For an HTML file, it's arguably wrong to ignore a charset=ISO-8859-1 specification at the top.
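
One possible fallback strategy is sketched below in Python. This is an illustration under stated assumptions, not what KFileMetaData does: the function name `decode_text`, the 1024-byte search window, and the ISO-8859-1 fallback are all hypothetical choices. It honors a declared charset if one is found, otherwise tries strict UTF-8, and finally falls back to a single-byte encoding that cannot fail.

```python
import codecs
import re

# Matches charset=NAME with optional quotes, as found in an HTML <meta> tag.
CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)


def decode_text(raw: bytes, fallback: str = "iso-8859-1") -> tuple:
    """Return (text, encoding_used) for the given bytes."""
    # 1. Honor a charset declared near the top of the file, if any.
    match = CHARSET_RE.search(raw[:1024])
    if match:
        name = match.group(1).decode("ascii")
        try:
            codecs.lookup(name)  # raises LookupError for unknown names
            return raw.decode(name), name.lower()
        except (LookupError, UnicodeDecodeError):
            pass
    # 2. Try strict UTF-8.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Fall back to a single-byte encoding; ISO-8859-1 maps every byte.
    return raw.decode(fallback), fallback


html = (b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=ISO-8859-1">\n<p>\xa9 2021</p>')
text, used = decode_text(html)
print(used)
```

The fallback step will produce wrong characters for files in other legacy encodings, so a real implementation would likely want a proper detection library instead.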

It seems plaintextextractor.cpp just uses the standard QTextCodec::codecForLocale(), and if codec->toUnicode() reports invalid characters (invalidChars in the converter state), it bails part-way through the file. I don't know what would happen if it kept going.
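
The behavior described above can be imitated in a few lines of Python. This is only an analogy to the C++ code, assuming the locale codec is UTF-8; `extract_lines` and its `keep_going` flag are hypothetical. With `keep_going=False` it stops at the first undecodable line, and with `keep_going=True` it substitutes U+FFFD and continues, which hints at what "keeping going" might look like.

```python
def extract_lines(raw: bytes, keep_going: bool = False) -> list:
    """Decode line by line as UTF-8, stopping or substituting on errors."""
    out = []
    for line in raw.split(b"\n"):
        try:
            out.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            if not keep_going:
                break  # bail part-way through, as the report describes
            # Replace undecodable bytes with U+FFFD and carry on.
            out.append(line.decode("utf-8", errors="replace"))
    return out


sample = b"first line\ncopyright \xa9 2021\nlast line"
print(extract_lines(sample))                   # ['first line']
print(extract_lines(sample, keep_going=True))  # all three lines, with U+FFFD for 0xA9
```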

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: 
KDE Plasma Version: 5.22.4
KDE Frameworks Version: 5.84.0
Qt Version: 5.15.2 (Wayland)

ADDITIONAL INFORMATION
The command `balooctl failed` should report these files as failing or incompletely indexed; that it does not is a Baloo bug.

I documented this in a new https://community.kde.org/Baloo#Indexing_limitations section.
Comment 1 Stefan Brüns 2024-03-18 14:12:18 UTC
If the non-ASCII characters (bytes) still form a valid UTF-8 code sequence (something that often appears when a UTF-8 text is interpreted as e.g. ISO-8859-1), the file will still be extracted completely, though with some occasional incorrect codepoints.
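
A small Python illustration of this case, as a sketch only (the variable names are arbitrary): the two characters "Â©" saved as ISO-8859-1 yield the bytes C2 A9, which are also the valid UTF-8 encoding of "©". A UTF-8 decoder accepts them without error but produces a different codepoint than the one that was saved.

```python
# "Â©" (U+00C2 U+00A9) saved as ISO-8859-1 becomes the two bytes C2 A9.
saved_text = "\u00c2\u00a9"
raw = saved_text.encode("iso-8859-1")

# The same bytes are valid UTF-8, so decoding succeeds...
as_utf8 = raw.decode("utf-8")

# ...but the result is the single character "©", not the two characters saved.
print(raw.hex(), len(saved_text), len(as_utf8))  # c2a9 2 1
```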

As this bug report does not contain anything beyond the information already present in bug 410680, I am closing it as a duplicate.

*** This bug has been marked as a duplicate of bug 410680 ***