SUMMARY
KFileExtractor stops part-way through extracting text from non-UTF-8 text files, including HTML files, which leads to unexpected indexing behavior in Baloo (see the comments on bug 410680).

STEPS TO REPRODUCE
1. Do the mystery incantation to see qDebug() output.
2. Create a text file with ISO-8859-1 encoding and a non-ASCII character such as © in it, e.g. using KWrite's File > Save As with Encoding.
2b. Save an HTML file with a different encoding.
3. Either run KFileMetadata's "dump" test utility with `~/kde/build/frameworks/kfilemetadata/bin/dump -f path/to/myiso8859file.txt`, or watch the output of `balooctl monitor`.

OBSERVED RESULT
You may see the debug output
KFileMetaData::PlainTextExtractor::extract: Invalid encoding. Ignoring "/path/to/myiso8859file.txt"
If Baloo indexes the location where you saved the file, only the words before the line containing the non-ASCII character will be indexed.

EXPECTED RESULT
It's tricky, but it's entirely possible to detect the encoding of any text file and correctly index its contents instead of bailing. For an HTML file, it's arguably wrong to ignore a charset=ISO-8859-1 specification at the top.
It seems plaintextextractor.cpp just uses a standard QTextCodec::codecForLocale(), and if codec->toUnicode() reports invalidChars, it just bails half-way through; I don't know what would happen if it kept going.

SOFTWARE/OS VERSIONS
Linux/KDE Plasma:
KDE Plasma Version: 5.22.4
KDE Frameworks Version: 5.84.0
Qt Version: 5.15.2 (Wayland)

ADDITIONAL INFORMATION
The command `balooctl failed` needs to report these files as failing or incompletely indexed; that's a Baloo bug. I documented this in a new https://community.kde.org/Baloo#Indexing_limitations section.
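
To illustrate the "keep going instead of bailing" idea from the expected result, here is a minimal sketch in plain C++. It is not the KFileMetadata code and uses no Qt API: the function names (isValidUtf8, latin1ToUtf8, decodeWithFallback) are hypothetical, and treating every non-UTF-8 file as ISO-8859-1 is an assumption made only for this example. Real encoding detection would need more than this.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// UTF-16 surrogates, and code points above U+10FFFF.
bool isValidUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c < 0x80) {          // plain ASCII
            ++i;
            continue;
        }
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // decoded code point
        if (c >= 0xC2 && c <= 0xDF) {
            extra = 1; cp = c & 0x1F;
        } else if (c >= 0xE0 && c <= 0xEF) {
            extra = 2; cp = c & 0x0F;
        } else if (c >= 0xF0 && c <= 0xF4) {
            extra = 3; cp = c & 0x07;
        } else {
            return false;        // invalid lead byte (e.g. a lone 0xA9)
        }
        if (i + extra >= n) {
            return false;        // sequence is cut off at end of input
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80) {
                return false;    // not a continuation byte
            }
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (extra == 2 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) {
            return false;        // overlong form or surrogate
        }
        if (extra == 3 && (cp < 0x10000 || cp > 0x10FFFF)) {
            return false;        // overlong form or out of range
        }
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: re-encodes ISO-8859-1 bytes as UTF-8.
// Every Latin-1 byte maps directly to the code point of the same value.
std::string latin1ToUtf8(const std::string &bytes)
{
    std::string out;
    out.reserve(bytes.size());
    for (const char ch : bytes) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Assumed policy for this sketch: keep valid UTF-8 as-is, otherwise
// reinterpret the whole buffer as ISO-8859-1 rather than giving up.
std::string decodeWithFallback(const std::string &bytes)
{
    return isValidUtf8(bytes) ? bytes : latin1ToUtf8(bytes);
}
```

With this policy, a file containing the single ISO-8859-1 byte 0xA9 (©) is converted in full rather than being cut off at the first invalid byte. The trade-off is that a file in some other legacy encoding would be indexed with wrong characters, which is why a charset declaration in an HTML file would be the better source of truth where one exists.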
If the non-ASCII bytes happen to form a valid UTF-8 code sequence, the file will still be extracted completely, though with some occasional incorrect codepoints. (Such byte sequences often show up when UTF-8 text has been interpreted as e.g. ISO-8859-1.)

As this bug report does not contain anything beyond the information already present in bug 410680, closing as duplicate.

*** This bug has been marked as a duplicate of bug 410680 ***