Summary: | Baloo fails to index plain text files unless extension is .txt | ||
---|---|---|---|
Product: | [Frameworks and Libraries] frameworks-baloo | Reporter: | John Andersen <jsamyth> |
Component: | general | Assignee: | Vishesh Handa <me> |
Status: | RESOLVED FIXED | ||
Severity: | grave | CC: | aspotashev, bhush94, me, pinak.ahuja |
Priority: | NOR | ||
Version: | 5.19.0 | ||
Target Milestone: | --- | ||
Platform: | Other | ||
OS: | Linux | ||
Latest Commit: | https://quickgit.kde.org/?p=baloo.git&a=commit&h=06efd6c05c15a64b53daac9e598666af584488ec | Version Fixed In: | |
Sentry Crash Report: |
Description
John Andersen
2016-01-16 21:55:26 UTC
Error persists in Baloo 5.18 as well.

Persists in Baloo 5.19.0 as well. There should be a method to whitelist extensions one purposely wants to content-index, perhaps stored in baloofilerc.

This is the intended behavior, for files having text/plain mimetype. This was done to avoid the mess caused by applications which have log files in directories that are indexed by baloo.

Though text files with a valid extension like .md .markdown should still be indexed because they have the mimetype: text/markdown but right now they are also not being indexed because baloo is somehow misinterpreting the mimetype. I'm looking into it.

(In reply to Pinak Ahuja from comment #3)
> This is the intended behavior, for files having text/plain mimetype. This
> was done to avoid the mess caused by applications which have log files in
> directories that are indexed by baloo.
>
> Though text files with a valid extension like .md .markdown should still be
> indexed because they have the mimetype: text/markdown but right now they are
> also not being indexed because baloo is somehow misinterpreting the
> mimetype. I'm looking into it.

But this is fundamentally the wrong approach, as extensions have never been a significant part of Linux, and are (by your own admission) an unreliable indicator of file content.

This isn't a case of Baloo "misinterpreting" anything. The link I posted indicates that the plaintext mimetype is arbitrarily rejected for indexing unless the extension is "txt" (and the size is less than 50K). When this was put in place (2 years ago) it was indicated as a temporary hack. Yet it still exists.

There is no indication that this was the intended behavior, when the comments in the code clearly label it as some sort of short-term hack.

Someone chose to keep all plaintext out of Baloo (a questionable decision at best). Rather than doing this with a blacklist/whitelist (exclude filters) to address problematic file types, all plaintext was summarily rejected unless the extension was txt.
If all plaintext is to be rejected, then the rational thing to do is to honor a whitelist (include filters) to override this rejection. (I believe that USED TO EXIST, but was removed in the rush to simplify the control set.)

If, on the other hand, only SOME plaintext files are problematic, those should be handled by the exclude filters.

Right now, logs could be handled by exclude filters. There is no longer a whitelist capability. But even the exclude filters are totally ignored for plaintext documents. So significant functionality has been lost, ostensibly just to avoid logs (which could have been avoided by the exclude filters).

Look in app.cpp: https://code.woboq.org/qt5/kf5/baloo/src/file/extractor/app.cpp.html
Look for the word HACK.

John, I am familiar with the code. The blacklist/whitelist filters are still there; just have a look at ~/.config/baloofilerc

Maybe I wasn't clear enough, but the misinterpretation part I was talking about is a separate thing which is somewhat related to this.

I know it was a temporary workaround and maybe it's time for it to go. I've been testing locally with it removed, and it seems to work fine on my system, but people have different configs and files on their systems. Let's try removing it and see how it goes for the next version.

Fixed in commit https://quickgit.kde.org/?p=baloo.git&a=commit&h=06efd6c05c15a64b53daac9e598666af584488ec. Not sure why the bug wasn't autoclosed. I'll inform someone from the bugsquad to close this manually.

Marking as fixed.

Finally filtered down to both Manjaro and openSUSE, and working very well. (I use baloo search to manage a large software code base, and it was sorely missed when it stopped indexing source code due to the txt issue.) Thanks for your fine work.

For those arriving here after searching for this problem, I have one minor thing to add: the indexing of previously excluded text files with an extension other than "txt" did not take place automatically.
I had to run "balooctl disable" followed by "balooctl enable", and now they are all indexed. Thanks again.
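
For readers trying to follow the argument above, here is a minimal Python sketch of the filtering behavior as it is described in this report, together with the include-filter override the reporter asked for. It is not Baloo's code (Baloo is written in C++), and everything in it is an assumption made for illustration: the function name `should_index_content`, the `include_globs` and `exclude_globs` parameters, and the 50K threshold, which is taken from the comment above and has not been checked against the source.

```python
import fnmatch
import os

# "less than 50K" as stated in the report; the real threshold is not verified here.
SIZE_LIMIT = 50 * 1024


def should_index_content(path, mimetype, size, include_globs=(), exclude_globs=()):
    """Hypothetical decision function: should this file's content be indexed?"""
    name = os.path.basename(path)

    # Exclude filters come first, so log files can be kept out explicitly.
    if any(fnmatch.fnmatchcase(name, g) for g in exclude_globs):
        return False

    # Proposed include filter (whitelist): overrides the plaintext rejection.
    if any(fnmatch.fnmatchcase(name, g) for g in include_globs):
        return True

    # The workaround described in the report: text/plain is only accepted
    # when the name ends in ".txt" and the file is under the size limit.
    if mimetype == "text/plain":
        return name.endswith(".txt") and size < SIZE_LIMIT

    # Any other mimetype is left to the normal extractors.
    return True


if __name__ == "__main__":
    print(should_index_content("/home/u/README", "text/plain", 1000))
    print(should_index_content("/home/u/README", "text/plain", 1000,
                               include_globs=("README",)))
```

With no filters, a plain-text `README` is rejected by the workaround; adding it to the hypothetical include list lets it through, while an exclude pattern such as `*.log` would still keep logs out.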