Using Baloo 5.17 (Arch/Manjaro repositories) as well as Baloo 4.14 (openSUSE repositories), Baloo file content indexing has suddenly lost the ability to index the content of text/plain files unless the extension is .txt.

I believe I have traced this to a so-called "temporary" patch to /src/file/extractor/app.cpp on 8 May 2014.

I rely on Baloo indexing a large volume of source code, which is plain text but cannot have a .txt extension, and suddenly this capability has been taken away for the flimsiest of reasons.

Reproducible: Always

Steps to Reproduce:
1. In a working Baloo environment, create a file called delete-me.txt
2. Write some words in the file, including the word uniqueword
3. Save the file to disk in your local directory
4. Copy delete-me.txt to delete-me.too
5. At the command line, type: baloosearch uniqueword

Actual Results:
Only the file delete-me.txt is listed by baloosearch.

Expected Results:
Both files should be listed by baloosearch.

File names continue to be indexed, but content is not. This renders the Dolphin content search useless for any work with files that are plain text without a ".txt" extension.

This was supposed to be a temporary patch (clearly labeled a HACK in the original commit).
SEE: http://webcache.googleusercontent.com/search?q=cache:LUTPrh1zmZ8J:r.git.net/kde-commits/2014-05/msg02993.html+&cd=4&hl=en&ct=clnk&gl=us
See also: http://bugs.kde.org/show_bug.cgi?id=332421

Test Environments:
1) Manjaro-kde (archlinux) 15.12, KDE Framework 5.17.0
2) Opensuse 13.2, KDE Platform Version 4.14.9

1)------------------------------
[jsa@ManjaroVM ~]$ baloosearch -v
Baloo 5.17.0

2)-----------------------------
jsa@poulsbo:~> baloosearch -v
Qt: 4.8.6
KDE Development Platform: 4.14.9
Baloo Search: 0.1
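For anyone who wants to script the steps to reproduce, here is a minimal sketch. The helper names (make_repro_files, search_index) are made up for illustration; the only external command used is the `baloosearch <word>` call from step 5. It assumes the target directory is one Baloo is configured to index and that the indexer has had time to pick the files up.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def make_repro_files(directory: Path, word: str = "uniqueword") -> Tuple[Path, Path]:
    """Steps 1-4: create delete-me.txt and an identical copy named delete-me.too."""
    directory.mkdir(parents=True, exist_ok=True)
    txt_file = directory / "delete-me.txt"
    txt_file.write_text(f"some words including {word}\n", encoding="utf-8")
    too_file = directory / "delete-me.too"
    shutil.copyfile(txt_file, too_file)
    return txt_file, too_file


def search_index(word: str = "uniqueword") -> Optional[str]:
    """Step 5: run `baloosearch <word>`; return its output, or None if it is not installed."""
    if shutil.which("baloosearch") is None:
        return None
    result = subprocess.run(
        ["baloosearch", word], capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    # A temporary directory is used only so the sketch runs anywhere.
    # For a real test, pass a folder that Baloo actually indexes.
    with tempfile.TemporaryDirectory() as tmp:
        txt_file, too_file = make_repro_files(Path(tmp))
        print(txt_file.name, too_file.name)
```

With the bug present, the output of search_index() would be expected to mention delete-me.txt but not delete-me.too.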
Error persists in Baloo 5.18 as well.
Persists in Baloo 5.19.0 as well. There should be a method to white-list extensions one purposely wants to content-index, perhaps stored in baloofilerc.
This is the intended behavior, for files having text/plain mimetype. This was done to avoid the mess caused by applications which have log files in directories that are indexed by baloo.

Though text files with a valid extension like .md .markdown should still be indexed because they have the mimetype: text/markdown but right now they are also not being indexed because baloo is somehow misinterpreting the mimetype. I'm looking into it.
(In reply to Pinak Ahuja from comment #3)
> This is the intended behavior, for files having text/plain mimetype. This
> was done to avoid the mess caused by applications which have log files in
> directories that are indexed by baloo.
>
> Though text files with a valid extension like .md .markdown should still be
> indexed because they have the mimetype: text/markdown but right now they are
> also not being indexed because baloo is somehow misinterpreting the
> mimetype. I'm looking into it.

But this is fundamentally the wrong approach, as extensions have never been a significant part of Linux, and are (by your own admission) an unreliable indicator of file content.

This isn't a case of Baloo "misinterpreting" anything. The link I posted indicates that the plaintext mimetype is arbitrarily rejected for indexing unless the extension is "txt" (and size less than 50K). When this was put in place (2 years ago) it was indicated as a temporary hack. Yet it still exists. There is no indication that this was the intended behavior, when the comments in the code clearly label it as some sort of short-term hack.

Someone chose to keep all plaintext out of Baloo (a questionable decision at best). Rather than using blacklist/whitelist (exclude filters) to address problematic file types, all plaintext was summarily rejected unless the extension was txt.

If all plaintext is to be rejected, then the rational thing to do is to honor a whitelist (include filters) to override this rejection. (I believe that USED TO EXIST, but was removed in the rush to simplify the control set.)

If, on the other hand, only SOME plaintext files are problematic, those should be handled by the exclude filters.

Right now, logs could be handled by exclude filters. There is no longer a whitelist capability. But even the exclude filters are totally ignored for plaintext documents.

So significant functionality has been lost, ostensibly just to avoid logs (which could have been avoided by the exclude filters).
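To make the two approaches concrete, here is a rough Python model of the behaviour described in this comment. It is not Baloo's actual code (that is C++ in app.cpp). Both function names are hypothetical, and the size limit mentioned above is left out because its exact value is not confirmed here.

```python
from fnmatch import fnmatch
from pathlib import PurePath
from typing import Iterable


def hack_allows_content(mimetype: str, filename: str) -> bool:
    """Model of the reported behaviour: text/plain content is skipped
    unless the file name ends in .txt. Other mimetypes pass through."""
    if mimetype == "text/plain":
        return PurePath(filename).suffix == ".txt"
    return True


def filters_allow_content(filename: str, exclude_filters: Iterable[str]) -> bool:
    """Model of the alternative argued for here: index everything except
    names that match a user-supplied exclude pattern (glob style)."""
    name = PurePath(filename).name
    return not any(fnmatch(name, pattern) for pattern in exclude_filters)


if __name__ == "__main__":
    for name in ("delete-me.txt", "delete-me.too", "app.log"):
        print(
            name,
            hack_allows_content("text/plain", name),
            filters_allow_content(name, ["*.log"]),
        )
```

In this model the extension rule drops delete-me.too along with app.log, while an exclude pattern such as "*.log" (an example pattern, not a Baloo default) drops only the log file.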
Look in app.cpp : https://code.woboq.org/qt5/kf5/baloo/src/file/extractor/app.cpp.html Look for the word HACK.
John, I am familiar with the code. The blacklist/whitelist filters are still there; just have a look at ~/.config/baloofilerc.

Maybe I wasn't clear enough, but the misinterpretation part I was talking about is a separate thing which is somewhat related to this.

I know it was a temporary workaround and maybe it's time for it to go. I've been testing locally with it removed, and it seems to work fine on my system, but people have different configs and files on their systems. Let's try removing it and see how it goes for the next version.
Fixed in commit https://quickgit.kde.org/?p=baloo.git&a=commit&h=06efd6c05c15a64b53daac9e598666af584488ec. Not sure why the bug wasn't autoclosed. I'll inform someone from the bugsquad to close this manually.
Marking as fixed.
The fix has finally filtered down to both Manjaro and openSUSE, and it is working very well. (I use Baloo search to manage a large software code base, and it was sorely missed when it stopped indexing source code due to the txt issue.) Thanks for your fine work.

For those arriving here after searching for this problem, I have one minor thing to add: the indexing of previously excluded text files with an extension other than "txt" did not take place automatically. I had to run "balooctl disable" followed by "balooctl enable", and now they are all indexed.

Thanks again.
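If you need to repeat that re-index step on several machines, it can be wrapped in a small script. This is only a sketch: the function name reindex and the dry_run parameter are made up for illustration, and the only commands used are the two balooctl calls quoted above. Nothing is executed unless dry_run is set to False and balooctl is found on the PATH.

```python
import shutil
import subprocess
from typing import List

# The two commands quoted in the comment above, in order.
REINDEX_COMMANDS: List[List[str]] = [
    ["balooctl", "disable"],
    ["balooctl", "enable"],
]


def reindex(dry_run: bool = True) -> List[List[str]]:
    """Return the command sequence; run it only when dry_run is False
    and balooctl is available on PATH."""
    if not dry_run and shutil.which("balooctl") is not None:
        for command in REINDEX_COMMANDS:
            subprocess.run(command, check=True)
    return REINDEX_COMMANDS


if __name__ == "__main__":
    # Dry run by default: only print what would be executed.
    for command in reindex():
        print(" ".join(command))
```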