Bug 358098 - Baloo fails to index plain text files unless extension is .txt
Status: RESOLVED FIXED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: general
Version: 5.19.0
Platform: Other Linux
Importance: NOR grave
Target Milestone: ---
Assignee: Vishesh Handa
Reported: 2016-01-16 21:55 UTC by John Andersen
Modified: 2016-04-30 21:07 UTC

Description John Andersen 2016-01-16 21:55:26 UTC
Using Baloo 5.17 (Arch/Manjaro repositories) as well as Baloo 4.14 (openSUSE repositories), Baloo file content indexing has suddenly lost the ability to index the content of plain text (text/plain) files unless the extension is .txt.

I believe I have traced this to a so-called "temporary" patch to /src/file/extractor/app.cpp made on 8 May 2014.

I rely on baloo indexing a large volume of source code, which cannot have extensions of .txt but which are plain text, and suddenly this capability has been taken away for the flimsiest of reasons. 


Reproducible: Always

Steps to Reproduce:
1. In a working Baloo environment, create a file called delete-me.txt
2. Write some words in the file, including the word uniqueword
3. Save the file to disk in your local directory
4. Copy delete-me.txt to delete-me.too
5. At the command line, type: baloosearch uniqueword


Actual Results:  
Only the file delete-me.txt will be listed by baloosearch

Expected Results:  
Both files should be listed by baloosearch
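
The reproduction steps above can be sketched as a small Python script. This is only an illustration: the helper names (create_test_files, run_baloosearch, missing_from_results) are hypothetical, the only Baloo command used is the "baloosearch uniqueword" call from step 5, and it assumes the target directory is one that Baloo indexes and that the indexer has had time to pick up the new files.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def create_test_files(directory: Path, word: str = "uniqueword") -> List[Path]:
    """Steps 1-4: write delete-me.txt, then copy it to delete-me.too."""
    original = directory / "delete-me.txt"
    original.write_text(f"some words including {word}\n", encoding="utf-8")
    copy = directory / "delete-me.too"
    shutil.copyfile(original, copy)
    return [original, copy]


def run_baloosearch(word: str) -> Optional[str]:
    """Step 5: run `baloosearch <word>`; return None if it is not installed."""
    if shutil.which("baloosearch") is None:
        return None
    result = subprocess.run(
        ["baloosearch", word], capture_output=True, text=True, check=False
    )
    return result.stdout


def missing_from_results(files: List[Path], output: str) -> List[Path]:
    """Return the files whose names do not appear in the search output."""
    return [f for f in files if f.name not in output]
```

A possible use: call create_test_files(Path.home()), give the indexer some time, then pass the output of run_baloosearch("uniqueword") to missing_from_results. With the behavior reported here, delete-me.too would be the file reported as missing.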

File names continue to be indexed, but content is not.
This renders the Dolphin Content search useless for any work with files that are plain text without a ".txt" extension.    
This was supposed to be a temporary patch (clearly labeled a HACK in the original commit)
SEE: http://webcache.googleusercontent.com/search?q=cache:LUTPrh1zmZ8J:r.git.net/kde-commits/2014-05/msg02993.html+&cd=4&hl=en&ct=clnk&gl=us

See also http://bugs.kde.org/show_bug.cgi?id=332421

Test Environments:  
1) Manjaro-kde (archlinux) 15.12, KDE Frameworks 5.17.0
2) openSUSE 13.2, KDE Platform Version 4.14.9

1)------------------------------
[jsa@ManjaroVM ~]$ baloosearch -v
Baloo 5.17.0

2)-----------------------------
jsa@poulsbo:~> baloosearch -v
Qt: 4.8.6
KDE Development Platform: 4.14.9
Baloo Search: 0.1
Comment 1 John Andersen 2016-01-19 19:48:03 UTC
Error persists in Baloo 5.18 as well.
Comment 2 John Andersen 2016-02-21 23:10:56 UTC
Persists in Baloo 5.19.0 as well.

There should be a method to white-list extensions one purposely wants to content-index, perhaps stored in baloofilerc.
Comment 3 Pinak Ahuja 2016-03-12 20:23:28 UTC
This is the intended behavior, for files having text/plain mimetype. This was done to avoid the mess caused by applications which have log files in directories that are indexed by baloo.

Though text files with a valid extension like .md .markdown should still be indexed because they have the mimetype: text/markdown but right now they are also not being indexed because baloo is somehow misinterpreting the mimetype. I'm looking into it.
Comment 4 John Andersen 2016-03-12 22:04:06 UTC
(In reply to Pinak Ahuja from comment #3)
> This is the intended behavior, for files having text/plain mimetype. This
> was done to avoid the mess caused by applications which have log files in
> directories that are indexed by baloo.
> 
> Though text files with a valid extension like .md .markdown should still be
> indexed because they have the mimetype: text/markdown but right now they are
> also not being indexed because baloo is somehow misinterpreting the
> mimetype. I'm looking into it.

But this is fundamentally the wrong approach, as extensions have never been a significant part of Linux, and are (by your own admission) an unreliable indicator of file content.

This isn't a case of Baloo "misinterpreting" anything. The link I posted indicates that the text/plain mimetype is arbitrarily rejected for indexing unless the extension is "txt" (and the size is less than 50K).
When this was put in place (2 years ago) it was indicated as a temporary hack.  Yet it still exists.  There is no indication that this was the intended behavior, when the comments in the code clearly label it as some sort of short term hack.
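
To make the rule concrete, here is a minimal Python sketch of the check as described in this comment. It is not a translation of app.cpp: the function name content_is_indexed, the constant names, and the reading of "50K" as 50 * 1024 bytes are all assumptions made for illustration.

```python
PLAIN_TEXT = "text/plain"
# Assumption: "50K" is read as 50 * 1024 bytes; the real limit is not confirmed here.
MAX_PLAIN_TEXT_BYTES = 50 * 1024


def content_is_indexed(filename: str, mimetype: str, size: int) -> bool:
    """Model the described hack: text/plain is skipped unless it is a small .txt."""
    if mimetype != PLAIN_TEXT:
        # Other mimetypes are not affected by this rule (not modelled further).
        return True
    return filename.endswith(".txt") and size < MAX_PLAIN_TEXT_BYTES
```

Under this model, delete-me.txt at a few hundred bytes passes the check, while an identical delete-me.too is rejected purely because of its name.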

Someone chose to keep all plaintext out of Baloo (a questionable decision at best). Rather than using a blacklist/whitelist (exclude filters) to address problematic file types, all plaintext was summarily rejected unless the extension was txt.

If all plaintext is to be rejected then the rational thing to do is to honor a whitelist (include filters) to override this rejection.  (I believe that USED TO EXIST, but was removed in the rush to simplify the control set).

If, on the other hand, only SOME plaintext files are problematic, those should be handled by the exclude filters.

Right now, logs could be handled by exclude filters.
There is no longer a whitelist capability.
But even the exclude filters are totally ignored for plaintext documents.

So significant functionality has been lost ostensibly just to avoid logs (which could have been avoided by the exclude filters).  

Look in app.cpp  :  https://code.woboq.org/qt5/kf5/baloo/src/file/extractor/app.cpp.html
Look for the word HACK.
Comment 5 Pinak Ahuja 2016-03-13 09:08:17 UTC
John, I am familiar with the code. The blacklist/whitelist filters are still there; just have a look at ~/.config/baloofilerc

Maybe I wasn't clear enough, but the misinterpretation part I was talking about is a separate thing which is somewhat related to this.

I know it was a temporary workaround and maybe it's time for it to go. I've been testing locally with it removed, and it seems to work fine on my system, but people have different configs and files on their systems. Let's try removing it and see how it goes for the next version.
Comment 6 Boudhayan Gupta 2016-03-14 07:55:34 UTC
Fixed in commit https://quickgit.kde.org/?p=baloo.git&a=commit&h=06efd6c05c15a64b53daac9e598666af584488ec. Not sure why the bug wasn't autoclosed.

I'll inform someone from the bugsquad to close this manually.
Comment 7 Bhushan Shah 2016-03-14 08:02:52 UTC
Marking as fixed.
Comment 8 John Andersen 2016-04-30 21:07:43 UTC
The fix has finally filtered down to both Manjaro and openSUSE, and it is working very well.
(I use baloo search to manage a large software code base, and it was sorely missed when it stopped indexing source code due to the txt issue.)

Thanks for your fine work.

For those arriving here after searching for this problem, I have one minor thing to add: the indexing of previously excluded text files with an extension other than "txt" did not take place automatically.

I had to do: "balooctl disable" followed by "balooctl enable" and now they are all indexed.  
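
For anyone who wants to script that workaround, a minimal Python sketch follows. The function name reindex and its dry_run parameter are hypothetical; the only commands it runs are the two quoted above, and it makes no claim about how long the re-indexing takes afterwards.

```python
import shutil
import subprocess
from typing import List

# The two commands from the workaround above, in order.
REINDEX_COMMANDS: List[List[str]] = [
    ["balooctl", "disable"],
    ["balooctl", "enable"],
]


def reindex(dry_run: bool = False) -> List[List[str]]:
    """Run the disable/enable pair; with dry_run, only return the commands."""
    if not dry_run and shutil.which("balooctl") is not None:
        for command in REINDEX_COMMANDS:
            # check=True stops after a failed "disable" instead of re-enabling blindly.
            subprocess.run(command, check=True)
    return [list(command) for command in REINDEX_COMMANDS]
```

Calling reindex(dry_run=True) just returns the command list, which is handy for checking what would be run before touching the index.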

Thanks again.