Bug 388761

Summary: Baloo search returns same deleted backup file multiple times, can't clear it
Product: [Frameworks and Libraries] frameworks-baloo
Reporter: skierpage <skierpage>
Component: balooctl
Assignee: Pinak Ahuja <pinak.ahuja>
Status: RESOLVED DUPLICATE
Severity: normal
CC: nate
Priority: NOR    
Version: 5.41.0   
Target Milestone: ---   
Platform: Fedora RPMs   
OS: Linux   
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description skierpage 2018-01-10 01:35:37 UTC
I notice when I enter certain filenames in the Plasma desktop's Application Launcher "Click to search" field, I get a lot of duplicate results for vim backup files ending in '~'.

I can repeat this with baloosearch, for example:

% baloosearch "History of T460"
/media/Windows/Users/spage/Documents/computer_crap/History_of_T460_packages.txt
/media/Windows/Users/spage/Documents/computer_crap/History_of_T460_packages.txt~
... repeated 32 more times!
/media/Windows/Users/spage/Documents/computer_crap/History_of_T460_packages.txt~
/media/Windows/Users/spage/Documents/computer_crap/2016_T460_laptop.txt

(I apologize for my directory's bad language :-) .)
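
To put a number on the duplication, the search output can be piped through a small counter. This is only a sketch: `count_dupes` is a hypothetical helper, not part of Baloo, and it assumes `baloosearch` prints one result per line (any other line the tool prints is counted like a path).

```shell
# Hypothetical helper, not part of Baloo: read one result per line on
# stdin and print "<count><TAB><line>", most frequent line first.
count_dupes() {
    awk '{ seen[$0]++ } END { for (p in seen) print seen[p] "\t" p }' | sort -rn
}

# Intended use (query taken from the report above):
#   baloosearch "History of T460" | count_dupes
```

A line with a count above 1 is a path that the search returned more than once.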

1. This file is on a Windows NTFS drive, but I usually edit it from Linux. It was indexed because I added /media/Windows/Users/spage/Documents/ to ~/.config/baloofilerc.
2. This file does not exist any more (I must have disabled vim creating a '~' backup).
3. Baloo these days excludes files ending in '~'; my ~/.config/baloofilerc contains exclude filters=... ,*~ ...
4. Maybe Baloo doesn't notice when a file is deleted, especially when it excludes it, so I tried to manually remove it from the index with
   % balooctl clear '/media/Windows/Users/spage/Documents/computer_crap/History_of_T460_packages.txt~'
which prints
   Could not stat file: /media/Windows/Users/spage/Documents/computer_crap/History_of_T460_packages.txt~
   File(s) cleared

But there's no change to Baloo's search behavior: it still returns the same backup file 34 times in search results for terms in that file.

So I changed ~/.config/baloofilerc to allow indexing of files ending in ~, killed Baloo (including the undocumented /usr/libexec/baloorunner process), restarted Baloo, and retried
  % balooctl clear '/media/Windows/Users/spage/My Documents/computer_crap/History_of_T460_packages.txt~'

but despite saying "File(s) cleared", ... it's still in search results 34 times.
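
The mismatch between "File(s) cleared" and the search results can be checked mechanically. The sketch below is an illustration only: `still_indexed` is a hypothetical helper, not a Baloo command, and it again assumes one result path per line on stdin.

```shell
# Hypothetical helper, not part of Baloo: given a path as $1 and search
# results on stdin, report how many result lines are exactly that path
# and whether the path currently exists on disk.
still_indexed() {
    path=$1
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    hits=$(grep -cFx -- "$path" || true)
    if [ -e "$path" ]; then state="exists"; else state="missing"; fi
    printf '%s hits, file %s\n' "$hits" "$state"
}

# Intended use after a clear (path and query taken from the report):
#   baloosearch "History of T460" | still_indexed \
#     '/media/Windows/Users/spage/Documents/computer_crap/History_of_T460_packages.txt~'
```

Any result other than "0 hits" after a clear means stale entries for that path are still being returned.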

So I recreated the file containing just some dummy terms "INDEX THIS FILE baloo blorf". `baloosearch` does *not* find the new term "blorf" in this file, but terms from the old file contents still match the file 34 times.

If I use the undocumented command `balooshow -x /media/Windows/Users/spage/Documents/computer_crap/History_of_T460_packages.txt\~` it says "No index information found" after I clear the file, but gives me information about the file when I index it such as "File Name Terms: Fhistory Fof Fpackages Ft460 Ftxt history of packages t460 txt." However, I notice balooshow doesn't include Line Count or a list of indexed Terms.

So, after over an hour fiddling with this, there seem to be at least two bugs.
1. `balooctl clear foo.txt~`
does not in fact clear search term information for the file if Baloo no longer considers this a text file it should index.

2. `balooctl index foo.txt~`
prints misleading "File(s) indexed" even when the file is excluded from indexing, or is not considered a text file

I suspect the only way to get rid of these bogus multiple matches to one file in search results is to yet again give up, delete my 32,801-file, 1.96 GB Baloo index, and rebuild it from scratch. I'm running `balooctl checkDb`; it has spent 15 minutes at "DocumentTermsDB check .." with one CPU core pegged.

I realize file indexing is hard and I appreciate baloo and its predecessor nepomuk when it works, but please improve baloo's software engineering.
* Document every utility.
* Make sure commands like "clear" and "index" accurately report what they're doing. They need to print things like "File metadata indexed, but file contents ignored due to <xyz>", "File excluded from indexing", "File exists but is not present in index", etc.
Comment 1 Nate Graham 2018-01-12 21:02:21 UTC

*** This bug has been marked as a duplicate of bug 353874 ***