Bug 438382 - after baloo indexes some files more than once you can't clean this up
Summary: after baloo indexes some files more than once you can't clean this up
Status: CONFIRMED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: general
Version: 5.82.0
Platform: Fedora RPMs Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Stefan Brüns
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-06-10 03:33 UTC by skierpage
Modified: 2023-01-01 11:47 UTC
CC List: 5 users

See Also:
Latest Commit:
Version Fixed In:


Attachments

Description skierpage 2021-06-10 03:33:56 UTC
SUMMARY
Some baloo searches return the same file numerous times. Once this happens it seems impossible to clean up.

STEPS TO REPRODUCE
1. Run `balooctl monitor`
2. Find a file that's in baloo's index multiple times. I found some at random, then searched for a common term ("the") and sorted to find files indexed multiple times: in a terminal, enter `baloosearch the | sort | uniq -c | sort -nr | head -10`
3. Clear the file from Baloo with `balooctl clear /path/to/file`
4. Repeat the baloosearch
5. Make a backup of the file somewhere not indexed (e.g. /tmp) and delete the file on-disk with `rm`
6. Repeat the baloosearch
7. Copy the backup back to the file location.
8. Repeat the baloosearch

OBSERVED RESULT
Two files that I edit a lot in vim appear in baloosearch results 6 and 7 times respectively. I also found a few other text files indexed twice, plus a .xlsx spreadsheet that appears twice.
Running `balooctl clear /path/to/file` either does nothing or seems to remove one instance of the file in baloosearch results. Baloo doesn't realize the file is in its DB multiple times.
Deleting the file does not remove any results from baloosearch, and `balooctl monitor` doesn't output anything.
Restoring the deleted file (I copied the backup back to it) adds another copy of it to Baloo's index.

You can't run `balooctl clear /path/to/file.txt` if the file doesn't exist.

EXPECTED RESULT
Baloo should never return the same file multiple times.
Deleting a file on-disk should clear it from Baloo's index.
`balooctl clear` should remove every entry for the file in Baloo's index.
Maybe `balooctl clear` should work even if the file does not exist on-disk.


SOFTWARE/OS VERSIONS

Linux/KDE Plasma: 
(available in About System)
KDE Plasma Version: 5.21.5
KDE Frameworks Version: 5.82.0
Qt Version: 5.15.2 on Wayland

ADDITIONAL INFORMATION
The files that appear in Baloo's index multiple times are all on a mounted NTFS volume that I told Baloo to index, but the behavior that deleting a file doesn't remove it from Baloo's index happens on an ext4 volume as well.

Once a file appears in Baloo search results more than once, I can make it appear N+1 times by deleting it on-disk and copying a backup; but this doesn't work if the file only appears once.

The files that appear in Baloo index multiple times for common words appear fewer times for other words that I added to them more recently.

https://community.kde.org/Baloo mentions a `balooctl checkDb` command that seems useful (and then cautions against running it), but balooctl no longer offers this subcommand.

I didn't try rebuilding baloo's index. Despite these glitches, baloo has been working well for me 👍❤️
Comment 1 tagwerk19 2021-06-10 09:59:24 UTC
(In reply to skierpage from comment #0)
> Some baloo searches return the same file numerous times.
The filesystem you are using is critical here, have a look at:
   https://bugs.kde.org/show_bug.cgi?id=402154#c12
Try checking the file with "stat" to see whether the device number / inode is "stable" across reboots.

Baloo relies on the device number / inode internally, if a file appears with a different ID, it's treated as a different file.
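
A minimal Python sketch of the check suggested above, using only the standard library: it reads the device number and inode that `stat` reports and compares them with a previously recorded pair. The helper names `stat_identity` and `identity_changed` are hypothetical, and this is not Baloo's own code.

```python
import os
import tempfile


def stat_identity(path):
    """Return the (device number, inode) pair the OS reports for path."""
    st = os.stat(path)
    return st.st_dev, st.st_ino


def identity_changed(recorded, path):
    """True if path no longer has the recorded (device, inode) pair."""
    return stat_identity(path) != tuple(recorded)


# Demo on a throwaway file. For the real check, record the pair for an
# indexed file, reboot or remount, and then compare again.
with tempfile.NamedTemporaryFile() as tmp:
    recorded = stat_identity(tmp.name)
    print("device=%d inode=%d" % recorded)
    print("changed:", identity_changed(recorded, tmp.name))
```

If the pair differs after a reboot or remount for a file that was not touched, that would be consistent with the unstable-ID explanation.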

> ... make it appear N+1 times by deleting it on-disk and copying a backup
Maybe NTFS mounts don't have stable IDs. That would be an extra indication...

> Once this happens it seems impossible to clean up.
Yes, I have also noticed this. It seems that a file has to be "there" on disc for "balooctl clear" to work and also (maybe?) the stored device number / inode also has to match. It's worth following up.

> ... the behavior that deleting a file doesn't remove it from Baloo's index
> happens on an ext4 volume as well...
There are certainly times it doesn't, but I think mostly it does. It seems to be an issue "referred to" under different bugs; Bug 353874, Bug 429006 and Bug 437754 might be of interest.
Comment 2 skierpage 2021-06-13 00:33:56 UTC
(In reply to tagwerk19 from comment #1)
Thanks for responding 🤗.
> (In reply to skierpage from comment #0)
> > Some baloo searches return the same file numerous times.
> ...
> Baloo relies on the device number / inode internally, if a file appears with
> a different ID, it's treated as a different file.
Ding-ding, that's it. `baloosearch --id term` shows different IDs for the same path, e.g.
 % baloosearch --id FuelCellWorks
500d900000803 /media/Windows/Users/spage/Documents/ECO.txt
5be2000000803 /media/Windows/Users/spage/Documents/ECO.txt
8546e00000803 /media/Windows/Users/spage/Documents/ECO.txt
...
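
A rough Python sketch for spotting such duplicates in bulk: it groups the lines printed by `baloosearch --id` by path and splits each ID into two halves. The split (inode in the upper 32 bits, device number in the lower 32 bits) is only an assumption inferred from the three IDs above, which all end in `00000803`; it is not taken from Baloo's source. The function names are hypothetical.

```python
import string
from collections import defaultdict


def group_ids_by_path(lines):
    """Map each path to the list of hex IDs printed in front of it."""
    ids = defaultdict(list)
    for line in lines:
        parts = line.strip().split(" ", 1)
        if len(parts) != 2:
            continue  # blank lines, "..." and similar
        file_id, path = parts
        if all(ch in string.hexdigits for ch in file_id):
            ids[path].append(file_id)
    return ids


def split_id(file_id):
    """ASSUMPTION: upper 32 bits are the inode, lower 32 bits the device."""
    value = int(file_id, 16)
    return value >> 32, value & 0xFFFFFFFF


# Sample input copied from the output above; in practice the lines would
# come from running `baloosearch --id <term>`.
sample = """\
500d900000803 /media/Windows/Users/spage/Documents/ECO.txt
5be2000000803 /media/Windows/Users/spage/Documents/ECO.txt
8546e00000803 /media/Windows/Users/spage/Documents/ECO.txt
""".splitlines()

duplicates = {p: i for p, i in group_ids_by_path(sample).items() if len(i) > 1}
for path, file_ids in duplicates.items():
    print(path)
    for fid in file_ids:
        inode, device = split_id(fid)
        print(f"  id={fid} inode={inode:#x} device={device:#x}")
```

Under that assumption, the three entries share one device value and differ only in the inode half, which would fit the "different ID, different file" explanation.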

> > Once this happens it seems impossible to clean up.
> Yes, I have also noticed this. It seems that a file has to be "there" on
> disc for "balooctl clear" to work and also (maybe?) the stored device number
> / inode also has to match. It's worth following up.

I filed bug 438527, and may have spotted a logic error in `balooctl clear` 😻.
I filed enhancement bug 438528 to add a `balooctl remove [ID...]` subcommand.
Comment 3 tagwerk19 2021-06-13 07:06:05 UTC
(In reply to skierpage from comment #2)
> Thanks for responding 🤗.
Thank you for taking the trouble to troubleshoot :-)

Could be that it'll take some time to clear baloo bugs. I just do checking and sorting. I don't think you'd be treading on toes if you submitted a patch...

Anyway, I'll flag as Confirmed...
Comment 4 tagwerk19 2021-06-13 07:20:48 UTC
(In reply to skierpage from comment #0)
> ... all on a mounted NTFS volume ...

> ... Restoring the deleted file (I copied the backup back to it) adds
> another copy of it to Baloo's index ...
Without running tests on an NTFS disc, that sounds like you get a new inode with every copy.

That is going to give baloo trouble...
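
A small Python sketch of how that guess could be tested without involving Baloo: it repeats the delete-and-restore steps from the description and reports the device number and inode before and after. The function name `identity_across_restore` is hypothetical, and the demo runs in temporary directories; to test NTFS, the target would need to be a file on that mount. Whether the inode changes depends on the filesystem and driver, so no particular result is claimed here.

```python
import os
import shutil
import tempfile


def identity_across_restore(path, backup_dir):
    """Back up path, delete it, copy the backup back.

    Returns the (device, inode) pairs seen before and after the restore.
    """
    before = os.stat(path)
    backup = os.path.join(backup_dir, os.path.basename(path))
    shutil.copy2(path, backup)   # step 5: back up outside the indexed tree
    os.remove(path)              # step 5: delete the file on disk
    shutil.copy2(backup, path)   # step 7: copy the backup back
    after = os.stat(path)
    return (before.st_dev, before.st_ino), (after.st_dev, after.st_ino)


# Demo on a scratch file in temporary directories.
with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as backups:
    target = os.path.join(work, "ECO.txt")
    with open(target, "w") as fh:
        fh.write("FuelCellWorks\n")
    before, after = identity_across_restore(target, backups)
    print("before:", before)
    print("after: ", after)
    print("inode changed:", before[1] != after[1])
```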