Bug 353874 - Baloo does not remove deleted files from index
Summary: Baloo does not remove deleted files from index
Status: ASSIGNED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: general
Version: unspecified
Platform: Kubuntu Linux
Importance: HI normal
Target Milestone: ---
Assignee: Pinak Ahuja
Duplicates: 374736 377302 388761 429006 457746
Reported: 2015-10-13 19:53 UTC by Ongun Kanat
Modified: 2024-04-14 02:03 UTC
CC List: 29 users

Description Ongun Kanat 2015-10-13 19:53:45 UTC
Baloo does not refresh its index when a file is deleted. Random old results are shown in krunner.


Reproducible: Always

Steps to Reproduce:
- Search Baloo index via baloosearch or krunner
- It shows some deleted files
- Click on any result: an error message says the file cannot be found. Alternatively, try to access the file with any command ('ls', 'cat', etc.): it returns "No such file or directory"

Expected Results:
Deleted files should be removed from the baloo index.


Shell output
$ baloosearch ca | grep bundle
/home/ongun/Code/Web/12Deli/wp-includes/certificates/ca-bundle.crt
$ ls /home/ongun/Code/Web/12Deli/wp-includes/certificates/ca-bundle.crt
ls: cannot access /home/ongun/Code/Web/12Deli/wp-includes/certificates/ca-bundle.crt: No such file or directory
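
One way to list such stale entries in bulk is to run the search output through a small filter. This is a minimal sketch, assuming baloosearch prints one absolute path per line as in the output above; the name `stale_paths` is made up for illustration and is not part of Baloo.

```python
import os


def stale_paths(lines):
    """Return the paths from search output that no longer exist on disk."""
    stale = []
    for line in lines:
        path = line.rstrip("\n")
        # Assumption: result lines are absolute paths. Anything else
        # (blank lines, status lines) is skipped rather than interpreted.
        if not path.startswith("/"):
            continue
        # lexists() also treats a dangling symlink as "still there".
        if not os.path.lexists(path):
            stale.append(path)
    return stale
```

Feeding it the captured output of `baloosearch ca`, split into lines, would return only the entries that point at files that are gone.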
Comment 1 Nate Graham 2017-10-27 20:06:50 UTC
*** Bug 370429 has been marked as a duplicate of this bug. ***
Comment 2 Nate Graham 2017-10-27 20:07:02 UTC
*** Bug 373430 has been marked as a duplicate of this bug. ***
Comment 3 Nate Graham 2017-10-27 20:07:43 UTC
*** Bug 362226 has been marked as a duplicate of this bug. ***
Comment 4 Nate Graham 2017-10-27 20:14:51 UTC
*** Bug 377302 has been marked as a duplicate of this bug. ***
Comment 5 Nate Graham 2017-10-27 20:16:52 UTC
*** Bug 374736 has been marked as a duplicate of this bug. ***
Comment 6 Lukas Ba. 2017-11-01 19:31:38 UTC
Re-posting my comments from a duplicate bug here.

It is not possible to clear deleted files from the db, baloo returns the error:
"Could not stat file"

This is because of line 243 in main.cpp, where non-existing files are skipped.

We should be happy that we found a wrong record referring to a non-existing file in our db of files, and remove the wrong file record instead.

https://github.com/KDE/baloo/compare/master...vitamins:patch-1

Hmm it's not that simple, since the next check also fails if id is 0, and we seem to need the id to remove the record, but it is 0 for non-existing files.

tr.removeDocument(id)

Can we remove using only the url instead?

Clearing an existing file which is in an indexed path is also problematic, since it will get added back later on automatically, reverting the clear action.
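
The idea of clearing by URL can be illustrated with a toy model. This is only a sketch under stated assumptions, not Baloo's code: the index is modeled as a plain dict from stored path to document id, and `clear_missing` is a hypothetical helper. The point is that the id is looked up from the stored path instead of being derived from a stat() call that fails for a deleted file.

```python
import os


def clear_missing(index, path):
    """Drop the record for `path` from a toy index if the file is gone.

    `index` is a dict mapping stored path -> document id (a stand-in for
    the real database). Returns the removed id, or None if nothing changed.
    """
    if os.path.lexists(path):
        # The file still exists: leave it to the regular indexing path,
        # since clearing it would be undone by the next indexing run.
        return None
    # The file is gone, so stat() cannot supply an id; use the stored one.
    return index.pop(path, None)
```

Whether the real database can resolve a path to its id without a stat() call is exactly the open question above; the sketch simply assumes it can.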
Comment 7 Nate Graham 2018-01-12 21:02:21 UTC
*** Bug 388761 has been marked as a duplicate of this bug. ***
Comment 8 Kishore 2018-09-02 09:30:53 UTC
The only way to overcome this currently seems to be to reindex.

balooctl disable
balooctl enable
Comment 9 d0048 2019-01-01 03:41:22 UTC
Sometimes running `balooctl check` resolves the random files that appear; sometimes it doesn't.
Comment 10 Igor Poboiko 2019-02-18 07:35:12 UTC
This should have been fixed by https://phabricator.kde.org/D15939 and commit https://phabricator.kde.org/R293:f8897a2511c4652c203bf25f6d788d0a698e4203

Feel free to reopen if this bug still affects you.
Comment 11 leftcrane 2021-05-08 01:15:11 UTC
Not fixed on 5.21.4

Moved a whole bunch of files to an external drive several days ago. Multiple reboots AND baloo enable/disable cycles later, the files are still showing up in Krunner.

I went ahead and purged the database with "balooctl purge" and ... the nonexistent files are still being helpfully found by krunner/kickoff.
Comment 12 Nate Graham 2021-05-08 01:20:45 UTC
Baloo is a framework BTW, not a part of Plasma (5.21.4 is a Plasma version).

If the index is removed entirely yet deleted files are still found in a search, then the fault is elsewhere, in whatever is caching the old content.
Comment 13 leftcrane 2021-05-08 01:29:05 UTC
Well, the purge worked, after logout. So krunner/kickoff's only fault is - possibly - that they don't update results until you log out.

The bug is with baloo. Try it on your system, you should get a similar result.
Comment 14 leftcrane 2021-05-08 01:32:07 UTC
I should have checked the files directly from balooctl of course, but in all likelihood this is a baloo bug, given that purging the database worked.
Comment 15 tagwerk19 2021-06-05 21:16:44 UTC
(In reply to leftcrane from comment #11)
> Moved a whole bunch of files to an external drive several days ago. Multiple
> reboots AND baloo enable/disable cycles later, the files are still showing
> up in Krunner.
If "a whole bunch" is more than:
    sysctl fs.inotify.max_queued_events
baloo might not have seen the delete notifications.

It also seems that if deletions are not finished when baloo is closed down (on a logout, say), the entries end up stuck in the index.

Some extra info:
    https://bugs.kde.org/show_bug.cgi?id=437754#c1
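
To make that comparison concrete, here is a rough sketch. It reads the limit from /proc/sys/fs/inotify/max_queued_events (the same value the sysctl command reports) and compares it with the number of entries about to be moved or deleted. The helper names are made up, and treating one entry as at least one event is a simplifying assumption.

```python
import os


def inotify_limit(name):
    """Read an fs.inotify.* limit from /proc; None if it is unavailable."""
    try:
        with open(os.path.join("/proc/sys/fs/inotify", name)) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def count_entries(root):
    """Count files and directories below `root` (symlinks not followed)."""
    return sum(len(dirs) + len(files) for _, dirs, files in os.walk(root))


def may_overflow(n_entries, limit):
    """True if a burst of n_entries events could exceed the queue limit."""
    return limit is not None and n_entries > limit
```

For a folder that is about to be moved, `may_overflow(count_entries(folder), inotify_limit("max_queued_events"))` gives a first hint; a False result does not rule out missed notifications for other reasons.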
Comment 16 leftcrane 2021-06-07 17:30:56 UTC
No, it's definitely less than that.

BTW, I saw two recent reports from reddit of the same problem.
Comment 17 trmdi 2021-06-09 01:52:43 UTC
When I download a file, baloosearch can't detect it until I manually index it with balooctl index. Is it a related bug, or does it need a delay before indexing new files?
Comment 18 tagwerk19 2021-06-09 14:12:41 UTC
(In reply to trmdi from comment #17)
> When I download a file, baloosearch can't detect it until I manually index
> it with balooctl index. Is it a related bug, or does it need a delay
> before indexing new files?
No, I don't think there's a delay.

Baloo should "pick up" a new or changed file immediately and put it in the queue for "full text" indexing. If you do a "balooctl index ..." you are telling baloo to do it there and then.

If you are running on battery then I think baloo waits with the full text index until the machine's plugged back in. I think that it also "backs off" on indexing if it sees that the system is heavily used.

If baloo is not "immediately" noticing a new file then there's a bit of troubleshooting to do. If it doesn't notice a change but it is found with a "balooctl check", then it's worth looking at the inotify settings (particularly if you are using Neon and have loads of folders). See what
    sysctl fs.inotify.max_user_watches
says, if this is smaller than the number of folders you have then baloo won't see changes as they happen.

Note that "balooctl index ..." does a one-off indexing of the file, irrespective of whether it's in a folder you want indexed or not. External discs are, for example, not automatically indexed but a "balooctl index ..." would index a file on them.

There are a lot of bases to cover here. If the above doesn't help, maybe open a new bug and include all the details.
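
The folder-count comparison mentioned above can be sketched as follows. It assumes one inotify watch per directory and ignores watches used by other applications of the same user, so treat the result as a lower bound; the function names are illustrative only.

```python
import os


def count_dirs(root):
    """Count `root` and every directory below it (symlinks not followed)."""
    # os.walk yields exactly one tuple per directory it visits.
    return sum(1 for _ in os.walk(root))


def max_user_watches():
    """Read fs.inotify.max_user_watches from /proc; None if unavailable."""
    try:
        with open("/proc/sys/fs/inotify/max_user_watches") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def watches_short_by(n_dirs, limit):
    """How many directories would be left without a watch (0 if none)."""
    return max(0, n_dirs - limit)
```

Calling `watches_short_by(count_dirs(os.path.expanduser("~")), limit)` with the value returned by `max_user_watches()` shows how far the indexed tree exceeds the limit, if at all.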
Comment 19 Dashon 2021-09-29 05:25:40 UTC
Still happening on Arch with KDE Plasma 5.22.5.
The only solution is to purge the index, reindex, then log out and back in.
Comment 20 Bug Janitor Service 2023-03-01 02:08:20 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/113
Comment 21 tagwerk19 2023-03-03 07:45:41 UTC
(In reply to Bug Janitor Service from comment #20)
> A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/113
From the MR:
> ... After I applied this patch, killed baloo_file, deleted an indexed file, and started baloo_file again,
> the deleted file didn't appear anymore in the balooseach results. That didn't happen with the
> unpatched baloo, the deleted file was still there and trying to open it with KRunner did nothing ...
So the test sequence is:

    Create a test file
    Check that it is indexed (including file content)
    Kill baloo
    Delete the file
    Restart baloo
    and check whether the file is still in the index.

Yes, the file is still in the index, as per baloosearch.

This is slightly more specific than comment 0 but consistent with what I've seen after deleting a large folder and not waiting until baloo has cleared all its entries, or when there are too many deletes and the notifications "overflow", as mentioned in https://bugs.kde.org/show_bug.cgi?id=437754#c1

It might be worth mentioning a couple of the wraiths in the mist...

    Where the device number has changed and baloo reindexed the files, deleting the test file even
    when baloo_file is running will not result in the earlier entry being removed. This is in the cases
    with BTRFS and multiple subvols, such as with openSUSE, where "baloosearch -i searchstring"
    shows several hits with different DocIDs, see https://bugs.kde.org/show_bug.cgi?id=402154#c12

    There's also the possibility that krunner caches the data from baloo and presents remembered results...

Revisited with Neon Unstable:
    Plasma: 5.27.80
    Frameworks: 5.104.0
    Qt: 5.15.8
    Kernel: 5.19.0-35-generic (64-bit)
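
The first and last steps of the test sequence listed earlier in this comment lend themselves to a small helper; killing and restarting baloo_file is left as a manual step here. A sketch only: `make_probe` and `is_listed` are hypothetical names, and it assumes baloosearch prints matching paths one per line.

```python
import os
import uuid


def make_probe(directory):
    """Create a small text file whose name and content carry a unique token."""
    # A short random token keeps the later search unambiguous.
    token = "probe" + uuid.uuid4().hex[:8]
    path = os.path.join(directory, token + ".txt")
    with open(path, "w") as f:
        f.write(token + "\n")
    return path, token


def is_listed(path, search_output):
    """True if `path` appears as a full line in the given search output."""
    return path in (line.strip() for line in search_output.splitlines())
```

Create the probe in an indexed folder, wait until searching for the token lists it, go through the kill, delete and restart steps by hand, then pass the fresh search output to `is_listed` to see whether the entry survived.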
Comment 22 Méven 2023-05-08 09:26:27 UTC
Baloo should be able to fix this using fanotify https://man7.org/linux/man-pages/man7/fanotify.7.html for any user with linux 5.1+.
Comment 23 tagwerk19 2023-05-08 10:33:46 UTC
(In reply to Méven from comment #22)
> Baloo should be able to fix this using fanotify
> https://man7.org/linux/man-pages/man7/fanotify.7.html for any user with
> linux 5.1+.
I see a:
    Calling fanotify_init() requires the CAP_SYS_ADMIN capability.
presumably meaning fanotify needs admin rights.
Comment 24 Méven 2023-05-08 11:03:02 UTC
(In reply to tagwerk19 from comment #23)
> (In reply to Méven from comment #22)
> > Baloo should be able to fix this using fanotify
> > https://man7.org/linux/man-pages/man7/fanotify.7.html for any user with
> > linux 5.1+.
> I see a:
>     Calling fanotify_init() requires the CAP_SYS_ADMIN capability.
> presumably meaning fanotify needs admin rights.

It seems to me that's not what man fanotify documentation says.
The example does not make use of it either.
It mentions fanotify should not be run with CAP_SYS_ADMIN or unprivileged users would have access to more than they should.
Comment 25 Méven 2023-05-08 11:37:21 UTC
(In reply to Méven from comment #24)
> (In reply to tagwerk19 from comment #23)
> > (In reply to Méven from comment #22)
> > > Baloo should be able to fix this using fanotify
> > > https://man7.org/linux/man-pages/man7/fanotify.7.html for any user with
> > > linux 5.1+.
> > I see a:
> >     Calling fanotify_init() requires the CAP_SYS_ADMIN capability.
> > presumably meaning fanotify needs admin rights.
> 
> It seems to me that's not what man fanotify documentation says.
> The example does not make use of it either.
> It mentions fanotify should not be run with CAP_SYS_ADMIN or unprivileged
> users would have access to more than they should.

Sorry you are right https://man7.org/linux/man-pages/man2/fanotify_init.2.html

The API does need CAP_SYS_ADMIN.

So baloo could achieve this using an external root-owned executable with the setuid bit set, whose only role would be to notify baloo of file changes in indexed directories.
Comment 26 Oded Arbel 2023-05-08 12:14:17 UTC
(In reply to Méven from comment #25)
> The API does need CAP_SYS_ADMIN.

This was indeed true, up until Linux 5.12: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/notify/fanotify/fanotify_user.c?h=v5.12#n923

Since Linux 5.13, `CAP_SYS_ADMIN` is no longer required and instead just limits the flags you can use and the behavior you can expect: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/notify/fanotify/fanotify_user.c?h=v5.13#n1044

With 5.13 and later, without `CAP_SYS_ADMIN` you cannot set fanotify for filesystem/mount wide marks and you can only get events with a file descriptor (that you can use, AFAIU): https://patchwork.kernel.org/project/linux-fsdevel/patch/20210524135321.2190062-1-amir73il@gmail.com/

I believe this should still be good enough for Baloo's purposes, as we are only expecting Baloo to use FAN_MARK_INODE for directories listed in the "File Search" configuration.
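
A version gate along these lines could decide at runtime whether unprivileged fanotify is worth attempting. This is a sketch that looks at the kernel version only; it assumes the 5.13 threshold from the kernel sources linked above, does not probe fanotify itself, and the function names are made up.

```python
import platform
import re

# Per the kernel sources linked above, fanotify_init() stopped requiring
# CAP_SYS_ADMIN in Linux 5.13.
UNPRIVILEGED_FANOTIFY_SINCE = (5, 13)


def parse_release(release):
    """Turn a release string like '5.19.0-35-generic' into (5, 19)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def unprivileged_fanotify_possible(release=None):
    """True if the kernel version alone allows unprivileged fanotify."""
    if release is None:
        release = platform.release()
    return parse_release(release) >= UNPRIVILEGED_FANOTIFY_SINCE
```

A True result only means the version is new enough; the actual fanotify_init() call can still fail, so a fallback to inotify would be needed either way.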
Comment 27 tagwerk19 2023-05-08 13:13:22 UTC
(In reply to Oded Arbel from comment #26)
> ... I believe this should still be good enough for Baloo's purposes ...
What does fanotify offer beyond inotify?

I know deleting "too many things all at once" can overflow the inotify queue. Also that when unpacking a .tar folders can be created and files extracted into them faster than baloo can set up notify watches on the folders. It used to be that max_user_watches was too small (on some distributions) but I think that is no longer a problem.
Comment 28 Oded Arbel 2023-05-08 15:07:10 UTC
(In reply to tagwerk19 from comment #27)
> (In reply to Oded Arbel from comment #26)
> > ... I believe this should still be good enough for Baloo's purposes ...
> What does fanotify offer beyond inotify?

> Also that when unpacking a .tar folders can be created and files
> extracted into them faster than baloo can set up notify watches on the
> folders.

fanotify allows you to ignore that issue by setting up one watch on each of the folders configured in Baloo's KCM and that's it - there are no more race conditions between applications creating folders and Baloo putting inotify watches on them.

> I know deleting "too many things all at once" can overflow the inotify
> queue.

That is possibly still going to be an issue with fanotify - the default event queue with fanotify is 16384 events, and without `CAP_SYS_ADMIN` you can't increase that size - though it is likely that baloo can consume events fast enough for this not to be a serious issue.
Comment 29 Igor Poboiko 2023-05-08 19:33:23 UTC
(In reply to Oded Arbel from comment #28)
>[...] 
> fanotify allows you to ignore that issue by setting up one watch on each of
> the folders configured in Baloo's KCM and that's it - there are no more race
> conditions between applications creating folders and Baloo putting inotify
> watches on them.

Sorry, I will add my 50 cents here. `man fanotify` claims that watches are not recursive, and should be set up for subdirectories separately, so such race condition is still there. Those could have been avoided if we could put a mark for the whole mount point / tree, but AFAIK that requires CAP_SYS_ADMIN.
Comment 30 Méven 2023-05-10 10:36:03 UTC
(In reply to Igor Poboiko from comment #29)
> (In reply to Oded Arbel from comment #28)
> >[...] 
> > fanotify allows you to ignore that issue by setting up one watch on each of
> > the folders configured in Baloo's KCM and that's it - there are no more race
> > conditions between applications creating folders and Baloo putting inotify
> > watches on them.
> 
> Sorry, I will add my 50 cents here. `man fanotify` claims that watches are
> not recursive, and should be set up for subdirectories separately, so such
> race condition is still there. Those could have been avoided if we could put
> a mark for the whole mount point / tree, but AFAIK that requires
> CAP_SYS_ADMIN.

I tested the program provided in the example and it works, reporting any event of the types we ask for that happens on a filesystem.
I am not sure what "recursive" applies to here.
Comment 31 Oded Arbel 2023-05-11 11:56:44 UTC
(In reply to Méven from comment #30)
> I tested the program provided in the example and it works, reporting any
> event of the types we ask for that happens on a filesystem.
> I am not sure what "recursive" applies to here.

Did you run the test with `CAP_SYS_ADMIN`? If so, did you test filesystem marks or directory marks?

The "recursive" issue is that fanotify only improves upon the issues with inotify if you can set a watch on a directory and receive all events on all of its subdirectories - without needing to register more watches on each subdirectory. If this is not the case - as the man page definitely claims it is not (unless you use mount or filesystem marks) - then we're still stuck with the race condition of a fast application (such as Ark) creating new directories and immediately new directories within them, and Baloo will not see files created in the subdirectories.
Comment 32 Klaus 2023-09-12 14:11:54 UTC
I came to this bug report via this Reddit discussion:

https://www.reddit.com/r/kde/comments/nud5kj/outdated_file_results_in_application_launcher/

It describes a possibly closely related issue, where search results in KRunner and Application Launcher may lag behind the results in baloo itself, indicating that there may be some unnecessary intermediate caching.

E.g. "baloosearch cardona" already gives the expected result, but KRunner/Application Launcher give no result, or an outdated location.
Comment 33 Jan Rathmann 2024-02-29 09:12:42 UTC
*** Bug 429006 has been marked as a duplicate of this bug. ***
Comment 34 Jan Rathmann 2024-02-29 09:21:07 UTC
*** Bug 457746 has been marked as a duplicate of this bug. ***