Bug 460509

Summary: Baloo indexes files temporarily mounted from other file systems
Product: [Frameworks and Libraries] frameworks-baloo Reporter: Adam Fontenot <adam.m.fontenot+kde>
Component: general Assignee: baloo-bugs-null
Status: RESOLVED FIXED    
Severity: normal CC: nate, tagwerk19
Priority: NOR    
Version: 5.99.0   
Target Milestone: ---   
Platform: Arch Linux   
OS: Linux   
Latest Commit: Version Fixed In: Frameworks 6.1
Sentry Crash Report:

Description Adam Fontenot 2022-10-15 22:09:55 UTC
SUMMARY

I was previously under the impression that Baloo was intelligently skipping other file systems in its indexing, because even when it indexes them Dolphin (for whatever reason) doesn't seem to show them.

However, I have found using the `baloosearch` CLI that it is, in fact, indexing files on remote (network based) file systems.

STEPS TO REPRODUCE
1. Make sure file indexing is enabled and content indexing is disabled. (I'm not sure what would happen if content indexing were enabled and I'm afraid to find out.)
2. Mount a remote directory via sshfs to a part of your home directory that Baloo is set to index.
3. Activate Baloo and/or force a rescan.
4. Search for files known to be on the remote file system (e.g. via Dolphin and via baloosearch).

OBSERVED RESULT
The files appear in `baloosearch` and can therefore be assumed to be in the index. Actually, this is probably responsible for the unusually large index size I was experiencing (which tagwerk19 commented on here: https://bugs.kde.org/show_bug.cgi?id=460460#c1). 

The files do not appear in the Dolphin search tool - despite the fact that Dolphin *does* cross file system boundaries if Baloo is disabled. See here for that bug: https://bugs.kde.org/show_bug.cgi?id=460508

EXPECTED RESULT
Baloo does not index files on remote file systems. These are often temporarily mounted and can change location. E.g. the same files may be on ~/remote one day, ~/remote2 another day, and so on. I'm not sure if Baloo regards these files as deleted when the remote is unmounted, but if it does the result is probably disk thrashing as it has to update its index.

(I also expect Dolphin to show the same results as baloosearch, but that's a secondary issue here.)
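
The expected behaviour above amounts to "do not cross file system boundaries while scanning". A minimal sketch of that idea follows. It uses the device-number comparison that `find -xdev` relies on and is not Baloo's actual implementation. The function name `walk_same_filesystem` and the `device_of` parameter are hypothetical, introduced only so the pruning can be exercised without a real mount.

```python
import os


def walk_same_filesystem(root, device_of=lambda p: os.lstat(p).st_dev):
    """Yield file paths under root, skipping directories on another device.

    device_of is a hypothetical hook (an assumption of this sketch, not a
    Baloo API); by default it returns the st_dev reported by lstat().
    """
    root_dev = device_of(root)
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into
        # directories whose device number differs from the root's,
        # which is the case for mount points such as an sshfs mount.
        dirnames[:] = [
            d for d in dirnames
            if device_of(os.path.join(dirpath, d)) == root_dev
        ]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

With this approach a directory such as ~/remote is skipped while something is mounted on it, and scanned like any other directory once it is unmounted.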

SOFTWARE/OS VERSIONS
Operating System: Arch Linux
KDE Plasma Version: 5.26.0
KDE Frameworks Version: 5.99.0
Qt Version: 5.15.6
Kernel Version: 6.0.1-arch1-1 (64-bit)
Graphics Platform: X11
Comment 1 tagwerk19 2022-10-17 06:20:58 UTC
(In reply to Adam Fontenot from comment #0)
> ... I'm not sure if Baloo regards these files as deleted when the
> remote is unmounted, but if it does the result is probably disk thrashing
> as it has to update its index ...
It seems to. And yes, large-scale deletes seem to be hard work for Baloo (as described in Bug 442453).

What might muddle things is that unmounting such an sshfs mount within your home directory might not generate an inotify event saying that the files have gone. Would need to check that.

I suspect if you do a "fusermount -u XXXX", baloo doesn't notice unless you do a "balooctl check" or until the next time you log in.
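
If inotify does not report the unmount, one alternative is to compare snapshots of the mount table. The sketch below assumes the text comes from /proc/self/mounts on Linux, where the second whitespace-separated field is the mount point. The function name `unmounted_since` is hypothetical, and nothing here reflects how Baloo itself tracks mounts.

```python
def unmounted_since(previous_mounts_text, current_mounts_text):
    """Return mount points present in the first snapshot but not the second.

    Both arguments are assumed to hold the contents of /proc/self/mounts
    taken at two different times.
    """
    def mount_points(text):
        points = set()
        for line in text.splitlines():
            fields = line.split()
            if len(fields) >= 2:
                # Field 0 is the device or remote source, field 1 the mount point.
                points.add(fields[1])
        return points

    return sorted(mount_points(previous_mounts_text) - mount_points(current_mounts_text))
```

Anything returned here is a path whose indexed contents have just disappeared, for example after a "fusermount -u XXXX".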
Comment 2 tagwerk19 2022-10-17 18:40:26 UTC
(In reply to Adam Fontenot from comment #0)
> ... Baloo was intelligently skipping other file systems in its indexing ...
That had apparently been implemented a while back - Bug 333433

But indexing does seem to happen with sshfs
    https://bugs.kde.org/show_bug.cgi?id=460508#c3
Although don't have a feeling for what "people would expect (or want) to happen" 

Confirming...
Comment 3 Adam Fontenot 2022-10-18 22:35:23 UTC
(In reply to tagwerk19 from comment #2)
> (In reply to Adam Fontenot from comment #0)
> > ... Baloo was intelligently skipping other file systems in its indexing ...
> That had apparently been implemented a while back - Bug 333433
> 
> Although don't have a feeling for what "people would expect (or want) to
> happen" 
I'm thinking given Bug 333433 that Baloo should not index mounted file systems by default. Maybe an option could be provided, but the ability to manually "opt-in" specific directories by adding them to the indexing list is probably good enough.
Comment 4 Bug Janitor Service 2022-10-19 01:55:07 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/96
Comment 5 tagwerk19 2022-10-19 07:28:58 UTC
(In reply to Adam Fontenot from comment #3)
> ... Maybe an option could be provided, but the ability to
> manually "opt-in" specific directories by adding them to the indexing list
> is probably good enough ...
Wandering off into the territory of "personal preferences" here, but I trust the idea of fewest surprises...

    I think, if you plug in and mount a USB device (appearing under /run/media/<username>) it's not indexed. It's "removable".
    If you've taken the extra step and mounted something through /etc/fstab, it's not "removable", it can be indexed (if you wish).

Extending the model to sshfs...

    A command line sshfs mount, even if mounted within your home directory, should be considered "removable" and the contents should not be indexed by default.
    Whereas you should be able to index an sshfs mount that was set up in /etc/fstab

Any folder include/excludes in the .config/baloofilerc should take precedence.
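
The model proposed above can be written down as a small decision function. This is only a sketch of the proposal in this comment, not of Baloo's code. The function name and parameters are hypothetical, and it assumes that an include or exclude entry counts as "explicit" only when it names the mount point exactly.

```python
import os


def index_mount_by_default(mount_point, fstab_mount_points, included=(), excluded=()):
    """Decide whether a mounted file system should be indexed.

    Assumption of this sketch: only include/exclude entries that name the
    mount point itself are treated as an explicit user choice.
    """
    def normalized(paths):
        return {os.path.normpath(p) for p in paths}

    mount_point = os.path.normpath(mount_point)
    # Explicit user configuration takes precedence over the heuristic.
    if mount_point in normalized(excluded):
        return False
    if mount_point in normalized(included):
        return True
    # Otherwise a mount listed in /etc/fstab is treated as permanent and
    # indexable; anything else is treated as "removable" and skipped.
    return mount_point in normalized(fstab_mount_points)
```

For example, an ad hoc sshfs mount at /home/u/remote with no fstab entry is skipped unless /home/u/remote appears in the include list.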

It's not so clear-cut with Dolphin: I can imagine people expecting a search "From here" to search everything below "here", irrespective of whether it is local or remote. At least for filename searches.

You can see many bugs/issues/queries logged about Dolphin/Baloo's behaviour with symlinks, where the expectation is that "symlinks are followed". I think this is a reasonable indicator...
Comment 6 Adam Fontenot 2022-10-22 07:44:15 UTC
(In reply to tagwerk19 from comment #5)
> (In reply to Adam Fontenot from comment #3)
> > ... Maybe an option could be provided, but the ability to
> > manually "opt-in" specific directories by adding them to the indexing list
> > is probably good enough ...
> Wandering off into the territory of "personal preferences" here, but I trust
> the idea of fewest surprises...
Right - I think the one case where we can say indexing definitely *shouldn't* happen is when something is mounted "temporarily" - although maybe that isn't clearly defined yet. Basically, if there's any reason to think the path might be expected to change?

I agree with you that something's being in fstab is a good sign it's "permanent" and should be indexed. However, I think if Baloo is going to do that, several footguns need to be avoided:

 * The heuristics for determining which filesystems are permanent need to be pretty much flawless. You could have an fstab set up so that multiple USB drives are all mounted on demand to ~/usb. That's pretty much a worst-case scenario. Files suddenly appear and disappear, Baloo thrashes the database trying to delete everything, etc.
 * Some method for determining that a given file system is network-based is probably needed; I think content indexing should probably be turned off for these file systems by default. The user could always opt in for individual directories as needed.
 * Baloo needs to have a mechanism where downstream search tools don't see files on unmounted file systems in their searches.
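
For the second point, one possible way to tell that a file system is network-based is to look at its type in the mount table. The sketch below assumes input in the /proc/self/mounts format, where the third field is the file system type. The set of types is illustrative and certainly incomplete, and the names `NETWORK_FS_TYPES` and `network_mount_points` are hypothetical. Mount points containing spaces appear escaped in that file (as \040) and are not decoded here.

```python
# Illustrative, non-exhaustive set of file system types served over a network.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "9p", "ceph", "fuse.sshfs"}


def network_mount_points(mounts_text):
    """Return mount points whose file system type looks network-based.

    mounts_text is assumed to hold the contents of /proc/self/mounts.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: source, mount point, file system type, options, ...
        if len(fields) >= 3 and fields[2] in NETWORK_FS_TYPES:
            result.append(fields[1])
    return result
```

An indexer could use such a list to turn content indexing off for those paths by default, leaving the user to opt in per directory.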

In the meantime, I think the right move is to fix issues like this one (Baloo indexing huge file systems - multiple terabytes in my case - over the network) by changing the defaults so that Baloo never crosses file systems unless the user manually opts a folder in. We can make this work better when the issues above are solved.

Users with permanent file systems they want indexed are in a better position anyway. Because their paths are static, they can manually include things in the indexing list easily. I think that's one good reason to lean in the direction of not trying to do too much magic in Baloo by default. When Baloo is including stuff in directories that don't have static locations and you want it to stop, there's not much you can do about that.
Comment 7 Adam Fontenot 2024-02-18 17:51:42 UTC
My merge request from Oct 2022 never got any review. I rebased the changes and I'm leaving a note about it here in the hope that we can close this issue.
Comment 8 tagwerk19 2024-02-20 08:50:09 UTC
(In reply to Adam Fontenot from comment #7)
> My merge request from Oct 2022 never got any review. I rebased the changes
> and I'm leaving a note about it here in the hope that we can close this
> issue.
I remember stumbling upon this and then finding it near impossible to find it again. 
     https://invent.kde.org/plasma/plasma-desktop/-/issues/71
It might be of interest, it certainly runs against some of my earlier thoughts.
Comment 9 Adam Fontenot 2024-02-20 14:59:29 UTC
(In reply to tagwerk19 from comment #8)
> (In reply to Adam Fontenot from comment #7)
> > My merge request from Oct 2022 never got any review. I rebased the changes
> > and I'm leaving a note about it here in the hope that we can close this
> > issue.
> I remember stumbling upon this and then finding it near impossible to find
> it again. 
>      https://invent.kde.org/plasma/plasma-desktop/-/issues/71
> It might be of interest, it certainly runs against some of my earlier
> thoughts.

Encouraging better choices with network mounts is a good thing, but this is still needed. The issues I've seen with indexing are the result of FUSE mounts, not /etc/fstab or `mount -t nfs` type mounts that hang in the way this Plasma issue describes.
Comment 10 Felix Ernst 2024-03-29 10:32:16 UTC
Git commit 373cf1e567e2580145f137176d440da27c319f06 by Felix Ernst, on behalf of Adam Fontenot.
Committed on 29/03/2024 at 10:32.
Pushed by felixernst into branch 'master'.

Skip indexing KDE FS volumes unless user included

In 69411a, we changed the indexer behavior so that removable media
is not indexed by default. This commit tries to extend this
behavior to any temporarily mounted file system.

For instance, fuse.sshfs and overlay mounted file systems are
managed in Solid under the /org/kde/fstab parent. Most likely, users
will not want to index these file systems by default.

This commit also changes the initialization procedure for
StorageDevices. We now attempt to create a cached entry for *all*
Solid devices when initializing. It makes sense to do this because
`createCacheEntry` is already called whenever a device is added or
removed, without any further filtering. Trying to precisely specify
which devices to include at the initialization stage risks leaving
out devices like the /org/kde/fstab devices that are the subject
of this PR.
Related: bug 390830

M  +3    -3    src/file/storagedevices.cpp

https://invent.kde.org/frameworks/baloo/-/commit/373cf1e567e2580145f137176d440da27c319f06