Bug 402154 - Baloo reindexes everything after every reboot when using BTRFS filesystem
Status: CONFIRMED
Alias: None
Product: frameworks-baloo
Classification: Unclassified
Component: Baloo File Daemon
Version: 5.81.0
Platform: openSUSE RPMs Linux
Importance: NOR normal
Assignee: baloo-bugs-null
Keywords: efficiency
Reported: 2018-12-15 13:47 UTC by Thomas Pfeiffer
Modified: 2022-06-21 09:24 UTC

Description Thomas Pfeiffer 2018-12-15 13:47:08 UTC
SUMMARY

STEPS TO REPRODUCE
1. Reboot system

OBSERVED RESULT

baloo_file_extractor goes through all indexed folders and reindexes everything

EXPECTED RESULT

It should only index new/changed files

SOFTWARE/OS VERSIONS
Linux: 4.15.0
KDE Plasma Version:  5.14.4
KDE Frameworks Version: 5.53.0
Qt Version: 5.11.2
Comment 1 Pierre Baldensperger 2019-04-21 09:14:12 UTC
Yes, I have been observing very similar behaviour with many previous baloo versions for years, and up to the latest 5.57 (I use OpenSuSE Leap 15.0 with latest KDE packages from the "KDE Frameworks 5" repositories). To make it clear : it is not merely resetting the index, it is endlessly accumulating duplicates of it. Very symptomatic : upon _each_ reboot, the file counter (in "balooctl status") is _exactly_ _increased_ by the actual number of files, and indexing seems to painstakingly rebuild and append a new duplicate index each time, not realizing that these are indeed the same files it had already indexed before the reboot. I suspect it would behave the same after launching a "balooctl check" but I never tried this.

I should mention that I use a BTRFS file system so this might be part of the problem (and why so few people seem to experience the same issue). Maybe there is some misunderstanding between baloo and the way BTRFS reports or sets file attributes to identify them as "already indexed" ? I can't remember for sure, but I don't seem to remember seeing this problem before I switched to the BTRFS file system (I was using EXT4 previously, more than 2 years ago) : there were definitely problems with baloo at the time (crashes and such), but I don't remember seeing the same symptom of duplicate indexing. I have seen that there is a distinct bug report about suspiciously similar BTRFS problems :

https://bugs.kde.org/show_bug.cgi?id=401863

Of course, reboot after reboot, this behaviour triggers a never ending increase of resource bloat, not to mention hours of slowdown after each reboot due to high CPU and memory load while the indexer browses all the files again to add an Nth duplicate to its index. At this stage, I would at least like to work around the problem by performing only a first-time indexing run and then stopping the file content indexer while still being able to search the index (in krunner or dolphin). But unfortunately I never found any working way to reliably stop the file content indexer and still use the search engine (shouldn't one of "balooctl stop" or "balooctl suspend" allow this ?), so the only option seems to be to disable baloo completely and lose its search abilities. Moreover, in my experience, once the file content indexer has been started, the only way to really stop it is to kill the "baloo_file" process manually : otherwise it will survive all "balooctl" commands including "balooctl disable". Maybe this part belongs in a separate bug report : here also I found other similar bug reports complaining about "balooctl" not actually stopping or suspending "baloo_file" operation :

https://bugs.kde.org/show_bug.cgi?id=404121
https://bugs.kde.org/show_bug.cgi?id=353559

But none of these reports mentions that the search engine should remain usable even after the file content indexer has been stopped / killed. Also the file _content_ search should remain operational even after the file indexer has been instructed to stop indexing file content (in my experience, disabling the "file content indexing" option also immediately reduces the search scope to file names only, despite the existence of a recent content index). Or did I miss something about baloo search usage ?
Comment 2 Kieran Ramos 2019-06-05 20:19:51 UTC
I have also noticed similar behavior. I am hypothesizing that the problem is because I instructed baloo to also index a secondary drive which is using encrypted ZFS. Unfortunately on my last reboot my encrypted ZFS partition did not automatically mount, but after mounting it baloo started to reindex it. The reindexing takes hours and a lot of resources that slows down the system. This has been a recurring issue.
Comment 3 Stefan Brüns 2020-03-20 20:08:18 UTC
KF 5.53 and even 5.57 is way too old. Update to a current version.
Comment 4 Pierre Baldensperger 2020-03-28 09:15:43 UTC
(In reply to Stefan Brüns from comment #3)
> KF 5.53 and even 5.57 is way too old. Update to a current version.

Thanks a lot for following up on this.
I haven't been stuck on 5.57 : I regularly update my system, currently I am running KDE Frameworks 5.68.0 with Plasma 5.18.3 (under OpenSuSE Leap 15.1 if that matters).

The situation with baloo re-indexing is still *exactly* the same as last year, and it has been the same with (almost) every single KF release in the interval (not sure I tested with all of them but pretty close). No need to go into details again : they are still exactly the same as in my earlier post and the bug title summarizes it in a self-explanatory fashion. First indexing run seems to go all right, then _after_reboot_ it does apparently think that all files are new and re-indexes everything, not realizing they are exactly the same files. Reboot after reboot, this ends up with ever increasing index size and  multiplying the number of indexed files vs the number of actual files.

As I said in the previous post, I suspect that this may be due to some specificity of the BTRFS file system I use, but I have no way to test that hypothesis. Something with the BTRFS operation (system reboot related) may induce baloo into thinking that the files are new / not already indexed...? At least I don't see any other peculiarity of my system so, BTRFS aside, this problem would likely affect many users and wouldn't have gone almost unnoticed with so few reports : there would be tons of bug reports about the same thing.

HOWEVER, regarding the second part of my previous comment, I noticed that the command "balooctl suspend" now behaves as expected : it stops the frantic indexer, but I am still able to use the search function. So that's at least some substantial recovered functionality that makes baloo much much better than the dead weight it was before for me ! Thanks a lot to whoever improved this behavior !
Comment 5 Nate Graham 2020-10-26 16:02:10 UTC
You can thank Stefan for putting in tons and tons of work into Baloo. :)

As of 5.76 I no longer have any problems like this. It currently gets stuck on one of my files, but now notices this and skips that file, preventing this kind of endless re-indexing behavior. Are you still seeing it with Frameworks 5.75 or later?
Comment 6 Pierre Baldensperger 2020-10-27 06:36:25 UTC
(In reply to Nate Graham from comment #5)
> You can thank Stefan for putting in tons and tons of work into Baloo. :)
> 
> As of 5.76 I no longer have any problems like this. It currently gets stuck
> on one of my files, but now notices this and skips that file, preventing
> this kind of endless re-indexing behavior. Are you still seeing it with
> Frameworks 5.75 or later?

Of course, Stefan deserves zillions of thanks for working on this. Over the last few months and years, Baloo has definitely become much better. Even in my case it is now completely usable, provided I issue a "balooctl suspend" each time I open a new session to prevent this strange "re-indexing" behaviour.

Currently I am still on KF 5.75 (OpenSuSE Leap 15.2). I have just run a test overnight after reading your message, and unfortunately I must confirm that the behaviour is still present and Baloo is still re-indexing everything after each reboot. I am looking forward to 5.76 to see if this is actually fixed.
Comment 7 Nate Graham 2020-10-27 10:45:47 UTC
Thanks for the info!
Comment 8 Eridani Rodríguez 2020-10-29 18:20:53 UTC
My system seems to be affected too. In my case / is BTRFS, while /home/MyUser/Desktop or /home/MyUser/Music (and all other user folders) are on separate ZFS datasets. None of them are encrypted, but both use compression among other features. Baloo will reindex all /home/MyUser/ZFSMountedDirectories on reboot.

I'll leave some additional specs of another system (mine) in hopes that it could be helpful:

Operating System: KDE neon 5.20
KDE Plasma Version: 5.20.2
KDE Frameworks Version: 5.75.0
Qt Version: 5.15.0
Kernel Version: 5.4.0-52-generic
OS Type: 64-bit
Processors: 4 × Intel® Core™ i5-4670 CPU @ 3.40GHz
Memory: 15.6 GiB of RAM
Graphics Processor: GeForce GTX 1060 3GB/PCIe/SSE2

btrfs-progs v5.4.1 
zfs-0.8.3-1ubuntu12.4
zfs-kmod-0.8.3-1ubuntu12.4

ZFS INFO for an affected dataset
--------------------
zfs get all Link/Home/Music
NAME             PROPERTY               VALUE                   SOURCE
Link/Home/Music  type                   filesystem              -
Link/Home/Music  creation               Sat Jun  6 16:13 2020  -
Link/Home/Music  used                   20.8G                   -
Link/Home/Music  available              655G                    -
Link/Home/Music  referenced             20.8G                   -
Link/Home/Music  compressratio          1.01x                   -
Link/Home/Music  mounted                yes                     -
Link/Home/Music  quota                  none                    default
Link/Home/Music  reservation            none                    default
Link/Home/Music  recordsize             128K                    default
Link/Home/Music  mountpoint             /home/MyUser/Music     local
Link/Home/Music  sharenfs               off                     default
Link/Home/Music  checksum               on                      default
Link/Home/Music  compression            on                      inherited
Link/Home/Music  atime                  on                      default
Link/Home/Music  devices                on                      default
Link/Home/Music  exec                   on                      default
Link/Home/Music  setuid                 on                      default
Link/Home/Music  readonly               off                     default
Link/Home/Music  zoned                  off                     default
Link/Home/Music  snapdir                hidden                  default
Link/Home/Music  aclinherit             restricted              default
Link/Home/Music  createtxg              261                     -
Link/Home/Music  canmount               on                      default
Link/Home/Music  xattr                  on                      default
Link/Home/Music  copies                 1                       default
Link/Home/Music  version                5                       -
Link/Home/Music  utf8only               off                     -
Link/Home/Music  normalization          none                    -
Link/Home/Music  casesensitivity        sensitive               -
Link/Home/Music  vscan                  off                     default
Link/Home/Music  nbmand                 off                     default
Link/Home/Music  sharesmb               off                     default
Link/Home/Music  refquota               none                    default
Link/Home/Music  refreservation         none                    default
Link/Home/Music  guid                   14467266995499749484    -
Link/Home/Music  primarycache           all                     default
Link/Home/Music  secondarycache         all                     default
Link/Home/Music  usedbysnapshots        51.8M                   -
Link/Home/Music  usedbydataset          20.8G                   -
Link/Home/Music  usedbychildren         0B                      -
Link/Home/Music  usedbyrefreservation   0B                      -
Link/Home/Music  logbias                latency                 default
Link/Home/Music  objsetid               167                     -
Link/Home/Music  dedup                  off                     default
Link/Home/Music  mlslabel               none                    default
Link/Home/Music  sync                   standard                default
Link/Home/Music  dnodesize              legacy                  default
Link/Home/Music  refcompressratio       1.01x                   -
Link/Home/Music  written                0                       -
Link/Home/Music  logicalused            21.1G                   -
Link/Home/Music  logicalreferenced      21.0G                   -
Link/Home/Music  volmode                default                 default
Link/Home/Music  filesystem_limit       none                    default
Link/Home/Music  snapshot_limit         none                    default
Link/Home/Music  filesystem_count       none                    default
Link/Home/Music  snapshot_count         none                    default
Link/Home/Music  snapdev                hidden                  default
Link/Home/Music  acltype                off                     default
Link/Home/Music  context                none                    default
Link/Home/Music  fscontext              none                    default
Link/Home/Music  defcontext             none                    default
Link/Home/Music  rootcontext            none                    default
Link/Home/Music  relatime               on                      inherited
Link/Home/Music  redundant_metadata     all                     default
Link/Home/Music  overlay                off                     default
Link/Home/Music  encryption             off                     default
Link/Home/Music  keylocation            none                    default
Link/Home/Music  keyformat              none                    default
Link/Home/Music  pbkdf2iters            0                       default
Link/Home/Music  special_small_blocks   0                       default
Link/Home/Music  com.sun:auto-snapshot  on                      inherited
Comment 9 Stefan Brüns 2020-10-29 18:22:40 UTC
ZFS is not supported.
Comment 10 Pierre Baldensperger 2020-11-15 21:08:58 UTC
(In reply to Nate Graham from comment #5)
> You can thank Stefan for putting in tons and tons of work into Baloo. :)
> 
> As of 5.76 I no longer have any problems like this. It currently gets stuck
> on one of my files, but now notices this and skips that file, preventing
> this kind of endless re-indexing behavior. Are you still seeing it with
> Frameworks 5.75 or later?

Just had an update to 5.76 today. Unfortunately the problem doesn't seem to be solved in my case (BTRFS). I just did a test run (balooctl disable; balooctl purge; balooctl enable), waited for the indexing to finish (ca. 500k files), rebooted... and unfortunately, after scanning for new files to index it starts re-indexing everything just like before (pushing the "total" number of indexed files to 1 million). So again I suspended the indexer. If there is anything I can test or submit to help diagnose the root cause, just let me know.
Comment 11 Pierre Baldensperger 2021-04-26 16:47:01 UTC
Still same behaviour with frameworks 5.81.0 (OpenSuSE Leap 15.2, BTRFS file system).
Comment 12 tagwerk19 2021-04-26 22:22:08 UTC
(In reply to Pierre Baldensperger from comment #11)
> Still same behaviour with frameworks 5.81.0 (OpenSuSE Leap 15.2, BTRFS file
> system).
It's on a reboot and not when you logout and back in again?

Try a simple test...

Maybe set up a test user so you don't have to reindex everything. Create a test file and check its details...

    echo "Hello Penguin" > testfile.txt
    stat testfile.txt
    balooshow -x testfile.txt

I get:

    $ stat testfile.txt
      File: testfile.txt
      Size: 14              Blocks: 8          IO Block: 4096   regular file
    Device: 38h/56d Inode: 5089        Links: 1
    Access: (0644/-rw-r--r--)  Uid: ( 1001/    test)   Gid: (  100/   users)
    Access: 2021-04-26 23:38:06.214398262 +0200
    Modify: 2021-04-26 23:38:06.214398262 +0200
    Change: 2021-04-26 23:38:06.214398262 +0200
     Birth: 2021-04-26 23:38:06.214398262 +0200

    $ balooshow -x testfile.txt
    13e100000038 56 5089 testfile.txt [/home/test/testfile.txt]
            Mtime: 1619473086 2021-04-26T23:38:06
            Ctime: 1619473086 2021-04-26T23:38:06

    Internal Info
    Terms: Mplain Mtext T5 T8
    File Name Terms: Ftestfile Ftxt
    XAttr Terms:

Keep an eye on the "Device:" number (the 38 Hex, 56 decimal above)

Reboot and run the stat and balooshow again.

Interesting to know if the device number has changed, and whether the balooshow details have also changed...
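
For anyone who wants to script the comparison, here is a minimal sketch that prints the same two numbers "stat" shows, in the same "hex h / decimal d" form. The function name describe() is made up for this example; it only relies on os.stat() from the Python standard library. Run it before and after a reboot and compare the device value.

```python
import os


def describe(path):
    """Return the device and inode numbers that stat() reports for path."""
    st = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "dev_hex": format(st.st_dev, "x"),  # e.g. "38" for 38h
        "dev_dec": st.st_dev,               # e.g. 56 for 56d
        "inode": st.st_ino,
    }


if __name__ == "__main__":
    # Inspect the current directory; pass another path to describe() as needed.
    info = describe(".")
    print("{path}: device {dev_hex}h/{dev_dec}d inode {inode}".format(**info))
```

If the inode stays the same but the device value differs between two boots, you are seeing the same thing as described above.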
Comment 13 tagwerk19 2021-04-27 06:26:09 UTC
This might also explain the instances of a search finding many copies of the same file. Looking at the filesystem with many subvols:

    $ df
    Filesystem     1K-blocks    Used Available Use% Mounted on
    devtmpfs         1994516       0   1994516   0% /dev
    tmpfs            2006844       0   2006844   0% /dev/shm
    tmpfs             802740    1368    801372   1% /run
    tmpfs               4096       0      4096   0% /sys/fs/cgroup
    /dev/vda2       31447040 8928724  22190748  29% /
    /dev/vda2       31447040 8928724  22190748  29% /.snapshots
    /dev/vda2       31447040 8928724  22190748  29% /root
    /dev/vda2       31447040 8928724  22190748  29% /var
    /dev/vda2       31447040 8928724  22190748  29% /srv
    /dev/vda2       31447040 8928724  22190748  29% /home
    /dev/vda2       31447040 8928724  22190748  29% /opt
    /dev/vda2       31447040 8928724  22190748  29% /usr/local
    /dev/vda2       31447040 8928724  22190748  29% /boot/grub2/x86_64-efi
    /dev/vda2       31447040 8928724  22190748  29% /boot/grub2/i386-pc
    tmpfs            2006848       0   2006848   0% /tmp
    tmpfs             401368      36    401332   1% /run/user/1001

and rebooting half a dozen times, I get:

    $ baloosearch "Hello Penguin"
    /home/test/testfile.txt
    /home/test/testfile.txt
    /home/test/testfile.txt
    /home/test/testfile.txt
    /home/test/testfile.txt
    Elapsed: 1.27381 msecs

It seems clear that these files are reindexed after the system has been rebooted.

It also seems that index entries whose internal IDs no longer match anything existing on the filesystem are not cleaned up.

SOFTWARE/OS VERSIONS

    openSUSE Tumbleweed 20210325
    Plasma: 5.21.3
    Frameworks: 5.80.0
    Qt: 5.15.2
Comment 14 Pierre Baldensperger 2021-04-27 21:12:43 UTC
(In reply to tagwerk19 from comment #12)
> 
> Try a simple test...
> 
> (...)
> 
> Interesting to know if the device number has changed, and whether the
> balooshow details have also changed...

Thank you very much for the helpful hints in diagnosing this.
You are spot on !!
Indeed the device number changes after every reboot.

$ diff stat1.log stat2.log 
< Device: 35h/53d  Inode: 24588954    Links: 1
---
> Device: 37h/55d  Inode: 24588954    Links: 1

And there is also a corresponding change in balooshow.

$ diff baloo1.log baloo2.log
< 177329a00000035 53 24588954 testfile.txt [/home/test/testfile.txt]
---
> 177329a00000037 55 24588954 testfile.txt [/home/test/testfile.txt]

A baloosearch returns the same file twice.
And I do indeed have a bunch of subvols.
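
The first column of the balooshow output can be taken apart to make the change visible. The sketch below assumes, judging only from the outputs quoted in this report, that the ID is a 64-bit hex number with the inode in the upper 32 bits and the device number in the lower 32 bits; split_document_id() is a hypothetical helper, not part of baloo.

```python
def split_document_id(doc_id_hex):
    """Split a balooshow-style hex ID into (device, inode).

    Assumed layout (inferred from the outputs in this report):
    upper 32 bits = inode, lower 32 bits = device number.
    """
    doc_id = int(doc_id_hex, 16)
    device = doc_id & 0xFFFFFFFF
    inode = doc_id >> 32
    return device, inode


print(split_document_id("177329a00000035"))  # (53, 24588954)
print(split_document_id("177329a00000037"))  # (55, 24588954)
```

Only the device half differs between the two boots, which matches the diff of the stat output.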

Now hopefully somebody who knows the internals of baloo deduplication criteria might be able to understand where this behaviour is coming from, and confirm that this is likely a BTRFS-specific problem.
Comment 15 tagwerk19 2021-04-27 21:31:11 UTC
Can probably flag this as CONFIRMED then...
Comment 16 tagwerk19 2021-04-28 09:19:23 UTC
(In reply to Pierre Baldensperger from comment #14)
> Now hopefully somebody who knows the internals of baloo deduplication
> criteria might be able to understand where this behaviour is coming from,
> and confirm that this is likely a BTRFS-specific problem.
I think it's a question of "levels of indirection", BTRFS adds an extra level, could be that other filesystems do so as well.

I have a feeling this is going to be awkward. Looking at a system with two BTRFS discs, 'vda1' and 'vdb1', they also appear with different minor device numbers - the same as subvols on a single disc. Hmmm....
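
A rough way to list this on a running system is sketched below. It reads /proc/self/mounts (source, mount point and filesystem type in the first three columns) and collects the st_dev that stat() reports for each BTRFS mount point, grouped by source device. The function names are invented for the example, and the stat_dev parameter exists only so the grouping can be tried without real mounts.

```python
import os
from collections import defaultdict


def parse_mounts(text):
    """Parse /proc/self/mounts-style text into (source, mount_point, fstype)."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[0], fields[1], fields[2]))
    return mounts


def device_numbers_by_source(mounts, fstype="btrfs", stat_dev=None):
    """Group the st_dev of every mount point of the given type by its source."""
    if stat_dev is None:
        stat_dev = lambda path: os.stat(path).st_dev
    grouped = defaultdict(set)
    for source, mount_point, mount_fstype in mounts:
        if mount_fstype != fstype:
            continue
        try:
            grouped[source].add(stat_dev(mount_point))
        except OSError:
            continue  # mount point not accessible to this user
    return dict(grouped)


if __name__ == "__main__":
    try:
        with open("/proc/self/mounts") as f:
            mounts = parse_mounts(f.read())
    except OSError:
        mounts = []  # not on Linux, nothing to show
    for source, devs in device_numbers_by_source(mounts).items():
        print(source, sorted(devs))
```

A source that shows up with more than one device number has subvolumes that stat() reports as separate devices.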
Comment 17 tagwerk19 2021-04-28 09:36:20 UTC
Looks as if this is a long term issue...
Scroll down to the last posts in Bug 404057
Comment 18 Kai Krakow 2021-04-28 16:44:58 UTC
(In reply to tagwerk19 from comment #17)
> Looks as if this is a long term issue...
> Scroll down to the last posts in Bug 404057

Yep, this problem has already been deeply analyzed and is well understood. The referenced bug report includes a lot of thoughts, possible solutions, and also a few real improvements as patches - some of those were merged. There are also links to phabricator with extended discussion. I suggest to read that entirely to understand the problem (some later comments re-decide on previous thoughts).

Sadly, I mostly lost interest in this issue in favor of other more important or personal stuff. I simply ditched baloo since then as I wasn't really using it anyway that much.

But if anyone wants to take the effort in crafting any patches, they might want to start with implementing the mapping table from volume/subvolume UUID to a virtual device number - and that virtual device number would then be used instead of the real one. This way, a distinct file system would always show up as the same device number in baloo - no matter on which device node it appeared. It solves almost all of the problems mentioned here. I volunteer to mentor/help with such an implementation, I'm just too bad with Qt/KDE APIs to kickstart that myself.

Later improvements should look at access patterns and how to optimize that, maybe LMDB can be used in a better way to optimize it for background desktop access patterns, otherwise it may need to be replaced with some other backend that's better at writing data to the database (aka, less scattering of data over time): LMDB is optimized for reads and appends, much less for random writes (but the latter is the most prominent access pattern for a desktop search index). So if we stay with LMDB, baloo needs to be optimized to prevent rewrites in favor of appends - without blowing up the DB size too much. It may mean to purge still existing data from the LMDB mmap in favor of a bigger continuous block of free DB memory. Also, aggressive write coalescing is needed to avoid fragmentation access patterns in filesystems.
Comment 19 Kai Krakow 2021-04-28 17:07:55 UTC
BTW: Such a UUID-to-deviceId mapping table would allow baloo to properly support most yet unsupported filesystems, probably also zfs. With such an idea implemented, the only requirement left to a supported filesystem would be that it has stable inodes across re-mounts/re-boots (most have, some don't) and supports reverse lookups (inode to path).

The problematic design decision is how baloo identifies files: each file is assigned a devId/inodeId number (each lower 32-bit only, combined into a 64-bit fileId). If this magic number changes (that happens in zfs, btrfs, nfs...), the file appears as new. But neither Linux nor POSIX state anywhere that this can be used as an id to uniquely identify files - unless you never remount or reboot. Also, re-used inode numbers (especially after clipping at 32 bit) will completely mess up and confuse baloo.
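
To illustrate both failure modes, here is a small sketch of such an ID. It is not baloo's code: make_file_id() is a made-up name, and which half holds the device is an assumption taken from the balooshow outputs earlier in this report (device in the lower half).

```python
MASK32 = 0xFFFFFFFF


def make_file_id(dev_id, inode):
    """Combine the lower 32 bits of inode and device into one 64-bit ID.

    Assumed layout: inode in the upper half, device in the lower half.
    """
    return ((inode & MASK32) << 32) | (dev_id & MASK32)


# Same file, but the device number changed across a reboot: a "new" file.
print(format(make_file_id(0x35, 24588954), "x"))  # 177329a00000035
print(format(make_file_id(0x37, 24588954), "x"))  # 177329a00000037

# Two different inodes that are 2**32 apart collapse into the same ID.
print(make_file_id(0x35, 7) == make_file_id(0x35, 7 + 2**32))  # True
```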

So this needs a multi-step fix: First (and most importantly) introduce virtual deviceIds by implementing a mapping table "volume/subvolId <-> virtualDeviceId" where virtualDeviceId would be a monotonically increasing number used uniquely throughout the index as a device id. Next step: Enlarge fileIds from 64 to 128 bit, so it can be crafted from 64-bit devid/inode without clipping/wraparound.
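
A minimal sketch of both steps, purely to show the idea: the class and function names are hypothetical, the table lives in memory here (a real one would have to be stored in the index itself), and the subvolume ids in the usage lines are made-up values.

```python
class VirtualDeviceTable:
    """Map (filesystem UUID, subvolume id) to a stable virtual device id."""

    def __init__(self):
        self._by_volume = {}
        self._next_id = 1  # monotonically increasing, never reused

    def virtual_id(self, fs_uuid, subvol_id=0):
        key = (fs_uuid, subvol_id)
        if key not in self._by_volume:
            self._by_volume[key] = self._next_id
            self._next_id += 1
        return self._by_volume[key]


def make_file_id_128(virtual_dev, inode):
    """Build a 128-bit ID from a full 64-bit inode and 64-bit virtual device."""
    mask64 = (1 << 64) - 1
    return ((inode & mask64) << 64) | (virtual_dev & mask64)


table = VirtualDeviceTable()
home = table.virtual_id("19af1d21-f5c2-4518-a839-a3f2afdb199c", 257)
var = table.virtual_id("19af1d21-f5c2-4518-a839-a3f2afdb199c", 258)
print(home, var)  # 1 2
# Asking again for the same volume returns the same id, whatever st_dev says.
print(table.virtual_id("19af1d21-f5c2-4518-a839-a3f2afdb199c", 257))  # 1
```

The point of the design is that the key comes from the filesystem itself, so it does not depend on mount order or on which device node the kernel happens to assign.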

On the pro side, such a mapping table would also make it possible to properly clean up index data from the DB for file systems that are no longer needed. Currently, baloo never knows whether a file system will appear again or not. This could be implemented in one of the later steps as some sort of housekeeping optimization.
Comment 20 tagwerk19 2021-05-01 18:44:16 UTC
(In reply to Kai Krakow from comment #18)
> ... I suggest to
> read that entirely to understand the problem ...
I've done my best :-) Thank you for the info!

In:

    https://bugs.kde.org/show_bug.cgi?id=404057#c35

You have the idea of an "Index per Filesystem" but then the idea seems to have been put to the side. You mention "storage path" as a problem? Would the way "local wastebaskets" are managed on mounted filesystems be a model? They have to deal with the same issues as you've listed.

    https://phabricator.kde.org/T9805 

Has a mention of "... inside encrypted containers", see this also in Bug 390830.

As background thoughts...

    Things like "Tags:" folders in Dolphin and incremental searches
    when you type into Krunner depend on baloosearch being lightning fast.

    It would be a shame to lose the ability to search for phrases as in
        baloosearch Hello_Penguin
    as opposed to
        baloosearch "Hello Penguin"

    I'm guessing BTRFS usage is going to grow.
Comment 21 tagwerk19 2021-05-01 19:22:31 UTC
As a workaround in openSUSE, my test install had an /etc/fstab:

    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /                       btrfs  defaults                      0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /var                    btrfs  subvol=/@/var                 0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /usr/local              btrfs  subvol=/@/usr/local           0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /srv                    btrfs  subvol=/@/srv                 0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /root                   btrfs  subvol=/@/root                0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /opt                    btrfs  subvol=/@/opt                 0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /home                   btrfs  subvol=/@/home                0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /boot/grub2/x86_64-efi  btrfs  subvol=/@/boot/grub2/x86_64-efi  0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /boot/grub2/i386-pc     btrfs  subvol=/@/boot/grub2/i386-pc  0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /.snapshots             btrfs  subvol=/@/.snapshots          0  0
    UUID=a500b70e-4811-4db2-ab79-34e99b142b57  swap                    swap   

It seems that the BTRFS mounts are performed in parallel and there seems to be no option to specify that specific mounts appear with fixed device numbers.

It is however possible to add "x-systemd.requires" options in the /etc/fstab that suggest "an order" that mounts are done in - and the device numbers seem to be allocated in the order of the mounts.

This can only be described as a hack and quite likely fragile.

If "/home" is set to depend on "/" and "/.snapshots" and the other BTRFS subvols are set to depend on "/home", then the mount order is better defined and the device number allocated for /home *seems* stable.

With the "x-systemd.requires" options added, my /etc/fstab looks like:

    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /                       btrfs  defaults                      0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /var                    btrfs  subvol=/@/var,x-systemd.requires=/home    0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /usr/local              btrfs  subvol=/@/usr/local,x-systemd.requires=/home  0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /srv                    btrfs  subvol=/@/srv,x-systemd.requires=/home    0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /root                   btrfs  subvol=/@/root,x-systemd.requires=/home   0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /opt                    btrfs  subvol=/@/opt,x-systemd.requires=/home    0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /home                   btrfs  subvol=/@/home,x-systemd.requires=/,x-systemd.requires=/.snapshots       0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /boot/grub2/x86_64-efi  btrfs  subvol=/@/boot/grub2/x86_64-efi,x-systemd.requires=/home  0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /boot/grub2/i386-pc     btrfs  subvol=/@/boot/grub2/i386-pc,x-systemd.requires=/home  0  0
    UUID=19af1d21-f5c2-4518-a839-a3f2afdb199c  /.snapshots             btrfs  subvol=/@/.snapshots          0  0
    UUID=a500b70e-4811-4db2-ab79-34e99b142b57  swap                    swap   defaults                      0  0
Comment 22 Kai Krakow 2021-05-02 11:22:49 UTC
(In reply to tagwerk19 from comment #20)
> (In reply to Kai Krakow from comment #18)
> > ... I suggest to
> > read that entirely to understand the problem ...
> I've done my best :-) Thank you for the info!
> 
> In:
> 
>     https://bugs.kde.org/show_bug.cgi?id=404057#c35
> 
> You have the idea of an "Index per Filesystem" but then the idea seems

I didn't... I explained why that would not work.

> to have been put to the side. You mention "storage path" as a problem? Would
> the way "local wastebaskets" are managed on mounted filesystems be a model?
> They have to deal with the same issues as you've listed.

The problem is that you would have to deal with proper synchronization when multiple databases are used. That is not just "find a writeable storage location and register this location somewhere". Also, you would need to have all these different DBs opened at the same time, and LMDB is a memory mapped database with random access patterns. So you'd multiply the memory pressure with each location, and that will dominate the filesystem cache.

>     https://phabricator.kde.org/T9805 

This mentions "store an identifier per tracked device, e.g the filesystem UUID" which is probably my idea. Instead of using dev_id directly, the database should have a lookup table where filesystem UUIDs are stored as a simple list. The index of this list can be used as the new dev_id for the other tables.

> Has a mention of "... inside encrypted containers", see this also in Bug
> 390830.

Encrypted containers should never be indexed in a global database as that would leak information from the encrypted container. The easiest solution would be to just not index encrypted containers unless the database itself is stored in an encrypted container - but that's also just a band-aid. Maybe encrypted containers should not be stored at all. Putting LMDB on an encrypted container may have very bad side-effects on the performance side.

> As background thoughts...
> 
>     Things like "Tags:" folders in Dolphin and incremental searches
>     when you type into Krunner depend on baloosearch being lightning fast.

Having multiple databases per filesystem can only make this slower by definition because you'd need to query multiple databases. From my personal experience with fulltext search engines (ElasticSearch) I can only tell you that querying indexes and recombining results properly is a huge pita, and it's going to slow things way down. So the multiple database idea is probably a dead end.

>     It would be a shame to lose the ability to search for phrases as in
>         baloosearch Hello_Penguin
>     as opposed to
>         baloosearch "Hello Penguin"
> 
>     I'm guessing BTRFS usage is going to grow.

The point is: Neither Linux nor POSIX state anywhere that a dev_id from stat() is unique across reboots or remounts. This is even less true for inode numbers with some remote filesystems or non-inode filesystems (where inode numbers are virtual and may be allocated from some runtime state). Those are not stable ids. At least for native Linux-filesystems we can expect inode numbers to be stable as those are stored inside the FS itself (the dev_id isn't but UUID is).

On a side-note: In this context it would make sense to provide baloo as a system-wide storage and query service shared by multiple users, with an indexer running per user (to index encrypted containers). It's the only way to support these ideas:

- safe access to encrypted containers
- the database can be isolated from being readable by users (prevents
  information leakage)
- solves the problem of multiple users indexing the same data multiple times
- has capabilities to properly read UUIDs from filesystems/subvolumes (some
  FS only allow this for root)
- can guard/filter which results are returned to users (by respecting FS ACLs
  and permission bits)
- shared index location (e.g. /usr/share/docs) would be indexed just once

On the contra side:

- needs some sort of synchronization between multiple indexers (to avoid
  race conditions where multiple indexers read and index the same files
  twice); this could be solved by running the indexer within the system-wide
  service, too, but access to encrypted containers then needs to be evaluated
Comment 23 Stefan Brüns 2021-05-02 12:55:45 UTC
There is nothing new to add here. Please refrain from any further comments if they do not add any new information.

1. Baloo uses inodes and device ids as document IDs. inodes can be considered stable on all supported file systems, while device ids will (may) change when adding additional drives, plugging external storage in varying order, and apparently also for BTRFS subvolumes. This is a current design limitation. All this has been known for years, and also mentioned in some Phabricator tasks. Nothing new here.

2. Work has been under way to fix this for quite some time, and several places where db storage layer and filesystem layer were tightly coupled have already been cleaned up, though this is work in progress.

3. This restructuring takes time, and I only do this in my spare time. Tons of rude and abusive comments on e.g. reddit/kde and Phoronix have taken their toll, and I no longer spend the amount of time on Baloo (and KFileMetadata for the extractors) I once did. Lack of review(ers) also does not help.

If you really want to support development and show some appreciation, I have a Liberapay account: https://liberapay.com/StefanB/donate
Comment 24 tagwerk19 2021-05-08 06:46:12 UTC
(In reply to tagwerk19 from comment #21)
> It is however possible to add "x-systemd.requires" options in the /etc/fstab
> that suggest "an order" that mounts are done in - and the device numbers
> seem to be allocated in the order of the mounts.
> 
> ... quite likely fragile ...
With further tests, too fragile :-(

Need either a way of specifying that a mount uses a given device number, or a way for baloo to adapt to running on such "shifting sands"
Comment 25 tagwerk19 2021-05-13 20:42:45 UTC
(In reply to tagwerk19 from comment #24)
> With further tests, too fragile :-(
Maybe the "subvolid" that findmnt gives you is better:

    findmnt -T testfile.txt

reports the mount point, the filesystem type and a "subvolid" (in the case of BTRFS).

It seems that "subvolid" is stable. It is possible to change it but it doesn't change on its own (as far as I can tell...)
Comment 26 Stefan Brüns 2021-05-13 21:24:19 UTC
Getting a stable ID is not the hard part, but changing everything in the internals to use an indirection layer is.
Comment 27 t.schmittlauch 2021-06-01 16:37:00 UTC
While the issue seems to be clear now, I'd like to add a baloo log message supporting that. This should also make this bug more discoverable when people are searching for it.

In my system journal, I get the following message for each indexed file:

kf.baloo: "/home/some/file" id seems to have changed. Perhaps baloo was not running, and this file was deleted + re-created

This is in line with the brought up internal ID changes.
Comment 28 Joachim Wagner 2022-01-13 12:32:13 UTC
@Kai: Is there a bug / feature request for the system-wide indexing you mention? I'd like to add more to the contra side.

@Stefan: Rather than changing code in baloo to implement the mapping from UUID+subvolumeID to internal-fs-ID, how about executing baloo in a wrapper that redefines `stat()` to modify st_dev? Yes, this is a hack but it may be enough while waiting for https://github.com/util-linux/util-linux/issues/1562 .
Comment 29 Joachim Wagner 2022-01-13 15:45:14 UTC
An alternative to relying on UUIDs and sub-volume IDs is to assume mount points of filesystems do not change and to proceed as follows:

* Have a persistent table `I` mapping mount points to internal filesystem ID (currently the device number stat.st_dev).
* In each run, start with an empty table `M` mapping device numbers to mount points.
* During indexing, query stat.st_dev as usual. If stat.st_dev is not yet in `M` find out what the mount point is and add it. Otherwise, obtain the mount point from `M`. (We could do without table `M` but that would be slow and table `M` is expected to stay tiny. If in doubt, use proper cache logic to limit the size of `M`.)
* Look up our internal filesystem ID in `I` with the mount point. If not in `I` yet allocate a new ID for it.
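
The steps above can be sketched as follows, in Python for illustration only. The class `MountPointIds`, the helper `find_mount_point` and the injectable `stat` / `mount_point_of` parameters are hypothetical names, not baloo code; `os.path.ismount()` is used as a simple stand-in for "find out what the mount point is":

```python
import os
from typing import Callable, Dict, Optional, Tuple


def find_mount_point(path: str) -> str:
    """Walk up from path until os.path.ismount() reports a mount point."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


class MountPointIds:
    def __init__(self, persistent: Optional[Dict[str, int]] = None,
                 stat: Callable = os.stat,
                 mount_point_of: Callable[[str], str] = find_mount_point):
        # Table I: mount point -> internal filesystem ID (to be persisted).
        self.I: Dict[str, int] = dict(persistent or {})
        # Table M: st_dev -> mount point (starts empty in each run).
        self.M: Dict[int, str] = {}
        self._stat = stat
        self._mount_point_of = mount_point_of

    def doc_id(self, path: str) -> Tuple[int, int]:
        """Return (internal filesystem ID, inode number) for path."""
        st = self._stat(path)
        mount_point = self.M.get(st.st_dev)
        if mount_point is None:
            # First file seen on this device in this run: do the slow lookup.
            mount_point = self._mount_point_of(path)
            self.M[st.st_dev] = mount_point
        if mount_point not in self.I:
            # New mount point: allocate the next internal ID.
            self.I[mount_point] = max(self.I.values(), default=0) + 1
        return self.I[mount_point], st.st_ino
```

As long as `I` is saved and passed back in as `persistent` on the next run, a file keeps its pair even when stat.st_dev differs after a reboot, provided the mount point has not moved.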
Comment 30 tagwerk19 2022-01-14 08:16:11 UTC
(In reply to Joachim Wagner from comment #28)
> https://github.com/util-linux/util-linux/issues/1562 .
Like :-)

Seems that there are two options here: a fix to filesystem/mount to permit a "specified" device number (minor device number...) or a reengineering of baloo to use a longer-and-unique Disc/Partition ID.

I'll pick out Stefan's Comment #26:
> Getting a stable ID is not the hard part, but changing everything in the internals to use an indirection layer is.

Stefan has stepped down as maintainer.

Baloo improved *massively* on his watch and thanks are due. A new enthusiast would however be welcome.
Comment 31 tagwerk19 2022-01-14 08:32:36 UTC
(In reply to Joachim Wagner from comment #29)
> An alternative to relying on UUIDs and sub-volume IDs is to assume mount
> points of filesystems do not change and to proceed as follows...
Apologies, I fear you'll have to step through your process for me. I'm somehow missing something...

If I look on Tumbleweed I can see results from:

    stat testfile
    stat -f testfile
    findmnt -nT testfile

These give me the major/minor device numbers + inode of the testfile, the "filesystem ID" and mount point (with BTRFS subvol/subvolid).
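
For anyone who wants to script the same checks, here is a small Python sketch that reads roughly the same values through `os.stat()` and `os.statvfs()`. The function name `describe` is made up for this example, and the way `f_fsid` is represented may not match the "ID:" string printed by `stat -f`:

```python
import os


def describe(path: str) -> dict:
    """Collect device numbers, inode and filesystem ID for one file."""
    st = os.stat(path)
    vfs = os.statvfs(path)
    return {
        "major": os.major(st.st_dev),  # first number of "Device: 0,40"
        "minor": os.minor(st.st_dev),  # second number of "Device: 0,40"
        "inode": st.st_ino,
        "fsid": vfs.f_fsid,            # needs Python 3.7 or newer
    }
```

Calling `describe("testfile")` after each reboot and comparing the dictionaries shows which of the values stay put. The BTRFS subvolid is not available this way and still needs findmnt.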

The minor device number jumps around with reboots. The filesystem ID, subvol and subvolid seem solid. Snippets of my last reboots and updates:

    Device: 0,40    Inode: 2506
    ID: cef844b93a5a00ff
    BTRFS: subvolid=263,subvol=/@/home

    Device: 0,39    Inode: 2506
    ID: cef844b93a5a00ff
    BTRFS: subvolid=263,subvol=/@/home

    Device: 0,46    Inode: 2506
    ID: cef844b93a5a00ff
    BTRFS: subvolid=263,subvol=/@/home

    Device: 0,40    Inode: 2506
    ID: cef844b93a5a00ff
    BTRFS: subvolid=263,subvol=/@/home

    Device: 0,41    Inode: 2506
    ID: cef844b93a5a00ff
    BTRFS: subvolid=263,subvol=/@/home

    Device: 0,43    Inode: 2506
    ID: cef844b93a5a00ff
    BTRFS: subvolid=263,subvol=/@/home 

At the moment, with every reboot, baloo indexes or reindexes the testfile "under" its new docID (device number/inode) and over time, it gathers quite a collection of entries:

    baloosearch -i testfile

    9ca00000027 /home/test/testfile
    9ca0000002f /home/test/testfile
    9ca0000002e /home/test/testfile
    9ca0000002d /home/test/testfile
    9ca0000002c /home/test/testfile
    9ca0000002b /home/test/testfile
    9ca0000002a /home/test/testfile
    9ca00000029 /home/test/testfile
    9ca00000028 /home/test/testfile

The mapping would have to work when indexing (going from full filename to an invariant, unique, internal docID) and when searching (going from the docID to the canonical filename).
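
The docIDs listed above can be taken apart to show the effect. This Python sketch assumes the layout that the listing suggests (inode number in the upper 32 bits, device number in the lower 32 bits) and that the major device number is 0, so that the device number equals the minor; `split_doc_id` is a name made up for this example:

```python
from typing import Tuple


def split_doc_id(doc_id_hex: str) -> Tuple[int, int]:
    """Split a hex docID into (device number, inode number)."""
    doc_id = int(doc_id_hex, 16)
    device = doc_id & 0xFFFFFFFF  # lower 32 bits (assumed: device number)
    inode = doc_id >> 32          # upper 32 bits (assumed: inode number)
    return device, inode


for doc_id in ("9ca00000027", "9ca0000002e"):
    device, inode = split_doc_id(doc_id)
    print(f"{doc_id}: device={device} inode={inode}")
# prints:
# 9ca00000027: device=39 inode=2506
# 9ca0000002e: device=46 inode=2506
```

The inode part is 2506 in every entry and only the device part differs, which matches the "Device: 0,39" and "Device: 0,46" snippets from the reboots.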
Comment 32 Joachim Wagner 2022-01-14 12:20:33 UTC
(In reply to tagwerk19 from comment #31)
> Apologies, I fear you'll have to step through your process for me. I'm
> somehow missing something...
> [...]
> The mapping would have to work when indexing (going from full filename to an
> invariant, unique, internal docID)

I only described the indexing part. The docID is the pair (filesystemID, inode_number) where filesystemID := I(mount_point(filepath)). M is only introduced to make determining mount_point(filepath) more efficient by using cached values M(stat.st_dev(filepath)). The number of cache entries never exceeds the number of mounted filesystems.

> and when searching (going from the docID
> to the canonical filename).

To get from docID to the filepath, without storing the filepath, one can maintain a reverse map of I to get the mount point for a given internal filesystem ID. Once one has the mount point, one can get the current stat.st_dev for the filesystem which is currently used to get the filepath for a given inode_number. 

I am suggesting this alternative as the current proposal requires filesystem-specific code such as looking for the special string "subvolid" in findmnt output. Another filesystem may call it something else. One doesn't want to write code for each possible filesystem and update it each time somebody publishes a new filesystem.
Comment 33 tagwerk19 2022-01-15 09:07:12 UTC
(In reply to Joachim Wagner from comment #32)
> ... a given internal filesystem ID ...
Maybe that's where I'm getting muddled...

statvfs and "stat -f" give a 64 bit "Filesystem ID" and I was imagining you were talking about that. If I've followed the breadcrumbs right this comes from the UUID (for BTRFS). Ref:
    http://lkml.iu.edu/hypermail/linux/kernel/0809.0/0593.html

It looks straightforward to get the filesystem ID for a file. However, it needs more space than a device number and thus a lookup table.

> ... One doesn't want
> to write code for each possible filesystem and update it each time somebody
> publishes a new filesystem ...
Perhaps the f_fsid field is sufficient
Comment 34 Joachim Wagner 2022-01-15 17:50:28 UTC
(In reply to tagwerk19 from comment #33)
> statvfs and "stat -f" give a 64 bit "Filesystem ID" and I was imagining you
> were talking about that.

No, I meant "baloo-internal filesystem ID", a sequentially allocated number as in the proposal discussed before. Difference in my proposal is that a new mount point triggers the allocation, rather than a new UUID+subvolid pair that may be difficult to obtain.

>     http://lkml.iu.edu/hypermail/linux/kernel/0809.0/0593.html

It says "For bfs and xfs it's the block device". This means the ID from "stat -f" is NOT suitable as a filesystem ID as the block device major:minor can change. Examples:
(1) 2 or more NVMe SSDs: While the first SSD is always /dev/nvme0n1 and the 2nd /dev/nvme1n1, it is random which one gets 259:0.
(2) 2 or more dm-crypt devices with same iter-time: It is random which one becomes /dev/dm-0, which always is 254:0.

> It looks straightforward to get the filesystem ID for a file.

I haven't seen yet anywhere here a filesystem ID that is stable across restarts and accessible in a standardised way for any filesystem type. Hence my proposal to move away from system-provided IDs and to use the mount point as an identifier instead.
Comment 35 tagwerk19 2022-01-16 09:03:38 UTC
(In reply to Joachim Wagner from comment #34)
> ... Hence my proposal
> to move away from system-provided IDs and to use the mount point
> as an identifier instead ...
Accepted.

Although I think we need to look at "what we can trust most".

If ext2/3/4, BTRFS, NTFS give a stable filesystem ID, we should make the most of it to help when mounting storage on a different mount point (saying, yes, we know this disc) or when mounting different storage on a fixed mount point (this isn't the disc it used to be). If the mount point and Filesystem ID disagree, provided it's a reliable Filesystem ID, we should go with that Filesystem ID.

This would mean including the filesystem ID in your "I" table and making careful judgements when a disc is seen to move, vanish or reappear.

Having kept tabs on baloo issues for a couple of years, the majority of the "reindexing" or "duplicated results" issues have been from OpenSUSE and thus BTRFS with multiple subvols. I don't remember seeing any reports mentioning XFS but then you are not prompted for filesystem type when submitting a bug report. Maybe there were some that mentioned Mandriva but I never got to the bottom of those. I don't know the status with ZFS.

If we wanted an intellectual challenge to shake out the edge cases, we can think how to deal with symbolic links 8-]
Comment 36 tagwerk19 2022-01-16 09:20:32 UTC
(In reply to tagwerk19 from comment #35)
> ... mentioned Mandriva ...
Maybe Manjaro ...
Comment 37 Joachim Wagner 2022-01-16 15:08:50 UTC
(In reply to tagwerk19 from comment #35)
> Accepted.
> Although I think we need to look at "what we can trust most".
> [...]
> This would mean including the filesystem ID in your "I" table and careful
> making judgements when a disc is seen to move, vanish or reappear.

Yes, a hybrid approach would be a good default as long as the filesystem ID does not change with the major:minor of the block device. For filesystems for which baloo does not know how to get a filesystem ID the ID could be "N/A" and any transition between N/A and a proper ID would also mean that the filesystem is new.
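
One possible reading of this hybrid rule as a Python sketch. `Entry`, `HybridTable` and `lookup` are hypothetical names, `None` stands in for "N/A", and the fsid strings in the usage note are placeholders; how a reliable filesystem ID would actually be obtained is left open here:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    internal_id: int
    mount_point: str
    fsid: Optional[str]  # None means "N/A": no trustworthy filesystem ID


class HybridTable:
    """Sketch of table I extended with a filesystem ID column."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def lookup(self, mount_point: str, fsid: Optional[str]) -> int:
        if fsid is not None:
            # A reliable filesystem ID wins over the mount point.
            for entry in self.entries:
                if entry.fsid == fsid:
                    entry.mount_point = mount_point  # known disc, maybe moved
                    return entry.internal_id
        else:
            # No filesystem ID: fall back to the mount point, but only
            # match entries that were also recorded as N/A.
            for entry in self.entries:
                if entry.fsid is None and entry.mount_point == mount_point:
                    return entry.internal_id
        # Unknown ID, changed ID on a known mount point, or a transition
        # between N/A and a proper ID: treat as a new filesystem.
        entry = Entry(len(self.entries) + 1, mount_point, fsid)
        self.entries.append(entry)
        return entry.internal_id
```

For example, `lookup("/home", "fs-1")` followed by `lookup("/mnt/x", "fs-1")` returns the same internal ID (same disc, new mount point), while `lookup("/home", "fs-2")` allocates a new one (different disc on the old mount point).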

> Having kept tabs on baloo issues for a couple of years, the majority of the
> "reindexing" or "duplicated results" issues have been from OpenSUSE [...]

The openSUSE installer uses btrfs by default.

> [...] I don't remember seeing any reports mentioning
> XFS [...]  I don't know the status with ZFS.

I'd think XFS users typically either have a simple setup or use LVM on top of a complex storage setup and LVM seems to allocate the /dev/dm-* devices in a predictable order; at least I was using this setup for many years without baloo reindexing repeatedly.

> think how to deal with symbolic links 8-]

This should go into a separate feature or documentation request. I see 4 decisions to make, either hard-coded or configurable:
(1) Symlinks to other folders: If the target folder is indexed anyway the link can be ignored. If not, the default probably should be not to follow the link as the target folder is under a folder that the user specifically excluded from indexing. (follow yes/no)
(2) Indexing of the path of the target: One could index symlinks treating them like text files that contain just the target path as plain text. (index path yes/no)
(3) Content indexing for symlinks to files: If the target is indexed anyway question is whether to enter the symlink as a duplicate result under a different name. If not, like for folders, the default probably should be not to index the file but this probably should be configurable as users may want to use symlinks to bring otherwise excluded files into the index.
Comment 38 tagwerk19 2022-01-18 08:17:11 UTC
(In reply to Joachim Wagner from comment #37)
> (1) Symlinks to other folders: If the target folder is indexed anyway the
> link can be ignored. If not, the default probably should be not to follow
> the link as the target folder is under a folder that the user specifically
> excluded from indexing.
Symlinks provide a bit of an edge case :-)

There is a stream of issues reported. At the moment baloo deliberately avoids following symlinks when indexing whereas dolphin searches do follow them. There's a summary under Bug 447119

A commonly reported scenario is that people have a separately mounted disc with a symlink to it (as a way to give extra space for ~/Pictures, ~/Videos or whatever).

What might (should?) happen here if we look at mount points?

I can see:
    stat -f ~/symlinkto/myfile
or:
    findmnt -nT ~/symlinkto/myfile
give the Filesystem ID and mount point for the destination disc and ignore the fact that you have followed a symlink to get to it. I'd say it makes sense to deal with the canonical names (on the destination device) while indexing and do any adjustments to search results wrt symlinks when returning search results.

Does the "mount point" idea work here?
Comment 39 Joachim Wagner 2022-01-19 11:05:07 UTC
(In reply to tagwerk19 from comment #38)
> Symlinks provide a bit of an edge case :-)
> [...]
> Does the "mount point" idea work here?

I don't know the internals of the indexer implementation so I cannot say for sure. I would have thought the current indexer calls `stat()` on every file and therefore will have no problem noticing it is on a different filesystem after following a symlink. If following symlinks would pose a problem to the current indexer this means the indexer works differently than I thought.

Switching to using the mount point, filesystem ID and subvolid, I'd again have assumed these three are queried for every file to be indexed (using a volatile cache with stat.st_dev as the key to speed things up). If this check is performed for every file to be indexed I don't see how there would be any problem when following symlinks, other than surprising users who thought that adding a folder to "Do not search in these locations" (GUI) will exclude its contents from the index.
Comment 40 Lukas Ba. 2022-03-04 21:50:58 UTC
My baloo index file is 32GiB large right now, more than any other folder on my file system, and my file system is filled up by 100%, my PC crashed during an update and doesn't boot anymore because there is no linux kernel. Thanks baloo.
Comment 41 tagwerk19 2022-03-05 12:12:32 UTC
(In reply to Lukas Ba. from comment #40)
> My baloo index file is 32GiB large right now, more than any other folder on
> my file system, and my file system is filled up by 100%, my PC crashed
> during an update and doesn't boot anymore because there is no linux kernel.
> Thanks baloo.
OpenSUSE? (and multiple BTRFS subvolumes)?
Comment 42 Lukas Ba. 2022-03-05 12:20:01 UTC
(In reply to tagwerk19 from comment #41)
> (In reply to Lukas Ba. from comment #40)
> > My baloo index file is 32GiB large right now, more than any other folder on
> > my file system, and my file system is filled up by 100%, my PC crashed
> > during an update and doesn't boot anymore because there is no linux kernel.
> > Thanks baloo.
> OpenSUSE? (and multiple BTRFS subvolumes)?

ArchLinux, with multiple BTRFS subvolumes, my setup is described here

https://wiki.archlinux.org/title/Snapper#Suggested_filesystem_layout
Comment 43 tagwerk19 2022-03-06 07:12:01 UTC
(In reply to Lukas Ba. from comment #42)
> ... ArchLinux, with multiple BTRFS subvolumes ...
You could try the tests in Comment 12, Comment 13

Very likely that baloo is "seeing" your home drive mounted with different minor device numbers and assuming that all the files it sees are new files. Not good.

There's a possible mitigation in Comment 21 that may be worth building on. It is a hack, fragile and I suspect I've seen the Minor Device Number jump even with it in place but I think it's an improvement.

... I missed the fact that Arch runs with BTRFS
Comment 44 Joachim Wagner 2022-03-07 11:12:02 UTC
> [...] home drive mounted with different
> minor device numbers

This phrasing is likely to cause confusion. Better to refer to the value `stat.st_dev` that is used by baloo and that does not have a "minor":

* Block devices have major and minor device numbers that used to be 8 bits each but have been extended to a wider range about 15 years ago. These are stable across restarts for hard drive partitions but are allocated dynamically for device mapper devices, e.g. LUKS encryption layers. NVMe SSDs have been observed to receive a single major such that the minors of the second SSD and its partitions change (at next restart) when the number of partitions on the first SSD is modified.
* Filesystems have a device number that some (single volume) filesystems derive from the device number and other filesystems, e.g. btrfs, set in some other way. They are supposed to be unique for each filesystem over the uptime of a system but may change at each restart. The stat() system call returns this value as `stat.st_dev`. Filesystems with subvolumes produce a different device number for each subvolume.
Comment 45 tagwerk19 2022-03-07 15:48:02 UTC
(In reply to Joachim Wagner from comment #44)
> ... This phrasing is likely to cause confusion ...
Accepted. There are many layers (and history) here, thanks for the explanation.

The challenge is to find a solid procedure for troubleshooting (as in compare the results from "stat" and "balooshow") and simple terms to use when describing them :-/
Comment 46 Lukas Ba. 2022-05-30 01:24:08 UTC
Thank you, Joachim Wagner and tagwerk19@innerjoin.org for your insightful comments, and Stefan Brüns and all the contributors for your efforts.

My inputs:

We need a way to list all the filesystems that are part of the index. (This would increase visibility into what is going on for bug reports and users' understanding of what baloo is doing.)
Ideally the command would show the date when the file system was last mounted.

Files on filesystems that are not mounted should not be the result of a search. However, these files should remain on the index, to support the indexing of removable drives that may or may not be mounted at each boot, and should not be cleaned up automatically.

We need a command to clean certain file systems from the index. Also a form of this command to clean all the file systems that are not currently mounted. Some removable drives may never come back and we don't need them on the index anymore, let the user decide if they want to delete them from the index.