Bug 438434 - Baloo appears to be indexing twice the number of files than are actually in my home directory
Status: RESOLVED DUPLICATE of bug 401863
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: general
Version: 5.82.0
Platform: Debian unstable Linux
Importance: NOR normal
Target Milestone: ---
Assignee: baloo-bugs-null
Reported: 2021-06-10 20:10 UTC by Martin Steigerwald
Modified: 2022-03-06 09:33 UTC

Description Martin Steigerwald 2021-06-10 20:10:26 UTC
SUMMARY

After I copied my home directory over to my new laptop and purged the Baloo database, I thought Baloo had completed the initial indexing run. Instead, it appeared to just index all of those files again.

STEPS TO REPRODUCE

I have no idea.

OBSERVED RESULT

With "balooctl monitor" I saw that Baloo indexes files it must have seen during the first indexing run.

% balooctl status
Baloo File Indexer is running
Indexer state: Indexing file content
Total files indexed: 1,102,793
Files waiting for content indexing: 260,161
Files failed to index: 0
Current size of index is 7.27 GiB


EXPECTED RESULT

But I only have:

find . -not -path '*/\.*' | wc -l
580550

Please never ever index the same files again unless their contents actually changed.



SOFTWARE/OS VERSIONS

Linux/KDE Plasma: Devuan Ceres with KDE packages from Debian Experimental

(available in About System)

KDE Plasma Version: 5.21.5

KDE Frameworks Version: 5.82.0

Qt Version: 5.15.2


ADDITIONAL INFORMATION

"/home" is on a single BTRFS filesystem which is located on top of an LVM on top of a LUKS encrypted partition.

In case this again is something to do with a changing identification of the filesystem, please add an option to tell Baloo: "My $HOME is my $HOME, it is always the same filesystem. Do not ever re-index anything on it, unless it has changed. Thanks."

Actually this should be the default. There is no reason whatsoever to assume that $HOME on a usual desktop PC or laptop will at one point be a completely different filesystem with completely different content. It should be the default that $HOME is *not* on a removable device.

Baloo and Akonadi Search managed to somewhat bog down even my new ThinkPad T14 with an 8-core AMD Ryzen 7 PRO 4750U and a 2 TiB Samsung 980 Pro. That is just not right. This machine is ridiculously fast, and still that search thing keeps the machine under quite a load for hours and hours and hours to come. There is something definitely not right here.
Comment 1 tagwerk19 2021-06-10 20:46:46 UTC
It might be worth looking at:
    https://bugs.kde.org/show_bug.cgi?id=402154#c12

Baloo expects the device number / inode for files to be stable (not change every reboot). With certain filesystems/distributions the device number can change; with remote filesystems it seems that the inode can also change.

Try the test with "stat" and "balooshow -x" and see what you see.

The 402154 bug was related to openSUSE and multiple BTRFS subvolumes. It could be that you are caught by the same issue.
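
For anyone who wants to repeat the "stat" check across reboots without comparing the output by eye, here is a minimal sketch. It records the device number and inode of one file and compares them with the values saved by the previous run. The function names and the state file are hypothetical and have nothing to do with Baloo itself.

```python
import json
import os


def identity(path):
    """Return the device number and inode that stat reports for path."""
    st = os.stat(path)
    return {"dev": st.st_dev, "ino": st.st_ino}


def compare_with_last_run(path, state_file):
    """Compare the current identity of path with the one saved earlier.

    Returns None on the first run (nothing saved yet), otherwise True
    if device number and inode are unchanged and False if they differ.
    """
    current = identity(path)
    previous = None
    if os.path.exists(state_file):
        with open(state_file) as fh:
            previous = json.load(fh)
    # Remember the current values for the next run (for example after a reboot).
    with open(state_file, "w") as fh:
        json.dump(current, fh)
    if previous is None:
        return None
    return previous == current
```

Calling compare_with_last_run("testfile.txt", "identity.json") once per boot would return False on the boot where the device number differs.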
Comment 2 Martin Steigerwald 2021-06-10 21:13:02 UTC
I used a BTRFS RAID 1 before, but this time it is not.

"Baloo expects the device number / inode for files to be stable (not change every reboot)"

If it changes though, for whatever reason, even though I use a single BTRFS filesystem on the very same LUKS encrypted partition and on LVM, then the requirement that the device number is stable, is broken.

Remember, we are talking about $HOME. I'd say that in 99% of all desktop use cases, $HOME is not a wildly different filesystem on every reboot. So please, pretty please *stop* relying on an internal operating system detail (device number) to be stable for it.

It is all about usability here. Telling regular users to check whether their device numbers are stable *just* to make indexing work reliably is not going to fly regarding usability. I imagine asking my father to check for a device number… seriously… please stop… relying on OS internals like an inode number or even a device number to be stable.

This assumption is terminally broken, as has been shown here repeatedly.

Do you know any user of KDE Plasma who expects Baloo to reindex their unchanged files in $HOME, just because they may have a different $HOME on every reboot? If Baloo relies on this, this is at the least a bad design choice. I'd go further than that and I'd say it's terminally broken regarding usability.

Imagine if I find out that the device number would change… what would I do? Reinstall the system to match the assumptions of Baloo? Not going to happen.

Do whatever you need to do about removable media, but just, assume, pretty please assume, that $HOME will be the same directory tree on the same laptop for years to come. And even if I copy it to another laptop… why would Baloo even care? It is still the very same directory tree. Nothing, I repeat, nothing of interest for Baloo has changed. Baloo has no business whatsoever to use the device number for anything related to indexing.

Pretty please consider this input instead of dismissing it. The functionality is broken because it relies on a broken assumption. Please fix it.

Thank you dearly for your consideration.
Comment 3 Stefan Brüns 2021-06-10 21:32:10 UTC
I no longer work on Baloo; rude behavior by various users has made me stop.

This rude behavior includes treating me like an idiot.

Stop assuming you can make any demands, without giving back.
Comment 4 Nate Graham 2021-06-10 21:39:04 UTC
Mr. tagwerk19, you seem to be knowledgeable about Baloo; would you be interested in doing some development on it? We seem to be down one maintainer, so the field is wide open. :)
Comment 5 tagwerk19 2021-06-10 21:47:18 UTC
In the case here, does the info given by "stat" change on a reboot? Is it an instance of Bug 402154 or is it something new/something else?

I see you've been through all this before, cf Bug 404057, and can see that there's something that needs to be solved.
Comment 6 Martin Steigerwald 2021-06-11 14:52:47 UTC
@Stefan: I am grateful for all you did for Baloo. I know you improved it quite a bit. So thank you. There is nothing personal in here. First off I do not even know who implemented the dependency on the device number. But also in the case you did: You are not your code and you are also not your decision to do so. As you stepped down as a maintainer I am going to work on this with anybody who is willing to consider my report.

@Tagwerk19. Thank you for your response. I am willing to provide the information you requested. Here it is:

I created a file and told Baloo to index it with "balooctl index".

% LANG=en balooshow -x testfile.txt
23479b00000021 33 2312091 testfile.txt [/home/martin/testfile.txt]
        Mtime: 1623421630 2021-06-11T16:27:10
        Ctime: 1623421630 2021-06-11T16:27:10
        Cached properties:
                Line Count: 1

Internal Info
Terms: Mplain Mtext T5 T8 X20-1 hello penguin 
File Name Terms: Ftestfile Ftxt 
XAttr Terms: 
lineCount: 1

% LANG=en stat testfile.txt
  File: testfile.txt
  Size: 14              Blocks: 8          IO Block: 4096   regular file
Device: 21h/33d Inode: 2312091     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/  martin)   Gid: ( 1000/  martin)
Access: 2021-06-11 16:27:10.953499381 +0200
Modify: 2021-06-11 16:27:10.953499381 +0200
Change: 2021-06-11 16:27:10.953499381 +0200
 Birth: 2021-06-11 16:27:10.953499381 +0200



After reboot I get:

% LANG=en stat testfile.txt
  File: testfile.txt
  Size: 14              Blocks: 8          IO Block: 4096   regular file
Device: 21h/33d Inode: 2312091     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/  martin)   Gid: ( 1000/  martin)
Access: 2021-06-11 16:29:54.518299009 +0200
Modify: 2021-06-11 16:27:10.953499381 +0200
Change: 2021-06-11 16:27:10.953499381 +0200
 Birth: 2021-06-11 16:27:10.953499381 +0200

% LANG=en balooshow -x testfile.txt
23479b00000021 33 2312091 testfile.txt [/home/martin/testfile.txt]
        Mtime: 1623421630 2021-06-11T16:27:10
        Ctime: 1623421630 2021-06-11T16:27:10
        Cached properties:
                Line Count: 1

Internal Info
Terms: Mplain Mtext T5 T8 X20-1 hello penguin 
File Name Terms: Ftestfile Ftxt 
XAttr Terms: 
lineCount: 1


Of course that is no guarantee that the device number did not change at the time the re-indexing of already indexed files happened.

I did another "balooctl purge" and it now is indexing a reasonable amount of files:

% find . -type f -not -path '*/\.*' | wc -l
521488

% LANG=en balooctl status
Baloo File Indexer is running
Indexer state: Indexing file content
Total files indexed: 551,532
Files waiting for content indexing: 490,887
Files failed to index: 0
Current size of index is 367.83 MiB

And yes, I have been through quite some issues with Baloo (and Akonadi Search). There is another one with a file that causes the file extractor to go bonkers. I excluded the directory it is in.

I will keep that test file around for a while. Should Baloo try to index all files again I will have another look at the device number. Still I think it is broken to assume that the device number does not change. There are various examples where it can change. One would be a desktop PC with two or more controllers whose drivers compete for sda/sdb/sdc at every boot. I'd assume that all files in the users home directory are always on the very same filesystem by default.
Comment 7 tagwerk19 2021-06-11 17:00:07 UTC
(In reply to Martin Steigerwald from comment #6)
> ... Should Baloo try to index all
> files again I will have another look at the device number ...
I think it would be a reasonable explanation if you find yourself reindexing everything. It is certainly an issue with openSUSE...

> ... Still I think it
> is broken to assume that the device number does not change ...
OK. Let's say that if baloo can be made proof against this then that's a good thing :-)

> ... file that causes the file extractor to go bonkers
Maybe also in Bug 438074 "Baloo reindexing files on every start". That seems to be focussing in on some specific files/filetypes.

> Current size of index is 7.27 GiB
I've seen that the index size and memory use can balloon when deleting entries. Bug 437754.

> @Stefan: I am grateful for all you did for Baloo. I know you improved it
> quite a bit. So thank you ...
I will say the same.

It was baloo and the tag handling in Dolphin that made me a KDE user.

However when I started using KDE "for real", if I renamed a folder tree in Dolphin, I needed to log out and back in again to get back to a responsive system. That was just a couple of years ago.

You don't necessarily notice the steady development and step by step improvements but I find it remarkable how good baloo is at what it does and how much more solid it has become over the last years.
Comment 8 Martin Steigerwald 2021-06-19 22:08:07 UTC
Tagwerk. I still have:

% LANG=en stat testfile.txt
  File: testfile.txt
  Size: 14              Blocks: 8          IO Block: 4096   regular file
Device: 21h/33d Inode: 2312091     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/  martin)   Gid: ( 1000/  martin)
Access: 2021-06-17 19:32:43.970949695 +0200
Modify: 2021-06-11 16:27:10.953499381 +0200
Change: 2021-06-11 16:27:10.953499381 +0200
 Birth: 2021-06-11 16:27:10.953499381 +0200

Yet Baloo does this:
LANG=en balooctl status
Baloo File Indexer is running
Indexer state: Indexing file content
Total files indexed: 1,087,500
Files waiting for content indexing: 269,002
Files failed to index: 0
Current size of index is 7.06 GiB

It currently indexes files in my home directory that it should have picked up in the initial run and that did not change in between. It appears to me that it is indexing all the files again.
Comment 9 tagwerk19 2021-06-20 10:33:26 UTC
So, you've purged the database and "baloo_index" counted 551,532 files to index - then come back a bit later and it says 1,087,500 (and is ploughing slowly through them).

The "device number" stat shows for the test file hasn't changed, so no immediate explanation for where baloo found the "doubled" files.

If you search for the testfile, or maybe for a file that "balooctl monitor" shows as just having been indexed, with:

    baloosearch --id ...fileindexedmorethanonce...

Do you get two/several results?

The --id option seems to be new and you can see the inode/device number in the id. Thanks are due to skierpage for pointing it out in Bug 438527 :-)

You can give the "id" to balooshow -x and get the indexed details, including the device number/inode of the file as indexed. So something like:

    balooshow -x 1000be0000fc01

Maybe one small step further forward?
Comment 10 Martin Steigerwald 2021-06-20 11:36:32 UTC
% baloosearch --id testfile.txt | grep testfile.txt | head -2
Elapsed: 14.6812 msec
23479b00000020 /home/martin/testfile.txt
23479b00000021 /home/martin/testfile.txt


% LANG=en balooshow -x 23479b00000020
23479b00000020 32 2312091 /home/martin/testfile.txt
        Mtime: 1623421630 2021-06-11T16:27:10
        Ctime: 1623421630 2021-06-11T16:27:10

Internal Info
Terms: Mplain Mtext T5 T8 
File Name Terms: Ftestfile Ftxt 
XAttr Terms: 


% LANG=en balooshow -x 23479b00000021
23479b00000021 33 2312091 /home/martin/testfile.txt
        Mtime: 1623421630 2021-06-11T16:27:10
        Ctime: 1623421630 2021-06-11T16:27:10
        Cached properties:
                Line Count: 1

Internal Info
Terms: Mplain Mtext T5 T8 X20-1 hello penguin 
File Name Terms: Ftestfile Ftxt 
XAttr Terms: 
lineCount: 1


Does that mean that Baloo saw two different device numbers (32 and 33)?

There have been several reboots, and it may be that at a certain point the device number was different. I only checked the device number once I actually noticed Baloo was re-indexing files. So maybe the re-indexing was triggered by a different device number from a boot in between?

I still think relying on the device number creates more problems than it solves. I hope you understand that I am not willing to change my setup dm-crypt with LUKS, LVM and then BTRFS (single) on top of it, to be able to guarantee a stable device number. I do think the device number is not supposed to be of relevance for any application and is not guaranteed to be stable in Linux, but I can certainly ask Linux kernel developers about their take on this.

But maybe I misread the above output. Anyway, I hope it helps.
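
The two IDs can be taken apart to see what differs. The sketch below assumes a layout inferred only from the output in this thread (low 32 bits are the device number, the bits above them are the inode); it is not taken from the Baloo source, and decode_baloo_id is a made-up helper name.

```python
def decode_baloo_id(doc_id_hex):
    """Split a hexadecimal document ID into (inode, device).

    Assumption inferred from the balooshow output in this thread:
    the low 32 bits hold the device number, the rest holds the inode.
    """
    value = int(doc_id_hex, 16)
    device = value & 0xFFFFFFFF
    inode = value >> 32
    return inode, device


for doc_id in ("23479b00000020", "23479b00000021"):
    inode, device = decode_baloo_id(doc_id)
    print(doc_id, "inode", inode, "device", device)
# 23479b00000020 inode 2312091 device 32
# 23479b00000021 inode 2312091 device 33
```

Under that assumption both entries point at the same inode (2312091) and differ only in the device number, which matches the second and third columns that "balooshow -x" prints.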
Comment 11 tagwerk19 2021-06-20 13:08:20 UTC
(In reply to Martin Steigerwald from comment #10)
> ... I hope you understand that I am not willing to change my setup
> dm-crypt with LUKS, LVM and then BTRFS (single) on top of it, to be able to
> guarantee a stable device number ...
Absolutely...

However baloo depends on having some sort of "invariant" for a file. Depending on a filename/path would also leave the system vulnerable to the random renaming of large directory trees or remounting something under a different mount point.

> ... I do think the device number is not
> supposed to be of relevance for any application and is not guaranteed to be
> stable in Linux, but I can certainly ask Linux kernel developers about their
> take on this ...
Search/indexing is somehow "in the middle" between being an application and system software. It seems to need to know deeper stuff (maybe things like Dropbox also need such knowledge)

Yes. If there's any magic way of asking for a vol or subvol mount to be "at" a given device number, that would sidestep around the problem. A forlorn, optimistic hope perhaps - but who knows?

> Does that mean that Baloo saw two different device numbers (32 and 33)?
> 
> There have been several reboots and it may be that at a certain point the
> device number has been different, I just checked for the device number as I
> actually noticed Baloo was re-indexing files. So maybe the re-indexing was
> triggered by a different device number from a boot in between?
I think so...

I've no practical experience of your stack but I've seen far worse with openSUSE's BTRFS setup 8-/

> But maybe I misread the above output. Anyway, I hope it helps.
I'll say thank you for persisting. If you find any workarounds, let us know...
Comment 12 Martin Steigerwald 2021-06-25 08:15:19 UTC
Now I got confirmation that the device number can be different. This boot:

% LANG=en stat testfile.txt
  File: testfile.txt
  Size: 14              Blocks: 8          IO Block: 4096   regular file
Device: 20h/32d Inode: 2312091     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/  martin)   Gid: ( 1000/  martin)
Access: 2021-06-24 16:08:54.489824537 +0200
Modify: 2021-06-11 16:27:10.953499381 +0200
Change: 2021-06-11 16:27:10.953499381 +0200
 Birth: 2021-06-11 16:27:10.953499381 +0200

And Baloo appears to index all the files another time again:

% LANG=en balooctl status
Baloo File Indexer is running
Indexer state: Suspended
Total files indexed: 1,191,425
Files waiting for content indexing: 278,307
Files failed to index: 0
Current size of index is 8.12 GiB

For NFS there is a fsid= mount option to specify the filesystem ID. Maybe that can help.
Comment 13 tagwerk19 2021-06-26 07:56:33 UTC
(In reply to Martin Steigerwald from comment #12)
> Now I got confirmation that the device number can be different.
If I follow the sequence...

You've had baloo initially indexing files with 21h minor device number. Then your $HOME reappeared with a 20h minor device number, the number of files indexed doubled and a new round of content indexing started (which may not have yet finished)

Possibly you'd jumped back to 21h (cannot really say) but you are now back with 20h and the indexing is continuing...
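
The sequence can be illustrated with a toy model. This is only a sketch under the assumption that the index key combines inode and device number the way the IDs shown earlier suggest; the dictionary, the inode range and the function names are invented for illustration and are not Baloo's code.

```python
def doc_id(device, inode):
    """Build an index key from device number and inode (assumed layout)."""
    return (inode << 32) | device


def scan(index, device, inodes):
    """Add every file seen under `device` to the index; return how many were new."""
    added = 0
    for inode in inodes:
        key = doc_id(device, inode)
        if key not in index:
            index[key] = "indexed"
            added += 1
    return added


files = range(1000, 1005)               # five hypothetical inode numbers
index = {}
first_boot = scan(index, 0x21, files)   # $HOME appears with device 21h
same_boot = scan(index, 0x21, files)    # rescan with the same device: nothing new
next_boot = scan(index, 0x20, files)    # $HOME reappears with device 20h
print(first_boot, same_boot, next_boot, len(index))  # 5 0 5 10
```

With only two device numbers in play the index stops growing once both sets of keys exist; each additional device number would add another full set.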

I think I'd be more worried if a new and different device number appeared, then you'd be in for an impossible job :-/

> For NFS there is a fsid= mount option to specify the filesystem ID. Maybe
> that can help.
I've tried putting fsid's in the /etc/fstab for my BTRFS mounts but they seem not to do anything. Worth a try :-)
Comment 14 Martin Steigerwald 2021-06-26 09:05:32 UTC
Dear tagwerk19, I finally asked Linux kernel developers on fixed device numbers:

Assumption on fixed device numbers in Plasma's desktop search Baloo

https://lore.kernel.org/linux-block/1769070.0rzTUBzp5V@ananda/T/#t

In there I got different opinions back. For one thing Qu Wenruo argued that also find uses device numbers during runtime to see whether it crosses filesystem boundaries. But on the other hand I do not think that "find" relies on them to be stable across reboots.

Neil Brown clearly said that no userspace component can rely on device numbers since kernel 2.4. Luckily he recommended an alternative:

"That is really hard to provide in general.  Possibly the best approach
is to use the statfs() systemcall to get the "f_fsid" field.  This is
64bits.  It is not supported uniformly well by all filesystems, but I
think it is at least not worse than using the device number.  For a lot
of older filesystems it is just an encoding of the device number.

For btrfs, xfs, ext4 it is much much better."

https://lore.kernel.org/linux-block/1769070.0rzTUBzp5V@ananda/T/#m28b8c889c9289ad1ec76cbf040938ea883e3f375

How about doing that? According to Qu Wenruo, unlike the filesystem UUID, which is the same for all subvolumes, it would also work for BTRFS, because BTRFS XORs the subvolume ID into the filesystem ID returned by that system call.
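
To get a feel for the value, here is a minimal sketch that reads the filesystem ID from Python. It uses os.statvfs(), which reports an f_fsid field since Python 3.7; note that this goes through the statvfs() wrapper rather than calling statfs() directly. The (f_fsid, inode) pair in candidate_key is purely a hypothetical illustration of the idea, not something Baloo does.

```python
import os


def filesystem_id(path):
    """Return the f_fsid value reported for the filesystem containing path."""
    return os.statvfs(path).f_fsid


def candidate_key(path):
    """Hypothetical alternative to (device number, inode): (f_fsid, inode)."""
    return filesystem_id(path), os.stat(path).st_ino
```

Whether the value stays the same across reboots on a given setup would still have to be checked on that setup, for example by printing filesystem_id(os.path.expanduser("~")) on several boots.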

While I still may find a work-around I think this approach could solve a lot of the issues that arise from Baloo relying on stable device numbers. And for filesystems not supporting it, it would at least not be worse than before. I bet more KDE Plasma users are using BTRFS, XFS, Ext4 anyway.

Of course for BSD you would need to look for a different solution or keep the current approach, in case it does not have that functionality. I have no idea what functionality BSD provides there. But for Linux I think this could be a viable alternative.

What do you think?
Comment 15 Martin Steigerwald 2021-06-26 10:12:29 UTC
One possible drawback for BTRFS could be: In case someone changes the subvolume that is mounted for /home, Baloo would re-index files. However… that still would be preferable, I think. Also I'd probably combine the statfs() fsid approach with an approach to tell Baloo that "/home" or another path is persistent. Actually I think in 99+% of all cases it is.

According to what I gathered the device number could change in several cases:

- BTRFS and/or LVM are in use and the order of doing things might change.
- In a desktop machine with several controllers there would be a driver loading race condition
- Even between different mounts, especially with Systemd, they probably would not be mounted in the same order.

BTRFS as well as LVM use so-called "anonymous" device numbers. From what I understand these are dynamically allocated device numbers that are only valid during run-time.
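
The device number that stat prints can be split into its major and minor part. A small sketch, using the 21h/33d value from the stat output earlier in this report; split_device is a made-up helper name:

```python
import os


def split_device(dev):
    """Split a device number as reported by stat into (major, minor)."""
    return os.major(dev), os.minor(dev)


print(split_device(0x21))  # on Linux: (0, 33)
```

Major number 0 is the one Linux uses for unnamed devices, so only the minor part (33 here, 32 on another boot) distinguishes the filesystem. For an arbitrary file, split_device(os.stat(path).st_dev) gives the same breakdown.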

So first step would be:

- Why does Baloo need an invariant for the file?
- Why wouldn't a rename mess things up without an invariant (device number or filesystem id)? Or otherwise put how would having device/filesystem unique invariant help with a rename? I bet you'd need a file/directory based invariant for that. I.e. a hash value for each file.

I think also regarding the energy efficiency goal it would be good to revisit all of this and come up with an approach that avoids clearly needless indexing work. I bet that indexing files and mails is easily the most energy and resource consuming aspect of Plasma desktop and KDE applications.
Comment 16 tagwerk19 2021-06-26 11:46:58 UTC
(In reply to Martin Steigerwald from comment #14)
> ...  Possibly the best approach
> is to use the statfs() systemcall to get the "f_fsid" field.  This is
> 64bits.  It is not supported uniformly well by all filesystems, but I
> think it is at least not worse than using the device number ...
I see that "stat -f testfile.txt" gives a 64-bit ID.

I've been comparing that to the minor device number and BTRFS subvolid in openSUSE. That ID appears stable (in my very constrained tests). I wasn't able to dig up a lot about a "64 bit" fsid with the help of Google...

> ... And
> for filesystems not supporting it, it would at least not be worse than
> before ...
It's not clear how to find that out :-)

The kernel.org thread does look interesting though, let me see if I can follow all the subtleties. I did try "requesting" mounts to be done in a particular order (via x-systemd.requires). No joy...
    https://bugs.kde.org/show_bug.cgi?id=402154#c24

> What do you think?
We're dependent on a willing developer. Alas, that's not my forte ...
Comment 17 tagwerk19 2021-06-26 22:55:44 UTC
(In reply to Martin Steigerwald from comment #15)
> Why does Baloo need an invariant for the file?
As I understand it... internally, it is the key within the index. It also allows "missed changes" to be reconciled if baloo is not running when the file is changed or has missed the inotify.

> Why wouldn't a rename mess things up without an invariant (device number
> or filesystem id)? Or otherwise put how would having device/filesystem
> unique invariant help with a rename?
I think "the trap" is to avoid reindexing everything in a large folder tree if you rename the top foldername. You need a way to tell if oldtree/x/y/z is the same file as newname/x/y/z or not...
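
A small sketch of why the device number / inode pair helps here: renaming the top folder leaves the pair of a file deep in the tree unchanged, so an index keyed on it can recognise the file under its new path. The directory and file names are the ones from the example above; the function names are hypothetical.

```python
import os
import tempfile


def identity(path):
    """Return the (device number, inode) pair for path."""
    st = os.stat(path)
    return st.st_dev, st.st_ino


def rename_keeps_identity():
    """Rename the top folder of a small tree and compare a file's identity."""
    with tempfile.TemporaryDirectory() as base:
        old_top = os.path.join(base, "oldtree")
        os.makedirs(os.path.join(old_top, "x", "y"))
        old_file = os.path.join(old_top, "x", "y", "z")
        with open(old_file, "w") as fh:
            fh.write("hello penguin\n")
        before = identity(old_file)

        # Rename only the top-level folder; the file itself is not touched.
        new_top = os.path.join(base, "newname")
        os.rename(old_top, new_top)
        after = identity(os.path.join(new_top, "x", "y", "z"))
        return before == after
```

On a local filesystem rename_keeps_identity() is expected to return True, whereas an index keyed on the path would see newname/x/y/z as a new file.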

From my experience, baloo has to react to inotify events and also be able to smoothly recover/catch up if the events are missed.

> ... I bet you'd need a file/directory based
> invariant for that. I.e. a hash value for each file ...
Baloo also allows you to index the filename/metadata and not index the content. A hash would be extra work here...
Comment 18 tagwerk19 2021-07-06 10:02:43 UTC
Was able to replicate, flagging as "Confirmed"
Comment 19 Martin Steigerwald 2021-08-02 07:07:31 UTC
I switched Baloo to just indexing filenames, not contents, because it was so unbearable for me.

There is a new discussion on how to deal with BTRFS/nfsd subvol dev/inode number issues and how to allow user space to compare two items for real.

Starting here:

A Third perspective on BTRFS nfsd subvol dev/inode number issues.

https://lore.kernel.org/linux-btrfs/CAJfpegub4oBZCBXFQqc8J-zUiSW+KaYZLjZaeVm_cGzNVpxj+A@mail.gmail.com/T/#m45d0820a1e660ce28c79992a829588de67fd38c3

One interim suggestion is for BTRFS to use hashed inode numbers that are unique in most cases. However ultimately Neil Brown suggests to tell user space developers to use a new way to compare whether items are the same:

"The "obvious" choice for a replacement is the file handle provided by
name_to_handle_at() (falling back to st_ino if name_to_handle_at isn't
supported by the filesystem).  This returns an extensible opaque
byte-array.  It is *already* more reliable than st_ino.  Comparing
st_ino is only a reliable way to check if two files are the same if you
have both of them open.  If you don't, then one of the files might have
been deleted and the inode number reused for the other.  A filehandle
contains a generation number which protects against this.

So I think we need to strongly encourage user-space to start using
name_to_handle_at() whenever there is a need to test if two things are
the same."
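
For experimenting with this, here is a rough, Linux-only sketch that calls name_to_handle_at() through ctypes and falls back to the device number / inode pair when the call is unavailable or fails, as the quote suggests. The constants are the values I know from the Linux headers (AT_FDCWD is -100, MAX_HANDLE_SZ is 128); file_identity and the tuple layout it returns are made up for this example.

```python
import ctypes
import os

AT_FDCWD = -100        # resolve relative paths against the current directory
MAX_HANDLE_SZ = 128    # upper bound for the opaque handle size


class FileHandle(ctypes.Structure):
    """Mirror of struct file_handle with a fixed-size buffer for f_handle."""
    _fields_ = [
        ("handle_bytes", ctypes.c_uint),
        ("handle_type", ctypes.c_int),
        ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ),
    ]


def file_identity(path):
    """Return ("handle", type, bytes), or ("inode", dev, ino) as a fallback."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        func = libc.name_to_handle_at
    except (OSError, AttributeError, TypeError):
        func = None  # libc or the symbol is not available on this platform
    if func is not None:
        handle = FileHandle()
        handle.handle_bytes = MAX_HANDLE_SZ
        mount_id = ctypes.c_int()
        rc = func(AT_FDCWD, os.fsencode(path),
                  ctypes.byref(handle), ctypes.byref(mount_id), 0)
        if rc == 0:
            return ("handle", handle.handle_type,
                    bytes(handle.f_handle[:handle.handle_bytes]))
    # Filesystem or sandbox does not support file handles: use st_dev/st_ino.
    st = os.stat(path)
    return ("inode", st.st_dev, st.st_ino)
```

Two paths would then count as the same file when file_identity() returns equal tuples for both.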

There is a huge discussion following this. I do not have the time to review it right now, however there might be something in it in order to make Baloo work for these use cases.
Comment 20 tagwerk19 2021-08-02 18:15:34 UTC
(In reply to Martin Steigerwald from comment #19)
> There is a huge discussion following this. I do not have the time to review
> it right now, however there might be something in it in order to make Baloo
> work for these use cases.
Many thanks for keeping watch on the topic and there is indeed a lot to read through.

Do you think this:

https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.stgit@noble.brown/

could imply that the major:minor device numbers, as seen by stat (and baloo), start relating to the subvol? cf:

    There are long-standing problems with btrfs subvols, particularly in
    relation to whether and how they are exposed in the mount table.

     - /proc/self/mountinfo reports the major:minor device number for each
        filesystem and when a btrfs subvol is explicitly mounted, the number
        reported is wrong - it does not match what stat() reports for the
        mountpoint.

But there does seem to be a wide range of options put forward and it's not really clear what the front runner is.

For me, name_to_handle_at() returns a 20 byte handle. Having such an invariant is good, but it is big...

Thanks again...
Comment 21 Nate Graham 2022-03-05 14:47:30 UTC

*** This bug has been marked as a duplicate of bug 401863 ***
Comment 22 Martin Steigerwald 2022-03-05 22:34:49 UTC
(In reply to tagwerk19 from comment #20)
> (In reply to Martin Steigerwald from comment #19)
> > There is a huge discussion following this. I do not have the time to review
> > it right now, however there might be something in it in order to make Baloo
> > work for these use cases.
> Many thanks for keeping watch on the topic and there is indeed a lot to read
> through.
> 
> Do you think this:
> 
> https://lore.kernel.org/linux-btrfs/162742539595.32498.13687924366155737575.
> stgit@noble.brown/
> 
> could imply that the major:minor device numbers, as seen by stat (and
> baloo), start relating to the subvol? cf:

Tagwerk, this is not only related to BTRFS. As established before, device major:minor numbers by the kernel are not guaranteed to be stable across reboots.

Using it as a static identifier inside Baloo thus, in my humble opinion, is a design mistake.

About the alternatives: there are quite a few, and I have not completely decided which one would be best.

But unless there is a willingness to actually consider replacing the major:minor number with something else, there is no point in discussing this further, I'd say.
Comment 23 tagwerk19 2022-03-06 07:08:16 UTC
(In reply to Martin Steigerwald from comment #22)
> ...  this is not only related to BTRFS ...
That's understood.

Any fix to (specifically) BTRFS mounts would be like applying a sticking plaster; better than trying to mitigate by mounting devices in a specific order; maybe not as good as being able to specify a "would rather like" Minor Device number in a mount command.

It is a "Plan B" though in the absence of a determined developer who's willing to take up the Baloo reengineering work and the adoption of BTRFS in distros.
Comment 24 Martin Steigerwald 2022-03-06 09:33:00 UTC
(In reply to tagwerk19 from comment #23)
> (In reply to Martin Steigerwald from comment #22)
> > ...  this is not only related to BTRFS ...
> That's understood.
[…]
> It is a "Plan B" though in the absence of a determined developer who's
> willing to take up the Baloo reengineering work and the adoption of BTRFS in
> distros.

Well, I am not sure whether any of what they discussed in this thread has been
merged yet. If it has, I should have it already or soon, as I am currently using
a 5.17-rc6 kernel.

So far I think I still have this issue of indexing the same files twice and thrice and so on,
but I can keep an eye on it.

I replied to this large thread and Neil replied to me then:

"> Bug 438434 - Baloo appears to be indexing twice the number of files than 
> are actually in my home directory
> 
> https://bugs.kde.org/438434

This bug wouldn't be address by using the filehandle.  Using a
filehandle allows you to compare two files within a single filesystem.
This bug is about comparing two filesystems either side of a reboot, to
see if they are the same.

As has already been mentioned in that bug, statfs().f_fsid is the best
solution (unless comparing the mount point is satisfactory)."

https://lore.kernel.org/linux-btrfs/CAJfpegub4oBZCBXFQqc8J-zUiSW+KaYZLjZaeVm_cGzNVpxj+A@mail.gmail.com/T/#meaf736156e0937728e63c6fdc69376a5f4b02af2