Bug 433204

Summary:	baloo_file SEGV
Product:	[Frameworks and Libraries] frameworks-baloo	Reporter:	Peter <p.wibberley>
Component:	general	Assignee:	Stefan Brüns <stefan.bruens>
Status:	CONFIRMED ---
Severity:	normal	CC:	jr, nate, neon-bugs-null, sitter, tagwerk19, wouter
Priority:	NOR	Keywords:	usability
Version First Reported In:	unspecified
Target Milestone:	---
Platform:	Neon
OS:	Linux
Latest Commit:		Version Fixed/Implemented In:
Sentry Crash Report:

Description Peter 2021-02-18 21:40:59 UTC

As of 17 December 2020, Dolphin is not displaying any files in the timeline. Running 'balooctl status' shows "Baloo File Indexer is not running". I can run 'balooctl enable', immediately after which 'balooctl status' shows that it is running, but within a couple of seconds, running it again shows that it has stopped. No additional files get indexed. (I have just over 200,000 files indexed and the current size of the index is 28.73 GB.)

On a couple of occasions, running 'balooctl status' gave the message "kf.baloo: KDE Baloo File Indexer has reached the inotify folder watch limit. File changes will be ignored." This message was repeated about 470 times, after which it reported "Baloo File Indexer is running". However, it stopped a few seconds later. I tried running "sudo sysctl -w fs.inotify.max_user_watches=16383", up from the previous value of 8192, but this has made no difference.

I'm not sure how to obtain any more troubleshooting information. Any advice much appreciated.

Thanks and regards

STEPS TO REPRODUCE
1. In Konsole, run "balooctl status".
2. Run "balooctl enable".
3. Run "balooctl status" two or three times more.

OBSERVED RESULT
'balooctl status' reports "Baloo File Indexer is not running"
'balooctl enable' reports "Enabling and starting the File Indexer"
'balooctl status' reports "Baloo File Indexer is running" if run immediately, but then "Baloo File Indexer is not running" 3 or 4 seconds later.

EXPECTED RESULT

'balooctl status' should report "Baloo File Indexer is running".

SOFTWARE/OS VERSIONS
Linux/KDE Plasma: KDE Neon Plasma LTS 5.18
KDE Plasma Version: 5.18.6
KDE Frameworks Version: 5.79.0
Qt Version: 5.15.2

ADDITIONAL INFORMATION
Kernel Version 5.4.0-65-generic

Comment 1 Stefan Brüns 2021-02-19 08:16:34 UTC

fs.inotify.max_user_watches=8192 is a ridiculously low value; even 16384 is. Is that the Neon default?

Comment 2 Nate Graham 2021-02-19 20:10:33 UTC

Indeed it is, at least based on my wife's Neon system in its default configuration. That is indeed absurdly low. On my openSUSE Tumbleweed system, it's 65536.

I wonder if this could be what's behind a lot of the "Baloo doesn't work!" complaints we get, at least from Neon users.

Moving to Neon; hopefully this can be increased!

Comment 3 tagwerk19 2021-02-20 08:35:40 UTC

> ... File changes will be ignored
When the limit is reached, does inotify stop flagging _any_ changes (for that user), rather than just failing to add new watches?

When I lowered the limit gradually until I started seeing failures, it looked as though about 120 "watches" are already in use before baloo starts adding its own. Will these also stop reporting when the limit is reached?

Of interest, I don't see 'balooctl status' reporting 'Indexer is not running' after the failures; it still reports 'is running'.

Comment 4 Harald Sitter 2021-02-22 10:25:08 UTC

Wasn't there a GUI that tells the user about inotify exhaustion? I distinctly recall Vishesh talking about it at some meeting or other, though I realize that may have been just musings, and what I recall in the back of my mind may have been something nepomuk had. In any case, silently ignoring exhaustion seems bad from a UX POV, regardless of silly distro limits.

Comment 5 Peter 2021-02-22 12:04:27 UTC

All, 

Thank you for your comments. I presume that the value of 8192 for max_user_watches is the Neon default, as I haven't previously changed it (unless something else did).  I had thought that, even if the value were very low and were causing baloo to fail, then doubling it should have made a difference.  I've now tried 65472 (i.e. 65536 less a bit) (and confirmed the change with 'cat /proc/sys/fs/inotify/max_user_watches') but the problem still persists.  

Do you have any suggestions of how high a value of max_user_watches I should try?  And are there any other diagnostics I can run?  

Many thanks and best regards

Peter

Comment 6 tagwerk19 2021-02-22 13:13:38 UTC

I think the questions to ask are...

    How many folders do I have?
    After I increased the value, do I still get the
        "... File changes will be ignored." messages?
    Am I running anything else that might be watching for changes?

I'm basing my reasoning on

    https://unix.stackexchange.com/questions/13751/kernel-inotify-watch-limit-reached

Comment 7 Harald Sitter 2021-02-22 17:08:49 UTC

For the record, 8192 was the kernel's default value up to 5.11, where it was changed to a dynamic calculation like the one already used for epoll: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/notify/inotify/inotify_user.c?id=92890123749bafc317bbfacbe0a62ce08d78efb7

Comment 8 Nate Graham 2021-02-22 18:19:19 UTC

(In reply to Harald Sitter from comment #4)
> Wasn't there a GUI that tells the user about inotify exhaustion?
Hmm, not to my knowledge.

Comment 9 Peter 2021-02-23 10:12:34 UTC

I have no idea whether this is significant. As per my original message, Dolphin stopped displaying any timeline on 17 December 2020 (which is how I noticed baloo wasn't working). From an inspection of /var/log/apt/history.log, for packages with 'baloo' in the name, there was an update on 13 December 2020 of:

(1) libkf5baloowidgets5:amd64 (4:20.08.3-0xneon+20.04+focal+build8, 4:20.12.0-0xneon+20.04+focal+build9) 

The next update, on 16 December 2020, included:  

(1) libkf5balooengine5:amd64 (5.76.0-0xneon+20.04+focal+build12, 5.77.0-0xneon+20.04+focal+build13)
(2) libkf5baloo5:amd64 (5.76.0-0xneon+20.04+focal+build12, 5.77.0-0xneon+20.04+focal+build13)
(3) baloo-kf5:amd64 (5.76.0-0xneon+20.04+focal+build12, 5.77.0-0xneon+20.04+focal+build13).

The next update after that wasn't until 24 December 2020, with nothing obviously related to baloo being updated.  The next updates of baloo packages were 8 and 11 January 2021. The current versions, according to 'apt list', are all (5.79.0-0xneon+20.04+focal+build17 amd64)

Regards

Comment 10 Jonathan Riddell 2021-02-23 19:54:12 UTC

I compiled configure-inotify, made systemd start the service, and got a new file:

> cat /run/sysctl.d/40-neon-inotify.conf
# This file gets auto-generated by neon-configure-inotify. Changes will be overwritten!
fs.inotify.max_user_watches=129762

I wonder if this should be built into baloo rather than being distro specific

Comment 11 Peter 2021-02-23 20:34:21 UTC

Jonathan, 

Tried 129762, but I still have the problem.  

Incidentally, I tried running 'balooctl monitor' and then 'balooctl enable' from another Konsole.  I got the messages as follows:  

Waiting for file indexer to start
Press Ctrl+C to stop monitoring
File indexer is running
Starting
Indexing modified files
Starting
Checking for stale index entries
Waiting for file indexer to start
Press Ctrl+C to stop monitoring

The 'Total number of files indexed', as reported by 'balooctl status' does not change (on my system remaining at just over 200,000).  

As ever, thanks to everyone for their interest and suggestions. Any suggestions for further diagnostics I can run will be much appreciated.

P

Comment 12 tagwerk19 2021-02-23 22:08:04 UTC

I tried creating a:

    /etc/sysctl.d/40-neon-inotify.conf

file, added the line

    fs.inotify.max_user_watches=129762

and rebooted. Checked with:

    sysctl fs.inotify.max_user_watches

and got:

    fs.inotify.max_user_watches = 129762

A "vanilla approach" but it works. I've not done anything elsewhere to set the value.

Comment 13 Harald Sitter 2021-02-24 11:13:20 UTC

(In reply to Jonathan Riddell from comment #10)
> I wonder if this should be built into baloo rather than being distro specific

It will sort itself with kernel 5.11. Well, to a degree.

If my maths is correct linux 5.11 sets the watch limit to
32G ~> 260k
16G ~> 130k
8G  ~>  65k
4G  ~>  30k

So this can conceivably still be exhausted easily. For example, the original reporter would need more than 16G of RAM, or would have to set the watch limit manually, to track their 200k files.

As I've said, the UX is fairly weak since we don't inform the user when they run out of inotify capacity. KDirWatch has a similar problem, really (albeit more with instances than watches).
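Harald's table can be checked with a rough model of the 5.11 change, as we read the commit linked in comment #7: about 1% of memory is budgeted for inotify watches, divided by a per-watch memory cost, and the result is clamped between the old default of 8192 and 1048576. The sketch below is an approximation under those assumptions. In particular, `WATCH_COST_BYTES` is not the kernel's constant; it is back-solved from Harald's numbers (roughly 1.3 KB per watch reproduces his 8G ~> 65k and 16G ~> 130k rows), so treat every result as an estimate.

```python
"""Rough estimate of the Linux >= 5.11 default for fs.inotify.max_user_watches.

Assumptions (approximations, not copied from the kernel source):
  * about 1% of RAM is budgeted for inotify watches;
  * each watch costs WATCH_COST_BYTES of kernel memory (back-solved from the
    figures in this thread, not the kernel's real per-watch cost);
  * the result is clamped between the old default and an upper bound.
"""

OLD_DEFAULT = 8192            # fixed default before Linux 5.11
UPPER_CLAMP = 1_048_576       # upper bound on the computed default
WATCH_COST_BYTES = 1320       # hypothetical per-watch cost, chosen to match the table above


def estimated_default_watches(ram_bytes: int, cost: int = WATCH_COST_BYTES) -> int:
    """Estimate the default max_user_watches for a machine with ram_bytes of RAM."""
    budget = ram_bytes // 100                 # roughly 1% of memory
    watches = budget // cost                  # how many watches fit in that budget
    return max(OLD_DEFAULT, min(watches, UPPER_CLAMP))


if __name__ == "__main__":
    GIB = 1024 ** 3
    for ram_gib in (4, 8, 16, 32):
        estimate = estimated_default_watches(ram_gib * GIB)
        print(f"{ram_gib:>2} GiB RAM -> about {estimate:,} watches")
```

Per comment #27, baloo watches directories rather than files, so the number to compare such an estimate against is the count of indexed folders, not the 200k indexed files.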

Comment 14 Peter 2021-02-24 11:49:46 UTC

(In reply to Harald Sitter from comment #13)

The thing I don't understand is that, up until 17 December, I would have had max_user_watches set at 8192 with the number of files not being significantly less than the current number, and it was working OK.  Now, even a 16-fold increase in the value doesn't seem to make any difference.  Hence I was wondering whether the problem could have been related to updates.  

From my perspective, my problem is getting more diagnostic information. 

Thanks and regards

Comment 15 Harald Sitter 2021-02-24 11:58:32 UTC

I can't really say anything about that; I know nothing about baloo. The inotify cap needs fixing one way or another, but if increasing the cap doesn't help you, I guess your problem has nothing to do with it.

Comment 16 Peter 2021-02-24 12:28:13 UTC

(In reply to Harald Sitter from comment #15)
> I can't really say anything about that I know nothing about baloo. The
> inotify cap needs fixing one way or another, but if increasing the cap
> doesn't help you I guess your problem has nothing to do with it.

Understood.  Thank you.

Comment 17 tagwerk19 2021-02-24 13:34:19 UTC

I think that baloo_file watches directories (mostly) and not files.

I got this from a trick: kill the baloo_file process and start it again from the command line under strace. So...

    ps ax | grep baloo_file
    kill ...
    strace baloo_file

This gives shedloads of information: details of every system call. You can, however, see the 'inotify_add_watch' calls for each individual directory and their return values: either a steadily increasing count, or a failure accompanied by the 'changes will be ignored' message. You should be able to see a bit more about what's happening.

I haven't worked out how to see what info is being passed to the baloo_file_extractor process and what it is doing. I also haven't discovered where baloo logs its activities; that would be good to know!
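To turn that output into numbers, a small script can tally the inotify_add_watch results: a non-negative return value is a new watch descriptor, while `-1 ENOSPC` is what the call returns once max_user_watches is exhausted (per the inotify_add_watch(2) man page). This is a sketch that assumes strace's default line format and looks only at inotify_add_watch lines. Writing the trace with `strace -o FILE` to a location that is not indexed keeps baloo from reacting to the trace file itself, and `-s 256` stops strace from truncating long paths.

```python
import re
import sys
from collections import Counter

# Matches strace lines such as
#   inotify_add_watch(14, "/home/user/Music", IN_MODIFY|IN_ATTRIB|...) = 42
#   inotify_add_watch(14, "/home/user/Docs", IN_MODIFY|...) = -1 ENOSPC (No space left on device)
# An optional "[pid N] " prefix (strace -f) is tolerated because search() is used.
ADD_WATCH = re.compile(
    r'inotify_add_watch\(\d+, "(?P<path>[^"]*)".*\)\s*=\s*(?P<ret>-?\d+)'
    r'(?:\s+(?P<errno>E[A-Z0-9]+))?'
)


def summarize_add_watch(lines):
    """Tally inotify_add_watch results and remember the first path that failed."""
    stats = Counter()
    first_failure = None
    for line in lines:
        match = ADD_WATCH.search(line)
        if match is None:
            continue                          # not an inotify_add_watch call
        if int(match["ret"]) >= 0:
            stats["ok"] += 1                  # got a watch descriptor
        else:
            stats[match["errno"] or "unknown error"] += 1
            if first_failure is None:
                first_failure = match["path"]
    return stats, first_failure


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as trace:
        stats, first_failure = summarize_add_watch(trace)
    print(dict(stats))
    print("first failing path:", first_failure)
```

For example, after `strace -s 256 -o /tmp/baloo.trace baloo_file` (assuming /tmp is not among the indexed folders), running the script with `/tmp/baloo.trace` as its argument prints the counts.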

Comment 18 Peter 2021-02-24 14:38:27 UTC

(In reply to tagwerk19 from comment #17)

Tagwerk19,

Thank you for the suggestion.     

No need to kill the process, as it's dying anyway. 

You're not wrong about shedloads: over 20MB and 200,000 lines collected in just a few seconds. There are 11,973 instances of 'inotify_add_watch', with an 'inotify_init1(IN_CLOEXEC)' and an 'inotify_init()' very early on. There appear to be several phases with different clusters of error messages. And it ends with "+++ killed by SIGSEGV +++" and "Segmentation fault", which would seem to explain why it's not running for long.
 
I don't want to burden you with megabytes of probably superfluous detail, but do you have any suggestions about how I should proceed?  

Thank you again, and regards

P

Comment 19 tagwerk19 2021-02-24 17:05:20 UTC

I would look, though, at what baloo_file had started to index when it began logging errors; maybe something strange will show up...

You can exclude folders from indexing, either through System Settings > Search or by editing the ~/.config/baloofilerc file. Add a line:

    exclude folders[$e]=$HOME/dontindexme

and that folder (and its subfolders) will be skipped. Maybe you can pin down a folder that is giving trouble.

Alas, I cannot help with the SIGSEGV; that needs someone with experience of core dumps...

Comment 20 Peter 2021-02-24 22:22:14 UTC

(In reply to tagwerk19 from comment #19)
> I would look though at what baloo_file had started to index when it started
> to log errors, maybe it's possible to see something strange...
> 
> You can exclude folders from indexing, either through system settings /
> search or by editing the .config/baloofilerc file. Add a line:
> 
>     exclude folders[$e]=$HOME/dontindexme
> 
> and you can skip index that folder (and its subfolders). Maybe you can pin
> down a folder that is giving trouble.
> 
> Alas I cannot help with the SIGSEGV, need someone with experience of core
> dumps...

Tagwerk19,

Not a lot of obvious errors. However, the last 130,000 of the 210,000 lines from strace (except for the final line, "+++ killed by SIGSEGV +++") consist of 43,000 repeats of these three lines:

ioctl(14, FIONREAD, [48])               = 0

read(14, "\1\0\0\0\2\0\0\0\0\0\0\0 \0\0\000210224Baloo_file"..., 48) = 48

poll([{fd=3, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLPRI}, {fd=11, events=POLLIN}, {fd=14, events=POLLIN}], 5, 10) = 1 ([{fd=14, revents=POLLIN}])

That seems a little strange.

Comment 21 tagwerk19 2021-02-24 22:49:12 UTC

I think we do need someone who knows to come in here...

A couple of observations might help, though. I see lines like:

    ioctl(14, FIONREAD, [32])               = 0
    read(14, "\3\0\0\0\0\2\0\0\0\0\0\0\20\0\0\0testfile2.txt\0\0\0", 32) = 32

appear when I write to "testfile2.txt". Could you be logging to a file and that file being indexed?

I've also looked at:

    https://community.kde.org/Guidelines_and_HOWTOs/Debugging/How_to_create_useful_crash_reports

It's being actively updated. Maybe it can give a pointer or two.
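The second argument of those read() calls is the raw inotify event buffer, laid out as `struct inotify_event` (wd, mask, cookie, len, then a NUL-padded name) as documented in inotify(7). Decoding the 48 bytes from comment #20 gives an IN_MODIFY event for "210224Baloo_file", which fits the suspicion above: presumably every line strace writes to that file triggers an event, baloo_file reads it, and strace logs that read into the same file. Note that strace prints a NUL followed by a digit as `\000`, so the length field in `" \0\0\000210224..."` is the space character, i.e. 32. The sketch assumes a little-endian machine such as the x86_64 systems in this report and decodes only the common mask bits.

```python
import struct

# struct inotify_event { int wd; uint32_t mask; uint32_t cookie; uint32_t len; char name[]; }
# "<" forces little-endian with no padding, matching x86_64 (an assumption).
EVENT_HEADER = struct.Struct("<iIII")

MASK_NAMES = {
    0x001: "IN_ACCESS", 0x002: "IN_MODIFY", 0x004: "IN_ATTRIB",
    0x008: "IN_CLOSE_WRITE", 0x010: "IN_CLOSE_NOWRITE", 0x020: "IN_OPEN",
    0x040: "IN_MOVED_FROM", 0x080: "IN_MOVED_TO", 0x100: "IN_CREATE",
    0x200: "IN_DELETE", 0x400: "IN_DELETE_SELF", 0x800: "IN_MOVE_SELF",
}


def decode_events(buf: bytes):
    """Return a list of (wd, mask flag names, cookie, name) tuples from one read()."""
    events = []
    offset = 0
    while offset + EVENT_HEADER.size <= len(buf):
        wd, mask, cookie, name_len = EVENT_HEADER.unpack_from(buf, offset)
        offset += EVENT_HEADER.size
        raw_name = buf[offset:offset + name_len]   # name is NUL-padded to name_len bytes
        offset += name_len
        name = raw_name.split(b"\0", 1)[0].decode("utf-8", "replace")
        flags = [label for bit, label in MASK_NAMES.items() if mask & bit]
        events.append((wd, flags, cookie, name))
    return events


# The 48-byte buffer from comment #20, rewritten as a Python bytes literal.
PETER_BUF = (b"\x01\x00\x00\x00" b"\x02\x00\x00\x00" b"\x00\x00\x00\x00" b"\x20\x00\x00\x00"
             + b"210224Baloo_file" + b"\x00" * 16)

if __name__ == "__main__":
    for event in decode_events(PETER_BUF):
        print(event)
```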

Comment 22 Peter 2021-02-25 11:33:33 UTC

(In reply to tagwerk19 from comment #21)
> I think we do need someone who knows to come in here...
> 
> A couple of observations might help though, I see lines like:
> 
>     ioctl(14, FIONREAD, [32])               = 0
>     read(14, "\3\0\0\0\0\2\0\0\0\0\0\0\20\0\0\0testfile2.txt\0\0\0", 32) = 32
> 
> appear when I write to "testfile2.txt". Could you be logging to a file and
> that file being indexed?
> 
> I've also looked at:
> 
>     https://community.kde.org/Guidelines_and_HOWTOs/Debugging/How_to_create_useful_crash_reports
> 
> It's being actively updated. Maybe it can give a pointer or two.

Doh! The file "210224Baloo_file", which is referenced 42,835 times, is the file to which I was redirecting the strace output! I will exclude it from indexing and see whether I get more interesting (or, failing that, at least less) output.

Many thanks.

Comment 23 tagwerk19 2021-02-25 22:46:16 UTC

You're the first one that's happened to :-)

So, I blindly followed the "How to create useful crash reports" and was able to run

    coredumpctl

and get a list of crash dumps with their "process numbers", the most recent at the bottom. Then

    coredumpctl gdb <process number>

and after lots of stuff had scrolled off the screen

    bt

This was in "Neon Testing".

For "Neon" I first had to install

    systemd-coredump

reboot and then trigger a crash to get the same thing.

... Every day a learning day.

Comment 24 Harald Sitter 2021-03-01 12:11:19 UTC

Bouncing the bug back to baloo since the original report seems to have a different problem.

neon now has a tool that sets max_user_watches at boot to a value similar to what Linux >= 5.11 uses.

Comment 25 tagwerk19 2021-03-01 12:59:41 UTC

(In reply to tagwerk19 from comment #23)
> You're the first one that's happened to :-)
Mea Culpa.

My apologies, that should of course have read

  You're *not* the first one that's happened to :-)

Comment 26 Stefan Brüns 2021-03-01 13:40:51 UTC

(In reply to Jonathan Riddell from comment #10)
> I compiled configure-inotify and made systemd start the service and I got a
> new file made
> 
> > cat /run/sysctl.d/40-neon-inotify.conf
> # This file gets auto-generated by neon-configure-inotify. Changes will be overwritten!
> fs.inotify.max_user_watches=129762
> 
> I wonder if this should be built into baloo rather than being distro specific

This is a hard limit, not changeable by (baloo on behalf of the) user.

Comment 27 Stefan Brüns 2021-03-01 13:44:03 UTC

(In reply to Harald Sitter from comment #13)
> (In reply to Jonathan Riddell from comment #10)
> > I wonder if this should be built into baloo rather than being distro specific
> 
> It will sort itself with kernel 5.11. Well, to a degree.
> 
> If my maths is correct linux 5.11 sets the watch limit to
> 32G ~> 260k
> 16G ~> 130k
> 8G  ~>  65k
> 4G  ~>  30k
> 
> So this can conceivably still be easily exhausted. e.g. the original
> reporter would need to have >16G or manually set the watch limit to track
> their 200k files.

Wrong. Baloo watches directories, not files.

Comment 28 Harald Sitter 2021-03-01 14:06:21 UTC

Good to know. Then the limits are probably sufficient 👍

Comment 29 Peter 2021-03-01 14:13:44 UTC

(In reply to tagwerk19 from comment #25)
> (In reply to tagwerk19 from comment #23)
> > You're the first one that's happened to :-)
> Mea Culpa.
> 
> My apologies, that should have course read
> 
>   You're *not* the first one that's happened to :-)

No worries. From the tone of your comment, I just assumed you were pulling my leg!

Comment 30 Peter 2021-03-03 22:04:13 UTC

Following tagwerk19's suggestion (comment #21), I've rerun 'strace baloo_file', making sure not to index the file I'm logging the strace output to. This has saved about 130,000 lines of output. Thank you, tagwerk19!

The final 20 lines of strace output, from line 85841 onwards, are now:

openat(AT_FDCWD, "/home/peter/Music/Marc Maron", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 15
fstat(15, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getdents64(15, /* 10 entries */, 32768) = 520
getdents64(15, /* 0 entries */, 32768)  = 0
close(15)                               = 0
openat(AT_FDCWD, "/home/peter/Music/Dirk Gently Series x", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 15
fstat(15, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getdents64(15, /* 8 entries */, 32768)  = 528
getdents64(15, /* 0 entries */, 32768)  = 0
close(15)                               = 0
poll([{fd=3, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLPRI}, {fd=11, events=POLLIN}, {fd=14, events=POLLIN}], 5, 2577) = 0 (Timeout)
write(3, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=3, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLPRI}, {fd=11, events=POLLIN}, {fd=14, events=POLLIN}], 5, 0) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\1\0\0\0\0\0\0\0", 16)         = 8
futex(0x55d72a31eca0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55d72a31ec50, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55d72a043578, FUTEX_WAKE_PRIVATE, 1) = 1
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=3, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLPRI}, {fd=11, events=POLLIN}, {fd=14, events=POLLIN}], 5, 115985 <unfinished ...>) = ?
+++ killed by SIGSEGV +++ 

So I'm still getting "+++ killed by SIGSEGV +++", which presumably explains why baloo stops running.

A bit odd are the three "poll" calls. Earlier in the log file these occur only about once every 7,000 lines, until this trio of them in the final few lines.

There are about 12,000 instances of inotify_add_watch. Running 'find . -type d -print | wc -l' tells me I have nearly 36,000 directories. I don't know how many of these are excluded from file indexing, so I can't say whether I've got an 'inotify_add_watch' for every indexed directory, but I guess it's possible.

Earlier this evening, I was getting only 8,000 instances of inotify_add_watch, until I realised that max_user_watches had reverted to 8192, and I increased it to 129762 again. Before increasing the limit, there was also "ENOSPC (No space left on device)" appended to the last 450 or so 'inotify_add_watch' messages. At the moment, I can't think of anything that could have suddenly pushed the number of directories well over 8,000 on or about 17 December, when the problem started, but maybe there was something. But when I increase the limit to circa 130,000, the problem doesn't go away: could there also be some parameter other than max_user_watches that is still too low and causing the SIGSEGV?
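Comparing the number of inotify_add_watch calls with the number of directories that are actually indexed would answer the question above. A sketch, assuming baloo needs roughly one watch per indexed folder (comment #27) and that the excluded folders are supplied by hand, since it does not read baloofilerc: it counts directories roughly the way `find . -type d | wc -l` does (symlinked directories are not followed), prunes the excluded subtrees, and reads the current limit from /proc. The limit is per user and shared with any other program watching files (comment #6), so some headroom is needed on top of the folder count.

```python
import os
import sys
from pathlib import Path

LIMIT_FILE = "/proc/sys/fs/inotify/max_user_watches"


def count_directories(root, excluded=()):
    """Count directories under root (root included), skipping excluded subtrees.

    Like `find root -type d`, symlinks to directories are neither followed nor counted.
    """
    excluded = {os.path.realpath(p) for p in excluded}
    total = 0
    for dirpath, dirnames, _filenames in os.walk(root):   # followlinks=False by default
        total += 1
        # Prune excluded folders in place so os.walk does not descend into them.
        dirnames[:] = [
            d for d in dirnames
            if os.path.realpath(os.path.join(dirpath, d)) not in excluded
        ]
    return total


def read_watch_limit(path=LIMIT_FILE):
    """Return the current per-user inotify watch limit (Linux only)."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    home = os.path.expanduser("~")
    # Excluded folders are passed on the command line (a hypothetical interface).
    folders = count_directories(home, excluded=sys.argv[1:])
    limit = read_watch_limit()
    print(f"{folders:,} folders to watch, max_user_watches = {limit:,}")
    if folders >= limit:
        print("The watch limit is likely to be exhausted.")
```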

Any clues here as to what's going wrong?  

Thanks and regards

Comment 31 tagwerk19 2021-03-04 10:02:15 UTC

(In reply to Peter Wibberley from comment #30)
> ... Any clues ...
First off, I'd say create a file
    /etc/sysctl.d/40-neon-inotify.conf
as per comment #12 (at least until 5.11). That would set the max_user_watches every reboot.

Then I'd see if
    coredumpctl
gave a list of dumps as per comment #23

If it does, and you can get a backtrace, that would be a step forward.

Comment 32 tagwerk19 2021-03-21 14:51:01 UTC

(In reply to Harald Sitter from comment #7)
> ... of the kernel up to 5.11 where it was changed to the dynamic calculation ...
Fedora 33 just got 5.11 and I get
    fs.inotify.max_user_watches = 524288
in a 4 GB VM guest

Comment 33 tagwerk19 2021-03-21 15:18:31 UTC

Apologies..

Looks like Fedora was a bad test case, it already had max_user_watches set :-/

Comment 34 Peter 2021-05-24 21:24:57 UTC

(In reply to tagwerk19 from comment #31)
> (In reply to Peter Wibberley from comment #30)
> > ... Any clues ...
> First off, I'd say create a file
>     /etc/sysctl.d/40-neon-inotify.conf
> as per comment #12 (at least until 5.11). That would set the
> max_user_watches every reboot.
> 
> Then I'd see if
>     coredumpctl
> gave a list of dumps as per comment #23
> 
> If it does, and you can get a backtrace, that would be a step forward.

Tagwerk19, 

Sorry about the long gap, but life got in the way for a couple of months.  

In the meantime, I migrated from Neon LTS to the User Edition (using the instructions at https://userbase.kde.org/Neon/LTS/EOL).  I had hoped that this might cure the problem, but alas not.  

The situation now is as follows:  

(1) If I use the default value for fs.inotify.max_user_watches of 8192 and run 'balooctl enable', I get a stream of messages, "kf.baloo: KDE Baloo File Indexer has reached the inotify folder watch limit. File changes will be ignored."

(2) If I just double fs.inotify.max_user_watches to only 16384, and run 'balooctl enable', the watch limit error messages stop.  However, 'balooctl status' shows the Baloo File Indexer as stopping after a few seconds.  

(3) If I run 'balooctl enable', wait for it to stop, and then run 'coredumpctl gdb <PID>', as you recommended, I get, 

#0  0x00007f1aff1919f4 in ?? () from /usr/lib/x86_64-linux-gnu/liblmdb.so.0
#1  0x00007f1aff194e80 in ?? () from /usr/lib/x86_64-linux-gnu/liblmdb.so.0
#2  0x00007f1aff195165 in ?? () from /usr/lib/x86_64-linux-gnu/liblmdb.so.0
#3  0x00007f1aff195892 in ?? () from /usr/lib/x86_64-linux-gnu/liblmdb.so.0
#4  0x00007f1aff195ed4 in mdb_get () from /usr/lib/x86_64-linux-gnu/liblmdb.so.0
#5  0x00007f1b0069d458 in Baloo::IdFilenameDB::get(unsigned long long) () from /usr/lib/x86_64-linux-gnu/libKF5BalooEngine.so.5
#6  0x00007f1b006921aa in Baloo::DocumentUrlDB::get(unsigned long long) const () from /usr/lib/x86_64-linux-gnu/libKF5BalooEngine.so.5
#7  0x00007f1b006afa19 in Baloo::Transaction::documentUrl(unsigned long long) const () from /usr/lib/x86_64-linux-gnu/libKF5BalooEngine.so.5
#8  0x000055fa2c6779f3 in ?? ()
#9  0x00007f1b006be7ac in Baloo::WriteTransaction::removeRecursively(unsigned long long, std::function<bool (unsigned long long)> const&) () from /usr/lib/x86_64-linux-gnu/libKF5BalooEngine.so.5
...
#65486 0x00007f1b006be7fe in Baloo::WriteTransaction::removeRecursively(unsigned long long, std::function<bool (unsigned long long)> const&) () from /usr/lib/x86_64-linux-gnu/libKF5BalooEngine.so.5
#65487 0x000055fa2c67811c in ?? ()
#65488 0x00007f1b001f5ff2 in ?? () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#65489 0x00007f1b001f2bec in ?? () from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#65490 0x00007f1aff200609 in start_thread (arg=<optimised out>) at pthread_create.c:477
#65491 0x00007f1affd25293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(4) I've also run strace when the indexer is still running and again when it's stopped, and the outputs look completely different.  If it's running, the strace terminates in an orderly manner, but if the indexer isn't running then I get the SEGV fault. The fault seems to occur when a poll command gets a timeout.  
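Because the middle of that backtrace is elided, a summary of the full one would help. In the excerpt, frame numbers run up to #65491 and both visible ends of the elided run (#9 and #65486) are Baloo::WriteTransaction::removeRecursively, which suggests (though the elided part cannot confirm it) tens of thousands of nested calls; running out of thread stack that way would produce exactly this kind of SIGSEGV. A sketch that assumes gdb's usual `bt` line format and counts frames per function in a backtrace saved to a text file:

```python
import re
import sys
from collections import Counter

# gdb frame lines look like
#   #9  0x00007f1b006be7ac in Baloo::WriteTransaction::removeRecursively(unsigned long long, ...) () from ...
#   #0  foo (x=1) at foo.c:3            <- some innermost frames have no address
FRAME = re.compile(r"#(?P<num>\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(?P<func>[^\s(]+)")


def summarize_backtrace(text):
    """Return (deepest frame number, Counter of function names) for gdb `bt` output."""
    functions = Counter()
    deepest = -1
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match is None:
            continue                           # "at file:line" continuations, "...", etc.
        deepest = max(deepest, int(match["num"]))
        functions[match["func"]] += 1
    return deepest, functions


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as bt:
        deepest, functions = summarize_backtrace(bt.read())
    print(f"{deepest + 1} frames")
    for name, count in functions.most_common(5):
        print(f"{count:>7}  {name}")
```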

No idea what to make of all this!  I would try a fresh install, but I had hoped the switch to the User Edition would achieve the same thing.  Any suggestions?  

As ever, many thanks and regards

Peter

Comment 35 tagwerk19 2021-05-26 22:54:55 UTC

(In reply to Peter Wibberley from comment #34)
> No idea what to make of all this!  I would try a fresh install, but I had
> hoped the switch to the User Edition would achieve the same thing.  Any
> suggestions?  
Feels like a handful of random jigsaw puzzle pieces :-)

I think you are now OK with the "max_user_watches". The baloo_file process is picking up all your directories and able to watch for changes.

I'm not sure what to make of the dump; I can always hope that someone else jumps in to help. You may be told "you need to install the debug symbols"; with them, the "??"s in the dump would appear as function names and line numbers.

Some speculation though...

It looks as if the crash happens when updating the index, not when reading any particular file; "liblmdb" is the library that accesses the index. It also looks as if baloo is trying to delete a "load" of index entries. This fits with the "Checking for stale index entries" message in comment 11.

And early on you said your index file was 28 GB, which is, well, a lot. You can get a breakdown of how much free space there is in the database with "baloo indexSize". I wonder, though, whether the next step is to disable baloo, rename the baloo "index" file so it is out of harm's way, and re-enable it to start the indexing process again. It could take a while...

Maybe you can "phase" the reindexing: start with a few folders, check the behaviour, and then add more.

Comment 36 tagwerk19 2021-05-26 22:57:04 UTC

(In reply to tagwerk19 from comment #35)
>   baloo indexSize
No....
    balooctl indexSize

Comment 37 Wouter Haffmans 2022-11-02 20:59:45 UTC

I'm getting the same issue still on ArchLinux (baloo 5.99.0-1) with an infinite loop at WriteTransaction::removeRecursively.

Running "gdb /usr/lib/baloo_file", the crash seems to happen when the stack is full of identical frames:

#16 0x00007ffff7e0bfc5 in Baloo::WriteTransaction::removeRecursively(unsigned long long, std::function<bool (unsigned long long)> const&) (this=0x7fbfdc004df0, parentId=36028801313933346, shouldDelete=...)
    at /usr/src/debug/baloo-5.99.0/src/engine/writetransaction.cpp:167
#17 0x00007ffff7e0bfc5 in Baloo::WriteTransaction::removeRecursively(unsigned long long, std::function<bool (unsigned long long)> const&) (this=0x7fbfdc004df0, parentId=36028801313933346, shouldDelete=...)
    at /usr/src/debug/baloo-5.99.0/src/engine/writetransaction.cpp:167
#18 0x00007ffff7e0bfc5 in Baloo::WriteTransaction::removeRecursively(unsigned long long, std::function<bool (unsigned long long)> const&) (this=0x7fbfdc004df0, parentId=36028801313933346, shouldDelete=...)
    at /usr/src/debug/baloo-5.99.0/src/engine/writetransaction.cpp:167

It looks like this entry in particular (parentId=36028801313933346) has a child with the same id. Gdb confirms this is the first entry in the list:

(gdb) frame 1071
#1071 0x00007ffff7e0bfc5 in Baloo::WriteTransaction::removeRecursively(unsigned long long, std::function<bool (unsigned long long)> const&) (this=0x7fbfdc004df0, parentId=36028801313933346, shouldDelete=...)
    at /usr/src/debug/baloo-5.99.0/src/engine/writetransaction.cpp:167
(gdb) print parentId
$1 = 36028801313933346
(gdb) print reinterpret_cast<quint64*>((reinterpret_cast<const char*>(children.d)+children.d->offset))[0]
$2 = 36028801313933346

Looks to me like a parent that has itself as a child (from "docUrlDb.getChildren(parentId)") causes the infinite recursion, resulting in a stack overflow (i.e. segfault).
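That analysis can be illustrated with a toy model. Everything below is hypothetical: `children` is a plain dict standing in for what docUrlDb.getChildren() returns, and neither function is baloo's actual code (writetransaction.cpp is C++). The naive version mirrors the recursive shape in the backtraces and never terminates when an id lists itself as a child; in Python that surfaces as RecursionError, whereas native code runs off the end of the thread's stack and dies with SIGSEGV. The guarded version walks iteratively with a visited set, so a self-reference or a longer cycle is visited once and reported instead of recursing forever.

```python
from typing import Dict, List, Set, Tuple

# Toy stand-in for the parent -> children mapping returned by docUrlDb.getChildren().
# This is NOT baloo's data structure or code; ids are plain ints.
Children = Dict[int, List[int]]


def remove_recursively_naive(parent: int, children: Children, removed: List[int]) -> None:
    """Recursive removal with no cycle check: the shape seen in the backtraces."""
    for child in children.get(parent, []):
        remove_recursively_naive(child, children, removed)   # never ends if child == parent
    removed.append(parent)


def remove_recursively_guarded(root: int, children: Children) -> Tuple[List[int], Set[int]]:
    """Iterative removal (children before parents) that tolerates self-references and cycles.

    Returns the ids in removal order and the ids that were reached more than once,
    i.e. the corrupt links a real fix would want to report or repair.
    """
    removed: List[int] = []
    repeated: Set[int] = set()
    seen: Set[int] = {root}
    stack: List[Tuple[int, bool]] = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if children_done:
            removed.append(node)
            continue
        stack.append((node, True))            # come back to this node after its children
        for child in children.get(node, []):
            if child in seen:
                repeated.add(child)           # e.g. a parent listed as its own child
                continue
            seen.add(child)
            stack.append((child, False))
    return removed, repeated


if __name__ == "__main__":
    corrupt = {1: [1, 2], 2: [3]}             # id 1 lists itself as a child
    try:
        remove_recursively_naive(1, corrupt, [])
    except RecursionError:
        print("naive version: RecursionError (native code would overflow the stack)")
    print("guarded version:", remove_recursively_guarded(1, corrupt))
```

Walking with an explicit stack also removes the dependence on thread stack size for very deep but legitimate folder trees. Whether baloo should skip, log, or repair such an entry is a design decision this sketch does not settle.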

Software/OS versions:
Operating System: Arch Linux
KDE Plasma Version: 5.26.2
KDE Frameworks Version: 5.99.0
Qt Version: 5.15.6
Kernel Version: 6.0.6-arch1-1 (64-bit)
Graphics Platform: X11