Bug 413694 - Baloo loops when indexing the nixpkgs source tree
Status: RESOLVED WORKSFORME
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: Baloo File Daemon
Version: unspecified
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Stefan Brüns
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-11-01 01:49 UTC by p3dimaria
Modified: 2022-01-12 17:23 UTC

See Also:
Latest Commit:
Version Fixed In:


Description p3dimaria 2019-11-01 01:49:49 UTC
SUMMARY


STEPS TO REPRODUCE
1. git clone https://github.com/NixOS/nixpkgs
2. balooctl check
3. balooctl monitor
OBSERVED RESULT
balooctl monitor shows the same files being indexed again and again, and balooctl status shows no reduction in the number of files to be indexed.

EXPECTED RESULT
balooctl status shows progressively fewer files to be indexed

SOFTWARE/OS VERSIONS
Operating System: NixOS 20.03pre198214.4cd2cb43fb3
KDE Plasma Version: 5.16.5
KDE Frameworks Version: 5.62.0
Qt Version: 5.12.5
Kernel Version: 5.2.21
OS Type: 64-bit
Processors: 4 × Intel® Core™ i7-5500U CPU @ 2.40GHz
Memory: 15,6 GiB


ADDITIONAL INFORMATION
Comment 1 hsngrmpf+kde 2019-11-13 16:10:14 UTC
Having the exact same issue in nixos-19.03. But for me baloo is stuck while indexing a custom download of the tor browser in the background.

Also, baloo has a memory leak while doing this, which makes it consume several gigabytes of memory after a while. This is the only reason I noticed the problem.
I never touched baloo manually. I didn't even know what it was until 15 minutes ago, when I decided to finally start debugging this crazy thing called 'baloo_file_ext' that I have had to kill over and over again for days to keep my system from freezing.

Please tell me how I can help to debug this.
My balooctl monitor output looks like this:

$ balooctl monitor
Press ctrl+c to stop monitoring
File indexer is running
Indexing file content
Indexing: /home/grmpf/synced/programs/tor-browser_en-US/Browser/TorBrowser/Tor/PluggableTransports/obfs4proxy: Ok
Indexing: /home/grmpf/synced/programs/tor-browser_en-US/start-tor-browser.desktop: Ok
Indexing: /home/grmpf/synced/programs/tor-browser_en-US/Browser/TorBrowser/Tor/PluggableTransports/obfs4proxy: Ok
Indexing: /home/grmpf/synced/programs/tor-browser_en-US/start-tor-browser.desktop: Ok
Indexing: /home/grmpf/synced/programs/tor-browser_en-US/Browser/TorBrowser/Tor/PluggableTransports/obfs4proxy: Ok
Indexing: /home/grmpf/synced/programs/tor-browser_en-US/start-tor-browser.desktop: Ok
...

It always shows the same two files being indexed over and over again.
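
The repetition in the monitor output above can be counted with a small script. This is a minimal sketch and not part of Baloo: it assumes the "Indexing: <path>: Ok" line format shown above, and the helper name count_repeats is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "Indexing: /some/path: Ok".
# The format is assumed from the monitor output quoted above.
LINE_RE = re.compile(r"^Indexing: (?P<path>.+): (?P<status>\w+)$")


def count_repeats(lines):
    """Return a Counter of paths that were reported as indexed more than once."""
    seen = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            seen[match.group("path")] += 1
    # Keep only the paths that appear at least twice.
    return Counter({path: n for path, n in seen.items() if n > 1})


# Made-up sample lines in the same format as the output above.
sample = [
    "File indexer is running",
    "Indexing: /tmp/example/a.bin: Ok",
    "Indexing: /tmp/example/b.desktop: Ok",
    "Indexing: /tmp/example/a.bin: Ok",
]
for path, n in count_repeats(sample).most_common():
    print(n, path)
```

Since balooctl monitor runs until interrupted, one way to use this is to redirect its output to a file for a while and then pass the lines of that file to count_repeats.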
Comment 2 Stefan Brüns 2020-08-06 02:25:25 UTC
Are there any messages in the journal?
Comment 3 Bug Janitor Service 2020-08-21 04:33:09 UTC
Dear Bug Submitter,

This bug has been in NEEDSINFO status with no change for at least
15 days. Please provide the requested information as soon as
possible and set the bug status as REPORTED. Due to regular bug
tracker maintenance, if the bug is still in NEEDSINFO status with
no change in 30 days the bug will be closed as RESOLVED > WORKSFORME
due to lack of needed information.

For more information about our bug triaging procedures please read the
wiki located here:
https://community.kde.org/Guidelines_and_HOWTOs/Bug_triaging

If you have already provided the requested information, please
mark the bug as REPORTED so that the KDE team knows that the bug is
ready to be confirmed.

Thank you for helping us make KDE software even better for everyone!
Comment 4 hsngrmpf+kde 2020-08-21 04:49:11 UTC
I've switched away from KDE to i3 in the meantime, so I cannot tell.
I'll just link you the issue where I was originally coming from: https://github.com/NixOS/nixpkgs/issues/63489

Maybe the problem is related to btrfs (and its autodefrag option)?

But even if it's unknown why baloo rescans the same files over and over again, it still should not leak memory.
So there are 2 bugs in one. The memory leaking should be fixable without knowing why it loops.
Comment 5 Christoph Feck 2020-09-03 09:38:33 UTC
New information was added with comment 4; changing status for inspection.
Comment 6 Stefan Brüns 2020-09-03 12:20:37 UTC
Bug reporter has *not* provided the information asked for, and obviously has no interest to help.
Comment 7 Joachim Wagner 2021-12-15 10:50:03 UTC
I observe the same / similar issue with btrfs for /home. I've about 20k files, mostly PDFs, to be indexed and when all files are indexed baloo starts over again in an endless loop.

In /var/log/messages, I see lots of baloo_file_extractor / kf.baloo messages saying "id seems to have changed. Perhaps baloo was not running, and this file was deleted + re-created", as reported in https://github.com/NixOS/nixpkgs/issues/63489#issuecomment-563007599
@Stefan: Is /var/log/messages the "journal" you are referring to?

I am using the default KDE of openSUSE Leap 15.3. According to Kontact > Help > About Kontact > Libraries, it uses "KDE Frameworks 5.76.0" and "Qt 5.12.7 (built against 5.12.7)".

I did not have this problem using XFS for /home on the same OS, so I concur with comment 4 that this may be specific to using btrfs.

I am not too worried about the CPU usage but the index growing by a few GB in every round is a problem for me.

If the issue cannot easily be fixed I'd therefore welcome a partial solution at least avoiding large index updates. This could for example be implemented by recording the sha256 fingerprint of every indexed file and only indexing the contents of files with a new sha256 fingerprint and linking files with the same content in the baloo database.
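
The fingerprint idea above could look roughly like the following. This is only a sketch of the proposal, not of how Baloo stores data: the class FingerprintIndex, its in-memory dictionaries, and the extract callback are all hypothetical stand-ins.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class FingerprintIndex:
    """Hypothetical in-memory stand-in for the index; not Baloo's database."""

    def __init__(self):
        self.content_by_hash = {}  # fingerprint -> extracted content
        self.hash_by_path = {}     # path -> fingerprint (links files to content)

    def add(self, path, extract):
        """Record path; call extract(path) only if its content is new.

        Returns True if content was extracted, False if it was reused.
        """
        fingerprint = sha256_of_file(path)
        self.hash_by_path[path] = fingerprint
        if fingerprint in self.content_by_hash:
            return False  # same content already indexed, just link the path
        self.content_by_hash[fingerprint] = extract(path)
        return True
```

With this layout, re-visiting an unchanged file still costs a full read to compute the hash, but it writes no new content data to the index.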
Comment 8 Joachim Wagner 2021-12-15 11:53:56 UTC
Re the idea of recording the sha256 of each file, this may be problematic for large files with only a small content area such as metadata and subtitles of a video. Still, reading excessive amounts of data can be preferable over writing excessive amounts of index data. A solution may be to require the content indexer modules to support returning a content fingerprint, with the default implementation running the normal content extraction and calculating a fingerprint over the extracted content. File-format-specific implementations can skip some processing steps such as decompression of a data stream and character set conversion.
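
As a rough illustration of the interface described above: the class and method names below (Extractor, content_fingerprint, HeaderOnlyExtractor) are hypothetical and do not correspond to Baloo's extractor plugins, and the "header plus payload" file format is a toy assumption.

```python
import hashlib


class Extractor:
    """Hypothetical base class for a content indexer module."""

    def extract(self, data: bytes) -> str:
        raise NotImplementedError

    def content_fingerprint(self, data: bytes) -> str:
        # Default: run the normal extraction and hash the extracted text.
        text = self.extract(data)
        return hashlib.sha256(text.encode("utf-8")).hexdigest()


class PlainTextExtractor(Extractor):
    """Uses the default fingerprint: the whole file is content."""

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


class HeaderOnlyExtractor(Extractor):
    """Toy format: the first line is metadata, the rest is an opaque payload."""

    def extract(self, data: bytes) -> str:
        return data.split(b"\n", 1)[0].decode("utf-8", errors="replace")

    def content_fingerprint(self, data: bytes) -> str:
        # Format-specific shortcut: hash the raw header bytes directly,
        # skipping the character set conversion done by extract().
        return hashlib.sha256(data.split(b"\n", 1)[0]).hexdigest()
```

In this sketch two files with the same header but different payloads get the same fingerprint from HeaderOnlyExtractor, so the large payload never influences whether the index is rewritten.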
Comment 9 tagwerk19 2021-12-15 13:18:17 UTC
(In reply to Joachim Wagner from comment #7)
> I observe the same / similar issue with btrfs for /home ...
> ... openSUSE Leap 15.3 ...
Yes, there's an issue with openSUSE, BTRFS and multiple subvols, as per:
    https://bugs.kde.org/show_bug.cgi?id=402154#c12
I'm not sure this would explain the original issue though, of the same files being indexed in a loop. Might be worthwhile checking whether the baloo_file_extractor process is crashing.
Comment 10 Joachim Wagner 2022-01-12 17:23:17 UTC
Yes, thanks. My logs confirm the re-indexing co-occurs with the use of a new virtual device number for the btrfs filesystem.
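
To illustrate why a new device number would make every file look new, here is a minimal sketch. It assumes, in line with the "id seems to have changed" messages quoted in comment 7, that a file's identity in the index depends on both the device number and the inode; the (st_dev, st_ino) pair used here is illustrative and not Baloo's actual encoding, and the function names are hypothetical.

```python
import os


def make_document_id(st_dev: int, st_ino: int) -> tuple:
    """Illustrative file identity: the (device, inode) pair."""
    return (st_dev, st_ino)


def document_id_for(path):
    """Build the illustrative ID for a path from its current stat() result."""
    info = os.stat(path)
    return make_document_id(info.st_dev, info.st_ino)


def device_changed(path, recorded_dev):
    """Return True if the filesystem now reports a different device number."""
    return os.stat(path).st_dev != recorded_dev


# Same inode, different device number: the identity no longer matches,
# so an index keyed on it would treat the file as deleted and re-created.
print(make_document_id(40, 1234) == make_document_id(41, 1234))
```

Recording os.stat(path).st_dev for a file under /home before and after a reboot, and passing the earlier value to device_changed, is one way to check whether the device number is stable on a given system.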