Bug 400704 - Baloo indexing I/O introduces serious noticeable delays
Summary: Baloo indexing I/O introduces serious noticeable delays
Status: CONFIRMED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: Baloo File Daemon
Version: 5.64.0
Platform: Other Linux
Importance: VHI major
Target Milestone: ---
Assignee: baloo-bugs-null
Duplicates: 359119 376446 379011 384234 393465 400932 401279
Depends on:
Blocks: 446071
Reported: 2018-11-05 15:22 UTC by Axel Braun
Modified: 2022-07-15 17:06 UTC (History)
CC List: 26 users




Description Axel Braun 2018-11-05 15:22:30 UTC
As suggested in https://bugs.kde.org/show_bug.cgi?id=333655#c73 , let's open a new bug for baloo 5:

I'm running baloo 5.45.0 on openSUSE Leap 15, and notice that my entire desktop freezes regularly for 1-2 minutes(!). The CPU monitor reports 100% load on both cores during that time, but top does not show any process with considerable CPU load. The problem is rather the combination of baloo and akonadi, as iotop shows:

Total DISK READ :      10.37 M/s | Total DISK WRITE :    1060.53 K/s
Actual DISK READ:      10.37 M/s | Actual DISK WRITE:     197.36 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                      
 4497 idle axel        9.73 M/s    0.00 B/s  0.00 % 99.52 % baloo_file_extractor
 2847 idle axel      651.54 K/s 1058.15 K/s  0.00 % 97.97 % akonadi_indexing_agent --identifier akonadi_indexing_agent
   23 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.10 % [kworker/1:1]
 2479 be/4 axel        0.00 B/s    0.00 B/s  0.00 %  0.08 % plasmashell
  849 be/4 root        4.76 K/s    0.00 B/s  0.00 %  0.00 % [xfsaild/sda2]

(interesting percentage calculation of iotop by the way)

The system disk is an SSD, the data disk is a hybrid 1 TB disk with an 8 GB cache.
I have configured the search to not index file content. That's why the heavy I/O surprises me even more.
Comment 1 Stefan Brüns 2018-11-05 17:52:06 UTC
Unfortunately even the two most fundamental databases in baloo, the Terms and the FileNameTerms DBs, show O(M^2) behaviour on updates. Every time e.g. a "pdf" is changed, the associated value for the "pdf" term (i.e. the IDs of all matching documents) is updated.

An update may happen in two cases:
1. an existing file is appended to, tagged, renamed, ...
2. an existing file is replaced by an updated one (i.e. application creates a temporary file on saving and atomically replaces the old one).

For (1.), the update can be minimized, i.e. only updating the terms which have actually changed. I have some experimental patches for this.

For (2.), the database scheme has to be changed significantly.
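
The O(M^2) behaviour described above can be illustrated with a toy model: if each term maps to one serialized value holding every matching document ID, then adding a single document rewrites the whole value, so total write volume grows quadratically with the number of documents sharing a term. A minimal Python sketch (names and serialization are illustrative, not baloo's actual storage code):

```python
class PostingStore:
    """Toy model of a key/value store where each term maps to ONE
    serialized posting list (all matching document IDs), similar in
    spirit to an LMDB-style DB. Illustrative only."""
    def __init__(self):
        self.db = {}             # term -> serialized posting list (bytes)
        self.bytes_written = 0   # total write volume

    def add_document(self, term, doc_id):
        raw = self.db.get(term, b"")
        ids = [int.from_bytes(raw[i:i + 8], "little")
               for i in range(0, len(raw), 8)]
        ids.append(doc_id)
        value = b"".join(i.to_bytes(8, "little") for i in sorted(ids))
        self.db[term] = value             # the WHOLE value is rewritten
        self.bytes_written += len(value)  # cost ~ current posting-list size

store = PostingStore()
for doc_id in range(1000):
    store.add_document("pdf", doc_id)
# M updates cost 8*(1 + 2 + ... + M) bytes: O(M^2) total write volume.
print(store.bytes_written)  # 8 * 1000 * 1001 / 2 = 4004000
```

This is why only updating the terms that actually changed (Stefan's point 1) helps, but replacing a file wholesale (point 2) still forces large rewrites.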
Comment 2 Axel Braun 2018-11-06 08:06:56 UTC
Thanks for your explanation, Stefan, although I don't know how I can influence the behaviour. If I start the computer the next day, I would not expect heavy re-indexing.
Are the database stores for akonadi (~/.local/share/akonadi) excluded from baloo indexing by default?
Comment 3 Mayeul C. 2018-11-10 22:00:53 UTC
I came here to report the same problem. The system frequently freezes, with the mouse not moving for a couple of seconds, or the screen not being refreshed.

Regardless of what is causing high IO usage within baloo and akonadi, I consider them background tasks (most of the time), and I would like to see them prioritized as such.

Could baloorunner be run with the equivalent of ionice -c 3 by default (and maybe nice as well)? My CPU is quite beefy, but I suffer from I/O contention:

Arch Linux
Ryzen 7 2700X
8 GiB DDR4 2666
4TiB HDD system drive (WDC WD40EZRZ)

I will probably upgrade to a SSD at some point, but this is no excuse for a background task to consume all of the available disk IO bandwidth ;)
Comment 4 Stefan Brüns 2018-11-10 22:05:57 UTC
(In reply to Mayeul Cantan from comment #3)

> Could baloorunner be run with the equivalent of ionice -c 3 by default (and
> maybe nice as well)? My CPU is quite beefy, but I suffer from I/O contention:

baloo_file/baloo_file_extractor, which are the indexing tasks (i.e. the ones causing write accesses), are already running with lowest priority. baloorunner is not relevant here.

Even with low priority, the kernel eventually has to flush the write buffers, causing the high I/O latency for other tasks.
Comment 5 Axel Braun 2018-11-11 16:21:24 UTC
(In reply to Stefan Brüns from comment #4)

> Even with low priority, the kernel eventually has to flush the write
> buffers, causing the high I/O latency for other tasks.

Shouldn't the I/O traffic from higher-priority tasks be processed first as well? I mean, if baloo does not get any CPU time, how can it create such high traffic? Looking at iotop, it is mostly a factor of 100 to 1000 higher than other tasks...
Comment 6 Mayeul C. 2018-11-12 14:13:20 UTC
(In reply to Axel Braun from comment #5)
> (In reply to Stefan Brüns from comment #4)
> 
> > Even with low priority, the kernel eventually has to flush the write
> > buffers, causing the high I/O latency for other tasks.
> 
> Shouldn't the I/O traffic from higher-priority tasks be processed first as
> well? I mean, if baloo does not get any CPU time, how can it create such
> high traffic? Looking at iotop, it is mostly a factor of 100 to 1000 higher
> than other tasks...

From this link, that seems to be the case (though a link to the kernel source would have been nicer):
https://unix.stackexchange.com/questions/153505/how-disk-io-priority-is-related-with-process-priority

> io_priority = (cpu_nice + 20) / 5
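
The quoted mapping (the kernel's fallback for tasks without an explicit ionice class) is easy to check: nice 19 only reaches best-effort level 7, which still competes with normal I/O, unlike the separate "idle" class (ionice -c 3). A quick sketch:

```python
def io_priority(cpu_nice: int) -> int:
    """Best-effort I/O priority level (0 = highest, 7 = lowest) derived
    from a task's CPU nice value when no explicit ionice class is set,
    per the formula quoted above."""
    return (cpu_nice + 20) // 5

# Even a maximally niced indexer stays in the best-effort class:
print(io_priority(0))    # default nice -> level 4
print(io_priority(19))   # nice 19 -> level 7, lowest best-effort
```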

In my case, though, it was always baloorunner showing at 99.99 % I/O in iotop. baloo_file_extractor would also run sometimes, but with a lesser subjective impact on performance.
Setting baloorunner to a lower priority using ionice seemed to improve things quite a bit, although I would have to confirm it.

I get the point about needing to flush the cache at some point. Unfortunately, I am at a loss as to why my mouse freezes because of it. I am on an 8-core (16 SMT threads) CPU, and only a couple are used by the kernel. CPU <-> RAM bandwidth should not be the limiting factor, and other threads should be able to go through while CPU <-> SATA controller is being waited on. Maybe it has to do with interrupts coming in from the SATA controller?
Comment 7 Jack 2018-11-17 18:22:20 UTC
Same problem with baloo 5.52.0 (on Artix Linux).  GUI is almost completely unresponsive.  Switching to text console and back updates the screen, but it mostly stays frozen.  Sometimes clicking to switch between applications updates things when I click, but otherwise frozen.

iotop shows baloo_file_extractor and one [kworker...] job at 99.99% (sometimes alternating with a lower value still above 50%). Systemsettings/search does not have any setting to turn indexing off, although no plugin is checked. balooctl does seem to show everything disabled and stopped, so I have no idea why.

For me, this seems to have started relatively recently, but it's on a laptop I don't use constantly, so I'm really not sure what update triggered it. Is there anything else I can check, or any other data I can provide? It makes the laptop essentially unusable. (I'm posting this from a different PC (Gentoo), although baloo here is 5.50.0 - I'll try updating.)
Comment 8 Jack 2018-11-17 22:42:52 UTC
After several reboots, I finally had systemsettings5 show me file search, and turning that off, and another reboot, seems to have stopped the indexer from running.

The odd thing was that despite earlier doing balooctl suspend, balooctl stop, and balooctl disable, and balooctl showing disabled, it was still running. Not really sure what finally stopped it. Hopefully it won't just start up again by itself.
Comment 9 Nate Graham 2018-11-26 17:19:18 UTC
*** Bug 400932 has been marked as a duplicate of this bug. ***
Comment 10 Nate Graham 2018-11-26 17:19:30 UTC
*** Bug 401279 has been marked as a duplicate of this bug. ***
Comment 11 Nate Graham 2018-11-26 20:24:59 UTC
*** Bug 384234 has been marked as a duplicate of this bug. ***
Comment 12 Nate Graham 2018-11-26 20:27:31 UTC
*** Bug 379011 has been marked as a duplicate of this bug. ***
Comment 13 Nate Graham 2018-11-26 20:27:42 UTC
*** Bug 376446 has been marked as a duplicate of this bug. ***
Comment 14 Nate Graham 2018-11-26 21:20:33 UTC
There's a proposed patch in Bug 356357 that sparked a serious discussion about the frequency with which the DB should be written to, but unfortunately it went nowhere.
Comment 15 Nate Graham 2018-11-26 21:32:09 UTC
*** Bug 359119 has been marked as a duplicate of this bug. ***
Comment 16 Nate Graham 2018-11-26 21:59:50 UTC
*** Bug 393465 has been marked as a duplicate of this bug. ***
Comment 17 Alberto Salvia Novella 2018-11-27 01:12:25 UTC
Since I'm not using Plasma right now, I'm unsubscribing from this bug, but feel free to re-subscribe me if you need any help from me.
Comment 18 Kevin Colyer 2018-12-06 12:36:59 UTC
(In reply to Nate Graham from comment #14)
> There's a proposed patch in Bug 356357 that sparked a serious discussion
> about the frequency with which the DB should be written to, but
> unfortunately it went nowhere.

I am still suffering from this problem. Yesterday Nextcloud decided to refresh my files and downloaded about 10 GB of files. Baloo started indexing and my desktop stalled. Chrome couldn't start and I could do no work!

I do hope we can get a solution soon - this is a long-standing problem. Finding things with baloo saves me time... but not as much as I am losing while waiting for the indexer!

Please can we have a solution - 

I like the idea of throttling database updates - perhaps some sort of exponential back-off approach, but inverted, so that a high number of files indexed per minute changes the update batch to 80, 160, 320 ... up to some limit?
Comment 19 Stefan Brüns 2018-12-06 14:18:43 UTC
An exponential backoff would only help if baloo indexed the same files recurrently.

If you add new documents to your indexed folders, baloo will process them. It will not get better when you commit changesets of double the size; the stalls will just be even longer.

This is *not* a trivial problem which can be solved by adjusting a single knob.

Baloo's data structures currently impose a changeset size which is approximately proportional to the size of the database. Adding/changing a single small document can cause a DB update of several hundred MBytes.
Comment 20 Kevin Colyer 2018-12-06 15:55:35 UTC
(In reply to Stefan Brüns from comment #19)
> An exponential backoff would only help if baloo indexed the same files
> recurrently.
> 
> If you add new documents to your indexed folders, baloo will process them.
> It will not get better when you commit changesets of double the size; the
> stalls will just be even longer.
> 
> This is *not* a trivial problem which can be solved by adjusting a single
> knob.
> 
> Baloo's data structures currently impose a changeset size which is
> approximately proportional to the size of the database. Adding/changing a
> single small document can cause a DB update of several hundred MBytes.

Thanks for the prompt feedback. Currently I have to do a manual exponential backoff of switching baloo off and turning it on overnight to do its indexing!

Given that a "single small document can cause a DB update of several 100 MBytes", might a fresh look need to be given to the underlying data structure? That seems sub-optimal to me as a user who is struggling with the indexing process's unintended side effects.
Comment 21 Stefan Brüns 2018-12-06 16:38:01 UTC
It would save a lot of developer time if not everyone would add their "me too" comments.

Changes to the database are planned, but this is not trivial. One structure may work well for a number of cases and cause huge problems for others. These changes have to be evaluated, for performance and for correctness.

The baloo codebase has been enhanced with additional unit tests recently, increasing code coverage and reducing the chance of regressions. This is an ongoing effort, likely taking several more months until completed.

Baloo is currently developed mostly by volunteers in their spare time. Development will not go faster by adding some more exclamation marks ...
Comment 22 Kevin Colyer 2018-12-06 16:57:06 UTC
(In reply to Stefan Brüns from comment #21)
> It would save a lot of developer time if not everyone would add their "me
> too" comments.
> 
> Changes to the database are planned, but this is not trivial. One structure
> may work well for a number of cases and cause huge problems for others.
> These changes have to be evaluated, for performance and for correctness.
> 
> The baloo codebase has been enhanced with additional unit tests recently,
> increasing code coverage and reducing the chance of regressions. This is an
> ongoing effort, likely taking several more months until completed.
> 
> Baloo is currently developed mostly by volunteers in their spare time.
> Development will not go faster by adding some more exclamation marks ...

Dear Developers,

I am supremely grateful for all the work and efforts that have gone into the indexing services for KDE. If I had the skills I would join you. I just glanced at the Git repo and realised how unskilled I am to contribute; I couldn't even find the schema. Baloo has improved greatly. 

However, I do wish to say: please don't discourage well-intentioned feedback. Without feedback from users about the actual problems they encounter, future priorities may not be as readily identified. As a long-term KDE user, enthusiast and advocate, feedback is one of my most important contributions. This thread follows from https://bugs.kde.org/show_bug.cgi?id=333655#c73 , which was started in 2014. I am only making my first comment now. The performance issues have been a problem for me all this time, and I went for a long season with baloo permanently off!

Do let me know if there is anything concrete I can contribute more than what I offer in these comments.
Comment 23 richard 2019-04-23 06:27:33 UTC
Currently, after system start and sometimes during work, baloo grabs one CPU at 100% for quite a while, eats up to 13 GB of RAM, and makes the system quite unresponsive when running (i7 with 4 cores + HT, 20 GB RAM, SSDs only).
This looks more like a complete reindexing of everything on the system, not related to the amount of changed files. And I don't see how I could find out what exactly baloo is working on by means of balooctl.
I use Fedora 29 with standard schedulers. Baloo taking 100% of a CPU does not seem reasonable to me; neither does taking up to 13 GB of RAM.

During the last long run, baloo status stated
130470/133866 files indexed and a current index size of 15,61 GiB. There was no change of more than 3000 large files since the last baloo 100% CPU run. Indexing a new 185 MB git clone should have been done in a very few minutes at most.

Further, the numbers in indexSize look strange to me:

Actual Size: 15,61 GiB
Expected Size: 9,16 GiB

           PostingDB:       1,40 GiB   120.956 %
          PositionDB:     133,54 MiB    11.266 %
            DocTerms:     877,32 MiB    74.014 %
    DocFilenameTerms:      13,61 MiB     1.148 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       2,52 MiB     0.213 %
          IdFileName:      10,07 MiB     0.850 %
             DocTime:       5,64 MiB     0.476 %
             DocData:       6,92 MiB     0.584 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       2,18 MiB     0.184 %

Why is the expected size only 2/3 of actual?
And why don't the DB sizes sum up to the actual size?
And what does 120% really mean?
Comment 24 Stefan Brüns 2019-04-23 10:25:58 UTC
(In reply to richard from comment #23)
> 
> Why is the expected size only 2/3 of actual?
> And why don't the DB sizes sum up to the actual size?
> And what does 120% really mean?

Re actual size:
https://cgit.kde.org/baloo.git/commit/?id=f8c51b23796523f9b2d9d1582c7fb874181fbf2f

Re 120%:
https://cgit.kde.org/baloo.git/commit/?id=7be886c93d13191c6ebdf72669f657cbbf45c2c7
Comment 25 Stefan Brüns 2019-04-23 10:32:18 UTC
==============================

Dear Users,

the issue described in this bug report is well understood. Solving the problem requires significant changes to the database scheme. Before making these changes, we have to be sure not to regress other use cases.

Screening bugs takes time - time better spent working on this problem and solving the other issues at hand.

Please refrain from adding additional comments here!

Kind regards, Stefan

==============================
Comment 26 richard 2019-04-26 06:36:32 UTC
Hi,
the coming change from "actual" to "file size" is good, as is the change from "expected" to "used".
The link concerning the 120% is not really clear to me; some sum was changed, but the output still remains unexplained - 120% of what?


I understand that you identified the database layout as *the* problem.
From my point of view, I'd see the CPU (-> I/O) greed as a problem. In another thread there was the statement that it works better / blocks the system less with other schedulers. It really would be OK, at least for me, if the indexing were done silently in the background and not as fast as possible, blocking the system. That looks independent of DB schema changes to me.

At the moment I work with ML datasets -> archive files, but below GB size. It looks as if baloo_file_extractor is the process to be blamed. Currently:
PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                        
 **** user      39  19  259,7g  13,6g  10,3g R  97,0  70,5 355:49.30 baloo_file_extr                    
I let it run the whole night to get through its work - not ready yet. Still freezing the system (even the mouse) quite often.
As written, I'd be OK with these files not being indexed as fast as possible. And I can't really understand how it can take baloo_file_extractor 7 hours of fast CPU time to index sub-GB archives.
That looks independent of DB schema changes to me.

You didn't answer the point about whether one can see what baloo is currently working on.
That could help a) for debugging and b) for adjusting/excluding directories/files from indexing.
With options to tune indexing like:
- don't index when (allow combinations):
-- the filetype matches
-- the size is smaller/bigger than a threshold
-- it has been at a specific disk location for more/less than a timespan
-- it was created/modified more/less than a timespan ago
and a monitor command that allows seeing the causes of freezes in real time,
and a log that allows seeing the causes of freezes (files) later (log start/stop of indexing per file; e.g. when indexing a file takes more than ___ minutes between start and stop, or more than ___ minutes since start, ...).
This looks independent of DB schema changes to me, and being able to tune baloo to simply not do some things it has problems with would help to optimize usability until the big rewrite is done.
It's a question of how priorities are set.
Comment 27 Øystein Steffensen-Alværvik 2019-11-28 15:24:30 UTC
Confirmed on openSUSE Tumbleweed with Frameworks 5.64. Everything freezes for about 30 seconds, works for 30 seconds, then freezes again. The only solution is to turn Baloo off completely. This is also a considerable problem when only files, not their contents, are being indexed.

Operating System: openSUSE Tumbleweed 20191124
KDE Plasma Version: 5.17.3
KDE Frameworks Version: 5.64.0
Qt Version: 5.13.1
Kernel Version: 5.3.12-1-default
OS Type: 64-bit
Processors: 4 × Intel® Core™ i5-4210U CPU @ 1.70GHz
Memory: 11,6 GiB
Comment 28 Øystein Steffensen-Alværvik 2019-12-17 10:09:38 UTC
This happens both when Baloo is indexing file *contents* and when it's just indexing info on files. I have to turn indexing completely off; otherwise my computer becomes practically unusable on every power-on.
This is new; I've never had trouble with Baloo on this laptop before. It's admittedly a 4-year-old computer, but the SSD is fast and the laptop otherwise handles most of my workflow completely fine.
Comment 29 Oded Arbel 2020-03-19 22:59:27 UTC
Same problem here - Baloo eats up all I/O even when reporting "Idle". This has become a problem only in the last year or so. I'm using a pretty beefy i7 device with an NVMe drive, and while Baloo is enabled the computer is often slow and freezes from time to time. Looking at CPU usage I see `baloo_file` take 50%~80% CPU, and the load average is around 3~4.5 (on a 4-core system).

Looking at IO:
---8<---
$ pidstat -G balo[o] -dl 5 1; balooctl status
Linux 5.4.0-17-generic (vesho)  03/20/2020      _x86_64_        (8 CPU)

12:48:52 AM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
12:48:57 AM  1000     80482  26164.94  64858.96      0.00     170  /usr/bin/baloo_file 

Average:      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
Average:     1000     80482  26164.94  64858.96      0.00     170  /usr/bin/baloo_file 
Baloo File Indexer is running
Indexer state: Idle
Total files indexed: 297,288
Files waiting for content indexing: 0
Files failed to index: 0
Current size of index is 2.56 GiB
----8<----

So balooctl reports "Idle" while baloo_file pushes more than 60 MB/s to the drive and does a not insignificant amount of reading.

In .xsession-errors log I can see a lot of messages like this:

----8<----
org.kde.baloo.engine: DocumentDB::get 307907124573241397 MDB_NOTFOUND: No matching key/data pair found
----8<----
Comment 30 Mircea Kitsune 2021-08-02 21:41:23 UTC
A very real and annoying issue. I've kept Baloo disabled for years now, due to it putting my hard drive into "disk sleep" and causing processes on the system to freeze while waiting for drive access. Nowadays I have a different HDD setup, so I managed to enable it with some directories blacklisted. It still eats more RAM than it should... if it's not drive I/O, it's the memory or CPU.
Comment 31 tagwerk19 2021-09-04 08:44:21 UTC
It could be that there are several different issues being "bundled together".

1...

    There are, for example, problems with openSUSE, which runs BTRFS
    with multiple subvolumes. Check by finding one of the indexed files
    and trying the following...

        stat testfile
        balooshow -x testfile 

    and

        baloosearch -i filename:testfile 

    The "stat" would give you the device and inode number of the file.
    You should see the same numbers listed in the "balooshow -x"
    results. See:

        https://bugs.kde.org/show_bug.cgi?id=402154#c12

    If the device/inode numbers change for a file, baloo will think it
    is a different file and index it again. You can see this evidenced
    in the "baloosearch -i" results, you could get multiple results
    (different ID's; same file)

2...

    Repeated spike loads at logon. In cases where there are *very* *many*
    new files, even if content indexing is disabled, the initial scan by
    baloo_file takes too many resources.

    My reading of the behaviour is that baloo_file does not "batch up"
    updates to the index as it discovers new/changed/deleted files.
    There is therefore no hint (looking at "balooctl status") that any
    progress is being made; the indexer may report "Idle" while just
    the initial scan (and not content indexing) is being done, and
    the RAM used by baloo_file can grow steadily (potentially extending
    into swap space).

    As per Bug 394750:

        https://bugs.kde.org/show_bug.cgi?id=394750#c13

    If the updates from an "initial scan" are done as a single transaction
    there are no checkpoints. Killing the process and starting again,
    rebooting or logging out and back in again will start "from scratch".

    Bug 428416 is also interesting in terms of what baloo_file is doing
    when it deals with a large indexing run.

3...

    It seems likely that with baloo reindexing files as they reappear
    with different ID's (as per '1' above) the index size balloons;
    on disc and in terms of pages pulled into memory. This will
    compound issue '2'.

4...

    On a positive note, the impact (as seen by the user) of a sync of
    the dirty pages to disc could be manageable if the index is on
    an SSD.

    Comment 19 argues against increasing the batch size (as the data
    will have to be written at some time). This would hammer HDD users
    but maybe has less impact on SSD users.

    With an SSD, there's the counter-argument that you want to avoid
    frequent rewrites to prolong the life of the disc. Gut feeling is
    that with a larger batch size, the data written to disc is less
    in total.

Wishlist/Proposals/Suggestions

    I think baloo needs to "batch up" its transactions in its initial scan.
    If I were to suggest "how often", I'd pick a time interval, maybe
    every 15 or 30 seconds.

    It would be nice to have a "balooctl" option (or a setting within
    baloofilerc) to tune the batch size used for baloo_file_extractor.
    That would make it possible to do indexing comparisons "in the
    real world".
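
The "batch up the initial scan" idea above could look something like the following time-based flusher (a hypothetical sketch, not baloo's actual code): updates accumulate in memory and are committed as one transaction at most every `interval` seconds, so each commit is a checkpoint and a killed or restarted scan does not start from scratch.

```python
import time

class BatchedIndexer:
    """Accumulate index updates and commit them as one transaction at
    most every `interval` seconds. Hypothetical sketch, not baloo's API."""
    def __init__(self, commit, interval=30.0, clock=time.monotonic):
        self.commit = commit       # callable taking a list of updates
        self.interval = interval
        self.clock = clock         # injectable clock, for testing
        self.pending = []
        self.last_commit = clock()

    def add(self, update):
        self.pending.append(update)
        if self.clock() - self.last_commit >= self.interval:
            self.flush()

    def flush(self):
        """Commit whatever is pending; every commit is a checkpoint."""
        if self.pending:
            self.commit(list(self.pending))
            self.pending.clear()
        self.last_commit = self.clock()
```

A time interval rather than a count keeps the stall bounded regardless of how fast files are discovered, which matches the 15-30 second suggestion above.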

Consider this as a "Where are we?" summary; an attempt to collect together different threads and weave in new evidence.
Comment 32 Mircea Kitsune 2021-09-04 08:57:59 UTC
The issue seems to have gotten somewhat better these days, especially with the latest Plasma version 5.22. Though I've since moved to an SSD/NVMe drive, which might be why disk sleep isn't as bad as it used to be during indexing.

Another issue now is that the baloo processes seem to use more memory than I wish they did, based on the amount of files indexed. Anyone with a large HDD but not enough RAM will need to blacklist every large directory.
Comment 33 tagwerk19 2021-09-24 07:19:21 UTC
(In reply to tagwerk19 from comment #31)
> Consider this as a "Where are we?" summary; an attempt to collect together
> different threads and weave in new evidence.
Weaving in a couple of extra references "for completeness":

5...
    Removing baloo records for deleted files seems to be slow
    (more I/O intensive than the original indexing). See Bug 442453

6...
    Running a "balooctl status" while baloo is removing records for
    deleted files causes memory consumption and index size to
    balloon, Bug 437754
Comment 34 pierre 2021-10-05 08:11:11 UTC
Hi,
one comment I have not seen in the long list since 2014:
The slowdown appeared just after I upgraded to 20.04 LTS, and I remember that I had the same problem 3 years ago after upgrading to 18.04. So I left the computer on for a day or two (during a weekend) so it would get the indexing over with.
Wouldn't it be nice if the database were left as it is while upgrading?
Comment 35 tagwerk19 2021-10-11 07:38:33 UTC
(In reply to pierre from comment #34)
> The slow down appears as I had just upgraded to 20.04 LTS ans I remember
> that I had the same problem 3 years ago after upgrading to 18.04. 
You would get a reindexing if the device number of your discs changed. You can see whether that has happened by running
    $ baloosearch -i filename:"one of your files"
and you get multiple results with different ID's. Check the file itself
    $ stat "one of your files"
and compare the device details:
    Device: fc01h/64513d    Inode: 1053347     Links: 1

Beyond that, I'm not sure. I don't remember having met the issue.
Comment 36 tagwerk19 2021-10-11 07:47:08 UTC
One more observation for the collection.

It may be that "spike loads" in memory usage trigger OOM protection and baloo_file_extractor and baloo_file are killed.

Tangentially observed in Fedora 35:
    https://bugs.kde.org/show_bug.cgi?id=443547#c2
but needs a closer look...
Comment 37 pierre 2021-10-11 08:33:43 UTC
(In reply to tagwerk19 from comment #35)
>     $ baloosearch -i filename:"one of your files"
> and you get multiple results with different ID's. Check the file itself
>     $ stat "one of your files"
Hi,
Just one file, but chosen at random. Actually, this file might not have been "baloo-ed" before I killed baloo_file_extractor. There is no way to find out but through sampling files at random and testing them the way you suggest, is there? (Way beyond my ability.)
Comment 38 tagwerk19 2021-11-23 13:46:43 UTC
Another reference "for completeness":

8...
    baloo_file_extractor can get caught on files that require hours to index, the example case
    being a PDF containing a scientific plot. The plot itself is compressed data with little
    indexable content, and unpacking it may require more RAM than you have available.

    See https://bugs.kde.org/show_bug.cgi?id=380456#c21

    It's possible that such indexing attempts trigger OoM protections and therefore never complete.

    It would make sense to have time/memory limits for such actions (and flag the file as
    "failed" if the extraction exceeds them).
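
Such a guard could be implemented by running the extractor in a child process with a wall-clock timeout and an address-space cap, and treating either limit being hit as "failed". A minimal Python sketch (the command and limit values are illustrative; Linux assumed):

```python
import resource
import subprocess

def run_with_limits(cmd, timeout_s=30, mem_bytes=512 * 1024 * 1024):
    """Run `cmd` (e.g. a text extractor) as a child process, killing it
    after `timeout_s` seconds and capping its address space at
    `mem_bytes`. Returns the output on success, or None so the caller
    can flag the file as "failed" and skip it in future runs."""
    def apply_limits():
        # Runs in the child just before exec: cap its virtual memory.
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    try:
        proc = subprocess.run(cmd, capture_output=True,
                              timeout=timeout_s, preexec_fn=apply_limits)
    except subprocess.TimeoutExpired:
        return None                # hung extraction: give up, move on
    return proc.stdout if proc.returncode == 0 else None
```

Keeping the extractor in a separate process (as baloo_file_extractor already is) is what makes this kind of kill-and-flag policy possible at all.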
Comment 39 Mircea Kitsune 2021-11-25 15:35:58 UTC
(In reply to tagwerk19 from comment #38)

+1 on that idea. Dolphin actually has a file size limit for generating thumbnails; in Manjaro you need to manually remove it or most images won't get thumbnails at all. It would be more than logical to have something like this for Baloo, indicating a size limit past which a file's contents will not be indexed (only its name and location). Thanks for this suggestion.
Comment 40 Martin Steigerwald 2021-11-25 18:43:28 UTC
Wow, kudos for the new website design for Bugzilla.

It could also be a limit up to which files would be indexed, i.e. go for the first 100 KiB and ignore the rest, instead of just indexing the file name in such cases. I am not sure whether it is worth doing it this way. IMHO it depends on the type of file: for a lot of file formats, with larger files it would only make sense to index metadata, e.g. for video, sound or image files. I think and hope that Baloo is already doing this.

Other large files are archives like tarballs or ZIP files.
Comment 41 tagwerk19 2021-11-25 19:42:56 UTC
(In reply to Martin Steigerwald from comment #40)
> ... go for the
> first 100 KiB and ignore the rest, instead of just indexing the file name in
> such cases...
At the moment there's a 10 MByte limit for text or HTML:
    https://bugs.kde.org/show_bug.cgi?id=410680#c7
Personal preference would be that the first 10 MB is indexed and the rest ignored, but it seems that if the file is larger than (roughly) 10 MB, it's not indexed at all.
Comment 42 Mircea Kitsune 2021-11-25 20:16:42 UTC
(In reply to tagwerk19 from comment #41)

Yeah 100MB sounds like a good default limit for all files. I'd make it an option in the search settings of course, users should be able to customize this based on the amount of files they have and the power of their computer.
Comment 43 Adam Fontenot 2022-03-23 08:39:31 UTC
(In reply to tagwerk19 from comment #38)
>     It would make sense to have time/memory limits for such actions (and
> flag the file as
>     "failed" if the extraction exceeds them).

Was thinking about this and similar IO problems, and decided to have a look at how Gnome's "tracker" is handling things these days. Going to document my findings here in the hope it's useful as inspiration for how we might handle similar problems. I think it's an important point of comparison for Baloo.

I have mostly positive things to say, although Tracker also has some flaws (it didn't pick up my XDG Documents folder by default, it didn't index the contents of files with text/plain mimetypes that don't have file extensions, and it uses a large amount of CPU while searching in Nautilus).

 * I enabled Tracker to index my home folder (with content indexing) and it uses 474 MB on my $HOME. I've completely disabled content indexing for Baloo, but it's somehow using 1.4 GB. Suffice it to say that Baloo is weirdly inefficient. (ContentIndexingDB is empty, so it's not old content indexes.) More research is needed here; any suggestions appreciated.

 * Unlike Baloo, Tracker does not hang when given pathological files. (See the link in tagwerk19's comment for an example.) I get a very sensible "Crash/hang handling file" message in the log for this file and it's otherwise ignored. Among other checks, they appear to kill the process if the content indexer takes more than 30 seconds on a file, which seems quite reasonable: https://gitlab.gnome.org/GNOME/tracker-miners/-/blob/master/src/tracker-extract/tracker-extract.c

 * They have some cool features around full text search including unaccenting and case folding, and use SPARQL for queries: https://wiki.gnome.org/Projects/Tracker/Features I haven't seen enough documentation from Baloo to know how we stack up there.

 * Tracker and Baloo both blacklist source code files by default, among several other types. Baloo doesn't expose this to the user in the UI, which I think might surprise some users who expect more configurability from KDE.

 * Tracker seems not to be very configurable. There's a bit of under-the-hood adjustment possible, but mostly the focus seems to be on having good heuristics out of the box. I don't think we could trivially swap Tracker in for Baloo and have everything we need work. We'll need to keep improving Baloo. :-)

This comment might be better off on the Wiki somewhere, but it seems pretty underutilized and I'm not sure where I'd put it or if anyone would even read it there.