Bug 438074 - baloo reindexing files on every start
Summary: baloo reindexing files on every start
Status: REPORTED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: general
Version: 5.82.0
Platform: Neon Linux
Importance: NOR minor
Target Milestone: ---
Assignee: baloo-bugs-null
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-06-04 08:31 UTC by Martin Tlustos
Modified: 2024-01-23 23:24 UTC
CC List: 8 users

See Also:
Latest Commit:
Version Fixed In:


Description Martin Tlustos 2021-06-04 08:31:25 UTC
SUMMARY
baloo reindexes files on every start, causing high disk usage and slowing down the system

STEPS TO REPRODUCE
1. Start KDE
2. Open Konsole and run "balooctl monitor"

OBSERVED RESULT
baloo reindexes files that have been indexed about a hundred times already, causing high disk load, slowing down the system for a few minutes at the beginning.

EXPECTED RESULT
the system should be highly responsive from the beginning

Operating System: KDE neon 5.21
KDE Plasma Version: 5.21.5
KDE Frameworks Version: 5.82.0
Qt Version: 5.15.3
Kernel Version: 5.4.0-74-generic
OS Type: 64-bit
Graphics Platform: Wayland
Processors: 4 × Intel® Core™ i5-7200U CPU @ 2.50GHz
Memory: 7.6 GiB of RAM
Graphics Processor: Mesa Intel® HD Graphics 620
Comment 1 tagwerk19 2021-06-04 12:33:12 UTC
You say "Neon" rather than, say, "openSuse" however it might be worth looking at:
    https://bugs.kde.org/show_bug.cgi?id=402154#c12

The issue in that case is that baloo expects the device number / inode for files to be stable (not change every reboot). With certain filesystems/distributions the devno can change; with remote filesystems it seems that the inode can also change.

Try the test with "stat" and "balooshow -x" and see what you see.
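
A rough sketch of that test, for anyone who wants to check more than one file: it records the device number and inode that stat reports and compares them again after a reboot. This is not part of Baloo; the helper names and the snapshot file are made up for illustration.

```python
import json
import os


def file_identity(path):
    """Return the device number and inode that stat reports for path."""
    st = os.stat(path)
    return {"dev": st.st_dev, "ino": st.st_ino}


def save_snapshot(paths, snapshot_file):
    """Record the identity of each path so it can be compared after a reboot."""
    with open(snapshot_file, "w", encoding="utf-8") as fh:
        json.dump({p: file_identity(p) for p in paths}, fh, indent=2)


def changed_since_snapshot(snapshot_file):
    """Return the paths whose device number or inode differ from the snapshot."""
    with open(snapshot_file, encoding="utf-8") as fh:
        old = json.load(fh)
    return [p for p, ident in old.items() if file_identity(p) != ident]
```

Run save_snapshot() before the reboot and changed_since_snapshot() after it; an empty list means the device number and inode stayed the same for every recorded file.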
Comment 2 Martin Tlustos 2021-06-07 11:56:44 UTC
Well, the first thing I tried was exiting and reinitiating my normal user account session, and baloo started reindexing. Again...
Anyway, I will try the test you suggested to find whether the device number has changed.
Comment 3 Martin Tlustos 2021-06-07 12:08:31 UTC
Ok., did the test as suggested. No changes in device number or inode. 
I did 
stat testfile.txt >statinfo.txt and 
balooshow -x testfile.txt >balooinfo.txt,

restarted
did stat testfile.txt > statinfo-new.txt 
and 
balooshow -x testfile.txt > balooinfo-new.txt

And compared statinfo.txt with statinfo-new.txt and balooinfo.txt with balooinfo-new.txt. No differences.
Comment 4 tagwerk19 2021-06-07 21:28:19 UTC
(In reply to Martin Tlustos from comment #3)
> Ok., did the test as suggested. No changes in device number or inode. 
It was a bit of a guess - but it would have explained the reindexing.

A possible thing to look at is whether the modified and changed times that balooshow gives for your testfile.txt (the Mtime and Ctime), match those that stat gives (Modify and Change times).
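
A minimal sketch of that comparison. It assumes "balooshow -x" prints lines of the form "Mtime: 1623231112 2021-06-09T11:31:52" (epoch seconds first), which is the layout that shows up later in this thread; the function names are hypothetical.

```python
import os
import re

# Assumed layout: "Mtime: <epoch seconds> <ISO timestamp>", same for Ctime.
_TIME_LINE = re.compile(r"^\s*(Mtime|Ctime):\s+(-?\d+)\b", re.MULTILINE)


def baloo_times(balooshow_output):
    """Extract the Mtime/Ctime epoch seconds from saved `balooshow -x` text."""
    return {name.lower(): int(value)
            for name, value in _TIME_LINE.findall(balooshow_output)}


def time_mismatches(path, balooshow_output):
    """Return {field: (indexed, on_disk)} for every time that does not match."""
    st = os.stat(path)
    on_disk = {"mtime": int(st.st_mtime), "ctime": int(st.st_ctime)}
    indexed = baloo_times(balooshow_output)
    return {key: (indexed.get(key), on_disk[key])
            for key in on_disk if indexed.get(key) != on_disk[key]}
```

An empty result means the index and the filesystem agree on both times, to the second.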

I'm guessing you've purged the database and started "from zero". Does the same thing happen if you create a new user?

Could there be any confusion caused by symbolic links within your $HOME?

Have to say, I've not seen this issue so there's some guesswork involved here...
Comment 5 Martin Tlustos 2021-06-08 09:34:35 UTC
mtime and ctime match in both files.
I did purge the database some time ago for that very reason, but it didn't help.
I do have a few symlinks, but they are not in any of the folders baloo checks (only in /.trash, /.var/app, /.local, /.config and /.mozilla).
It's always the same folders that are checked, so it could be a problem with those folders, but I didn't see any problems with folder settings or permissions.

Btw, the same thing happens if I run balooctl check. baloo checks around 750 files (some of which haven't been changed in years), and baloo_file_extractor writes up to 20MB/s for about two minutes.
Comment 6 tagwerk19 2021-06-08 13:13:57 UTC
So...

    It's not all your files, just some folders.

    Baloo has accurate modification/change times - these haven't changed,
    and the device number/inode also hasn't changed, but a "baloo check"
    still thinks that the files need (re)indexing.

More random thoughts...

    When you did the test with "stat testfile.txt" it was in one of
    these "odd" folders?

    The folders are not encrypted folders or related to snaps (which might
    be mounted in a different way)

    It's not a particular filetype that is giving trouble? (Dunno if baloo
    worries if a file "was" plain text and then "seems to be" something else)
    
    Do you get anything strange if you do a
       baloosearch ...oneofthefunnyfiles...
    you get a single or multiple hits?

After that I think the next boring, pedestrian, troubleshooting step is to copy some of the troublesome folders (copying with all metadata "cp -a ...") and see if you copy the problem as well. Perhaps try copying to a new user.

I will also say, thank you for your patience...
Comment 7 Martin Tlustos 2021-06-09 14:46:36 UTC
Ok, some more testing...

Copying the content of an affected folder to a new folder doesn't help, the new folder is reindexed as well.
stat and balooshow don't see any differences in files in those folders after reboots (I only checked one. The original test file was on a different account, but I redid it in my own account as well).

A new test file created in one of these folders is NOT reindexed, so this indicates that it actually is a file problem.

This is just a normal home folder on a separate HDD with ext4 formatting. The OS is on a different SSD. No snap, no encryption. Different file types are affected, like png, jpg, pdf, doc, odt...
Comment 8 tagwerk19 2021-06-09 20:28:05 UTC
(In reply to Martin Tlustos from comment #7)
> Copying the content of a afflicted folder to a new folder doesn't help, the
> new folder is reindexed as well.
So you've copied the problem - and it seems to be "a file problem" :-)

I know that baloo_file_extractor deals with batches of files, 40 at a time. I don't know if it commits "what it's learned" after each file or at the end of the batch but I can imagine that it's committed at the end of the batch.

If there's a (bad enough) failure indexing one of the files, it may be that no "content index" information for the batch is written to the index. The indexing of these files is incomplete so baloo, after a "balooctl check", tries again.
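
To make the guess concrete, here is a toy model of "commit at the end of the batch". It is not Baloo's code; the names and the failure handling are assumptions that only illustrate how one bad file could leave a whole batch of 40 unindexed.

```python
BATCH_SIZE = 40  # batch size mentioned above


def index_in_batches(files, extract, batch_size=BATCH_SIZE):
    """Toy model: results are committed per batch; a failure discards the batch."""
    committed = {}
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        pending = {}
        try:
            for path in batch:
                pending[path] = extract(path)
        except Exception:
            # Assumed behaviour: nothing from this batch reaches the index,
            # so all of its files would be picked up again by the next check.
            continue
        committed.update(pending)
    return committed
```

With 100 files and one file that always fails, this model commits only 60 of them, and the same 40 come back on every run.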

It should be that baloo recognises such failures, "balooctl status" does give a count of "Files failed to index". Maybe that's not working as it should.

Anyway, it might be possible to see some evidence...

For one of the files that are repeatedly reindexed, have a look with "balooshow -x ..." and what's listed under the "Internal Info":

    If this is very basic ("Terms", "File Name Terms", "XAttr Terms"),
    these are what "baloo_file" writes during its initial scan.

    If you see a longer list of "Terms", words that appear within the
    document, or possibly a "Width:" and "Height:" for an image (could
    be loads of different fields for an image file), then this is
    information collected and written by "baloo_file_extractor".

So compare what "balooshow -x .. " gives for your repeatedly reindexed files - and compare that to what "balooshow -x" gives for files that have been indexed OK.

My guess is if you see only "basic information" then something's failed and the data's not been committed. After that it might be a question of trying to "narrow down" the file/files that are causing the problem.
Comment 9 Martin Tlustos 2021-06-10 14:28:25 UTC
Ok, here's what I did: I ran balooshow -x * > balooshowinfo.txt in one of the affected folders where there aren't too many files (but more than 40, so one batch at least) and checked the contents.
Sample content for an image:
11a076400000802 2050 18483044 samplefilename.jpg [/home/whatever/whatever/whatever/samplefilename.jpg]
	Mtime: 0 1970-01-01T01:00:00
	Ctime: 1623231112 2021-06-09T11:31:52
	Cached properties:
		Width: 2459
		Height: 3531

Internal Info
Terms: Mimage Mjpeg T4 X26-2459 X27-3531 
File Name Terms: Fjpg samplefilename 
XAttr Terms: 
height: 3531
width: 2459
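
In the sample entry above, the three leading numbers fit together: the hexadecimal ID 11a076400000802 splits into a low 32-bit half equal to 2050 and a high half equal to 18483044, which look like the device number and inode. The sketch below is inferred from these sample values alone, not from Baloo's source, and the function names are made up.

```python
def split_document_id(doc_id_hex):
    """Split a hex document ID into (device, inode), assuming a 32/32-bit layout."""
    doc_id = int(doc_id_hex, 16)
    device = doc_id & 0xFFFFFFFF  # low 32 bits (assumed: device number)
    inode = doc_id >> 32          # remaining high bits (assumed: inode)
    return device, inode


def make_document_id(device, inode):
    """Inverse of split_document_id, under the same assumed layout."""
    return format((inode << 32) | device, "x")


print(split_document_id("11a076400000802"))  # (2050, 18483044)
```

If that layout holds, a changed device number or inode gives the file a different ID, which would explain why the index treats it as a new file.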

Some had more info, e.g. tags.
The same goes for PNGs.

Some of the PDFs had similar index entries, some had text extracted; I suspect those without extracted text were image PDFs.

All in all it looked pretty normal to me, only there were a few entries with this at the end:
"no index information found". Could those be the culprits?
Comment 10 Martin Tlustos 2021-06-10 14:31:43 UTC
Ah, sorry, no, these are all backup files that are excluded by default...
Comment 11 tagwerk19 2021-06-10 20:23:13 UTC
(In reply to Martin Tlustos from comment #9)
> All in all it looked pretty normal to me...
So I was overly optimistic :-/

However, on the basis that you'd copied the folder and found you'd copied the problem I still suspect that one or more of the files is tripping up the indexing. The question is how to find it.

Maybe see if there's anything in the logs (with journalctl and look for "baloo_fil" entries)? Manually indexing files and seeing if there are any errors (with "balooctl index ...")?

Not sure what more to suggest. Sorry.
Comment 12 Martin Tlustos 2021-06-11 08:02:10 UTC
Ok, did balooctl index * in one of the folders... Some of the files are skipped, because they are already indexed, some are indexed (the same ones that show up when doing balooctl check). No additional errors are shown.
Journald has two warnings and four errors:
11.06.21 08:59	baloo_file_extractor	"Error: Unknown font tag 'ZaDb'"
11.06.21 08:59	baloo_file_extractor	"Error: Unknown font tag 'ZaDb'"
11.06.21 08:59	baloo_file_extractor	"Error: Unknown font tag 'ZaDb'"
11.06.21 08:59	baloo_file_extractor	"Error: Unknown font tag 'ZaDb'"
11.06.21 09:02	baloo_file_extractor	Invalid document structure (meta.xml is missing)
11.06.21 09:02	baloo_file_extractor	Invalid document structure (meta.xml is missing)
The two "meta.xml" messages came from one specific folder where two faulty odt documents were found. After fixing those, the "meta.xml is missing" messages were gone, but there were still files in that folder that were reindexed, so these were not the cause of the problem.

But I looked through a couple of folders now checking with "balooctl index * | grep 'different file types'" and found that .odg, .odp, .zip, .sem, .kra, .ppt are always reindexed, so it is a file-type problem.
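
A small helper for this kind of tally, assuming the names of the repeatedly reindexed files have been collected into a list by hand. It does not parse any balooctl output, and the function name is made up.

```python
from collections import Counter
from pathlib import PurePath


def tally_extensions(paths):
    """Count files per lower-cased extension; files without one go under '(none)'."""
    return Counter(PurePath(p).suffix.lower() or "(none)" for p in paths)
```

Calling tally_extensions(paths).most_common() lists the most frequently reindexed file types first.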

.doc files are sometimes indexed, sometimes not. If I open a doc file that wasn't indexed before in LibreOffice and resave it, it is indexed successfully.

So my impression is that some of the file extractors don't work as expected?
Comment 13 Martin Tlustos 2021-06-11 08:03:02 UTC
One more thing: "balooctl check" will show one .odp file as being reindexed, while "balooctl index" on the same file says it is already indexed. Strange...
Comment 14 Martin Tlustos 2021-06-11 08:13:07 UTC
A new error showed up:
11.06.21 09:58	dolphin	kf.kio.widgets: Plugin "baloofilepropertiesplugin" is using the deprecated loading style. Please port it to JSON loading.
11.06.21 09:58	dolphin	kf.kio.widgets: Plugin "baloofilepropertiesplugin" is using the deprecated loading style. Please port it to JSON loading.

Maybe that is the reason for some of the filetypes not being indexed?
Comment 15 tagwerk19 2021-06-11 15:51:03 UTC
(In reply to Martin Tlustos from comment #12)
> But I looked through a couple of folders now checking with "balooctl index *
> | grep 'different file types'" and found that .odg, .odp, .zip, .sem, .kra,
> .ppt are always reindexed, so it is a file-type problem.
> 
> .doc files are sometimes indexed, sometimes not. If I open a doc file that
> wasn't indexed before in libreoffice and resave it, it is indexed
> successfully.
> 
> So my impression is that some of the file extractors don't work as expected?
It sounds like it.

If I look in 
    /usr/share/mime/packages/freedesktop.org.xml
It seems that Qt flags .odg and .odp files as 'zipped' files. Nothing specific for the other filetypes - but it might be worth seeing if one or the other starts with
    PK\003\004 
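
A quick way to check for that signature is to read the first four bytes of a file and compare them. The function name below is hypothetical.

```python
ZIP_MAGIC = b"PK\x03\x04"  # the same bytes as the octal escapes PK\003\004


def starts_with_zip_magic(path):
    """Return True if the file begins with the ZIP local file header signature."""
    with open(path, "rb") as fh:
        return fh.read(len(ZIP_MAGIC)) == ZIP_MAGIC
```

A True result only says the file starts like a zip container; it says nothing about how Baloo classifies it.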

I notice there's a new bug, Bug 438455, mentioning .doc (as contrasted to .docx) files.
Comment 16 skierpage 2022-06-29 03:48:04 UTC
In bug 456108 I have a similar problem with baloo constantly reindexing 12 of my files, but all of them have a modification time of Jan 1 1970 (0 seconds in Unix epoch), or earlier than that. The reporter here says
> mtime and ctime match in both files
implying this is a different problem, so I filed a separate bug.
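
For anyone who wants to rule out this other case, here is a sketch that walks a folder and lists files whose modification time is at or before the Unix epoch. The function name is made up and it only reads what lstat reports.

```python
import os


def files_with_epoch_or_earlier_mtime(root):
    """Return files under root whose mtime is 0 seconds since the epoch or less."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime <= 0:
                    hits.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(hits)
```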

> .doc files are sometimes indexed, sometimes not.
> ...
> So my impression is that some of the file extractors don't work as expected?
I'm documenting every baloo limitation I come across at https://community.kde.org/Baloo#Indexing_limitations
Comment 17 Frank Steinmetzger 2024-01-04 19:41:50 UTC
I’ve also been observing this problem for quite some time now. Thankfully, it does not slow down my PC, even though it is 9½ years old. But I see it in the system monitor applets in my panel that there is constant read I/O of 150 MB/s for at least half an hour after login.

I ran balooshow -x on a file before and after the last two reboots and the output was identical.
I don’t run btrfs. My system uses ext4 on LVM on LUKS. The output of lsblk also remained unchanged across boots.
I’m not sure what else to check. I issued balooctl suspend a few minutes ago, but the indexer still chugs along. Eventually I killed the extractor with killall.

```
~ LC_ALL=C time balooctl status
Baloo File Indexer is running
Indexer state: Suspended
Total files indexed: 176,898
Files waiting for content indexing: 74,305
Files failed to index: 0
Current size of index is 18.22 GiB

real    0m49,878s
user    0m0,013s
sys     0m0,004s
```

Addendum:
Reading this thread, I found out about `balooctl monitor` and started it, then resumed the indexing. The monitor first printed some email files and there was minor system load. Then nothing happened for a few seconds, after which the I/O load started again, but the monitor has not shown any new filenames since.

Operating System: Arch Linux 
KDE Plasma Version: 5.27.10
KDE Frameworks Version: 5.113.0
Qt Version: 5.15.11
Kernel Version: 6.6.9-arch1-1 (64-bit)
Graphics Platform: X11
Processors: 4 × Intel® Core™ i5-4590 CPU @ 3.30GHz
Memory: 30.8 GiB of RAM
Comment 18 tagwerk19 2024-01-04 20:21:55 UTC
(In reply to Frank Steinmetzger from comment #17)
> Indexer state: Suspended
I wonder what does that nowadays... There used to be a "balooctl suspend" but I think that's been removed.

> ... constant read I/O of 150 MB/s for at least half an hour after login ...
and

> Current size of index is 18.22 GiB 
Gut feeling here is that the systemd limits on RAM are cutting in on you, have a look at what:
    
    systemctl --user status kde-baloo

says. The unit file limits Baloo's RAM use to 512 MB. When Baloo hits that limit it will drop clean pages from its cache so it can load others. You see Baloo slow down and spend a *load* of time and energy reading.

My personal view is that 512 MB is somewhat strict, 50% works for me (together with stopping Baloo using swap)

    MemoryHigh=50%
    MemorySwapMax=0
        
Watch out for indexing email files, particularly those encoded or with attachments. For .eml files see Bug 460882; .mbox files can be absolutely massive.
Comment 19 Frank Steinmetzger 2024-01-23 22:16:51 UTC
(In reply to tagwerk19 from comment #18)

> Watch out for indexing email files, particularly those encoded or with
> attachments. For .eml files see Bug 460882; .mbox files can be absolutely
> massive.

It’s all maildir, but with over 100k files. ^^
At one point I had had enough and figured there must be something wrong with my database. So I moved it away and reindexed everything. Seeing that it indexed a lot more files, I think that the database had been in a very old state for quite some time and baloo had been trying to update it ever since.

There is one problem though: the write volume is very bad. The final database file is maybe around 16 GB (the defunct database was 18 GB), but the write volume during indexing was a multiple of that, at least 100 GB. So I symlinked ~/.local/share/baloo to a ramdisk.
Comment 20 tagwerk19 2024-01-23 23:24:08 UTC
(In reply to Frank Steinmetzger from comment #19)
> It’s all maildir, but with over 100k files. ^^
A hurried google of "maildir format" gives me that it holds one message per file, with the format like .eml. At least kmimetypefinder gives "message/rfc822". I think Bug 460882 would still apply and you could be writing loads of "random" strings (from encoded attachments, whatever) and repeatedly rewriting the entries for "common terms".
        
If each of your messages has a "Subject" line, a search for "Subject" will retrieve them all. The database record for "Subject" will have been rewritten, with a commit, after each batch of files indexed. That will be a lot of rewriting. Baloo knows this is an issue and batches up and indexes 40 files at a time to cut down on the amount of rewriting required. I suppose, for loads of small files, it could batch up more...
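
A back-of-the-envelope model of that rewriting. It is a simplification and not Baloo's actual storage logic: it assumes the record for a term found in every file is rewritten in full at each commit, and that each file takes 8 bytes in that record. Both figures and the function name are assumptions.

```python
def bytes_rewritten_for_common_term(total_files, batch_size=40, bytes_per_entry=8):
    """Estimate the bytes written for one term that occurs in every file."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    indexed = 0
    rewritten = 0
    while indexed < total_files:
        indexed = min(indexed + batch_size, total_files)
        # Assumed: after each commit the record lists every file indexed so far.
        rewritten += indexed * bytes_per_entry
    return rewritten


print(bytes_rewritten_for_common_term(100_000))                  # 1000400000
print(bytes_rewritten_for_common_term(100_000, batch_size=400))  # 100400000
```

Under these assumptions, 100,000 messages in batches of 40 cause about 1 GB of writes for a single common term whose final record is only 0.8 MB, while batches of 400 cut that to about 0.1 GB.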