Bug 338402

Summary: File system cache is inefficient: too many files per directory
Product: [Frameworks and Libraries] Akonadi Reporter: roucaries.bastien+bug
Component: server    Assignee: kdepim bugs <kdepim-bugs>
Status: RESOLVED FIXED    
Severity: major CC: dvratil, ehakanduran, martin.steigerwald, Martin
Priority: NOR    
Version: 1.13.0   
Target Milestone: ---   
Platform: Debian testing   
OS: Linux   
Latest Commit:    Version Fixed In: 15.12.0

Description roucaries.bastien+bug 2014-08-20 11:49:12 UTC
from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=757844

~/.local/share/akonadi/file_db_data/ is completely inefficient and renders the whole system slow.

A directory should not store more than about a thousand entries. With my multi-gigabyte mailbox I have more than 900000 entries in this directory.

See http://etutorials.org/Server+Administration/Squid.+The+definitive+guide/Chapter+7.+Disk+Cache+Basics/7.1+The+cache_dir+Directive/ 
paragraph 7.1.4, L1 and L2, for how to do it.

The thumbnail cache, git, and squid all use this technique.

I am tempted to raise this to RC severity because some filesystems do not support this insane number of files per directory (see for instance the ext3 limit here:
https://www.mail-archive.com/cwelug@googlegroups.com/msg01944.htm)

Bastien
Comment 1 roucaries.bastien+bug 2014-08-20 11:50:13 UTC
It is really easy to see that jbd kernel thread usage is really high due to this problem.
Comment 2 Daniel Vrátil 2014-08-20 12:44:14 UTC
The large folder works very efficiently on common filesystems (ext, btrfs, ...), since we are just making use of the filesystem's internal hash-tree implementation. We always know the full name of the file we want, so we never list the contents of the directory; we just ask directly for a specific file, which filesystems are generally very efficient at.
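To illustrate the access pattern, here is a minimal sketch, not the actual Akonadi code; the "<id>_r<revision>" file naming is an assumption based on the directory listings later in this report:

    // Hypothetical sketch: the cache is only ever accessed by full file name,
    // never by listing the directory, so hash-tree filesystems can resolve the
    // lookup directly without scanning the huge folder.
    #include <QFile>
    #include <QString>
    #include <QByteArray>

    QByteArray readExternalPayload(const QString &cacheDir, qint64 partId, int revision)
    {
        // Build the exact name and open it directly; no directory listing involved.
        QFile file(cacheDir + QStringLiteral("/%1_r%2").arg(partId).arg(revision));
        if (!file.open(QIODevice::ReadOnly)) {
            return QByteArray();
        }
        return file.readAll();
    }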

However, we already had a similar report for a remote FS implementation that had a hard limit on the maximum number of files per directory, so I'm inclined to implement one or two levels of folder indirection.

However, as the most common filesystems in use on desktops work just fine, I'm not assigning this top priority now. It's definitely something for Frameworks, though.

To work around the issue, you can configure a higher threshold for storing payloads in external files, i.e. Akonadi will store more data in the database and put only really large payloads on the filesystem. To do so, open ~/.config/akonadi/akonadiserverrc and, in the [General] section, add

SizeThreshold=16384

This will store only payloads larger than 16 KB externally. The default is 4 KB.
Comment 3 roucaries.bastien+bug 2014-08-21 11:47:31 UTC
I am setting the bug back to major.

This morning I killed the find process started by updatedb, because find tries to list this directory and the I/O rate goes really high. In this case the hashing does not help...

I have just changed the permissions of the directory to something that keeps updatedb from browsing it, but it is really insane.

Bastien
Comment 4 Martin Steigerwald 2015-01-21 09:07:53 UTC
Daniel, no. The argument that filesystems can handle it may hold with a BTRFS RAID 1 on two SSDs, but the main issue is that it caches that much at all. My bet is:

The usual user has no more than about several thousand mails in short-term reference. Even a photo reader wouldn't have much more in reference at the same time, I bet.

So I fail to see the benefit of caching hundreds of thousands of mails in there. Are there any cache hit/miss statistics? I bet the statistics would be abysmal.

So why does it cache that much *at all*?

This is a huge "inefficient" staring you in the face. And no argument whatsoever will make it go away. No amount of denial that this is just insane will make it less so. That's at least my opinion on it.

At work with the huge IMAP account – I am on the laptop in the home office, so I can't check the workstation, but it was exactly that workstation where I moved the Akonadi data to a local filesystem because the default limit of a NetApp Filer storage appliance for the maximum number of files in a directory was exceeded; I was the one who reported that bug:

Before akonadictl fsck I had this:

ms@merkaba:~/.local/share/akonadi> ls -ld file_db_data
drwxr-xr-x 1 ms teamix 109247040 Jan 21 09:23 file_db_data

ms@merkaba:~/.local/share/akonadi> find file_db_data | wc -l 
650280

After it I have:

ms@merkaba:~/.local/share/akonadi> ls -ld file_db_data
drwxr-xr-x 1 ms teamix 109247040 Jan 21 09:32 file_db_data

ms@merkaba:~/.local/share/akonadi#130> find file_db_data | wc -l
524030

So at least the number of files went down a bit. Even on my POP3 setup I had lots of files in there, and even after fsck I still had 4600 mails in there. I have local maildirs for a reason, I'd say, and they are fast to access.

But for the work case, 524030 cached mails: Dan, can you explain to me the benefit of that? Do you really think I care about 500000+ mails at once? It's an archive of mailing lists; yes, I like full-text search there, but Baloo ideally needs to see each mail just once, so why cache those at all? Cache the recent mails of the few folders the user accesses most often and be done with it, I'd say.

Even despite this insane amount of caching I still get situations where KMail does not respond at all anymore. Granted, this is with a huge account and an Exchange cluster whose IMAP implementation is abysmal, but still: I have 500000+ mails cached locally, *without* setting the IMAP account to disconnected, so why am I not even able to see them? And why does it get into a situation where I have to restart Akonadi and/or KMail to be able to actually *use* the mail account again and have KMail do something useful? Those are real issues with Akonadi, and not only I have them.

So for me it's just a *huge* waste of resources without *any* obvious benefit. The issue with Akonadi is not a lack of caching. The issues users still have with Akonadi are elsewhere.

And caching that much can be an issue for BTRFS, even on said dual-SSD RAID 1:

ms@merkaba:~/.local/share/akonadi#130> /usr/bin/time -v du -sh file_db_data
7,0G    file_db_data
        Command being timed: "du -sh file_db_data"
        User time (seconds): 2.17
        System time (seconds): 99.07
        Percent of CPU this job got: 31%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 5:17.95
                                                                           ^^^^^^^

This is no joke: the du -sh took more than 5 minutes (in words: five minutes) on a BTRFS dual-SSD RAID 1!

Granted, the find was way faster (a few seconds), and I bet Akonadi doesn't count the space in the directory… but still, there is some overhead involved.

        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 33240
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 8076
        Voluntary context switches: 663116

That's more than 600000 context switches.

        Involuntary context switches: 17704
        Swaps: 0
        File system inputs: 31424208

That's 31 million file system read requests!

        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Boy, 7 GiB! It cached almost my complete IMAP account here, without me asking it to.

Well heck, with that I am inclined to actually make it an offline IMAP account, because honestly, where is the difference? And maybe it would then work with that Exchange IMAP implementation, which lacks in both performance and features (not only with KMail, but also with Trojita). I think I will try this, because it has already downloaded almost everything; if it has downloaded 7 GiB, I don't care about the remaining few GiB it may not have downloaded yet. So I will just set this flag now. I wonder how all this can work for Munich, by the way.

Please, have a way to *limit* the caching to a *sane* value. Or limit it by default.

That's not sane. Prove me wrong!

I'd say:

1) On regular and fast IMAP, just cache several thousand mails.

2) On Exchange, do that as well and let users switch to disconnected IMAP if Exchange just can't keep up. Heck, maybe all this caching even contributes to making Exchange slow, because KMail at one point downloaded all these mails. Downloading them once I understand, for the useful desktop search, but then why bother caching them all? Can the downloads for Baloo just be excluded from caching altogether? I'd only cache user requests; there, latency is important. Ideally Baloo should see each mail once, so why cache?

Compared to that file cache, the size of the database seems pretty small, but still:

ms@merkaba:~/.local/share/akonadi/db_data/akonadi> du -sh * | sort -rh | head -10
2,6G    parttable.ibd
261M    pimitemtable.ibd
13M     pimitemflagrelation.ibd
264K    collectionattributetable.ibd
200K    collectiontable.ibd
136K    tagtable.ibd
120K    tagtypetable.ibd
120K    tagremoteidresourcerelationtable.ibd
120K    tagattributetable.ibd
120K    resourcetable.ibd

I think I will vacuum it as well.

Akonadi didn't even respond to the first vacuum request; I had to restart Akonadi and issue the request again to make it actually do it.
Comment 5 Martin Steigerwald 2015-01-21 09:26:05 UTC
Dan, I agree that lookup of an individual file will be fast with Ext4, BTRFS, XFS, yet it causes other issues:

- updatedb
- backup: my rsync-based backup to an eSATA 2 TB hard disk takes way more than one hour now. It was much quicker at some point. There may also be other reasons, but my bet is that the large file_db_data directory contributes to that. It would likely be way faster with btrfs send/receive, but there are a lot of backup solutions out there that may have issues with a folder like this.
- any disk usage calculator like Filelight will have delays on it (I showed the 5-minute time for du -sch; admittedly, this has ecryptfs in between, which visibly seems to make the performance worse)
- so add ecryptfs to the list; I did a du -sch over the unencrypted version and it is way faster than the 5 minutes
- or maybe NFS based setups as well
- and storage appliances with limits

It's not just Akonadi living alone with a super-fast PCIe M.2 SSD.

And what about Akonadi on Windows? I bet NTFS may cope, but do you know? Or Mac OS X with HFS+, how about that?


So arguing that modern Linux filesystems can handle these numbers of files is like arguing:

Hey, we have 16 GiB of RAM available, so why not reserve and use it all, even if we do not need it?

So what is the rationale for caching that much? What is the actual benefit it provides? I fail to see it.

I will now set that account to offline caching on my laptop. Maybe then I will see some use. And as it already insists on keeping 7 GiB of my IMAP account stored locally, I don't mind if it also downloads the rest of it.

According to Filelight, file_db_data is the largest folder! Not even Baloo's email index is that large; it is just 5.3 GiB. And that one actually provides a benefit, a huge one, and in far fewer files.

That said, I have now found that Icedove, which I used at some point, also stores 4.2 GiB locally. But I instructed it to keep mails for offline usage, if I remember correctly.
Comment 6 Martin Steigerwald 2015-01-21 09:27:34 UTC
This one may be related:

Bug #341884 - dozens of duplicate mails in ~/.local/share/akonadi/file_db_data 

I will check whether I have dupes in there.
Comment 7 Martin Steigerwald 2015-01-21 09:43:32 UTC
There we go for the bug I reported about this:

Bug 332013 - NFS with NetApp FAS: please split payload files in file_db_data into several directories to avoid reaching maxdirsize limit on Ontap / WAFL filesystem

I'd say: if Akonadi limited itself to a few thousand files, I would not bother with introducing sublevel directories, but if it will cache 100000+ files in the future as well, I'd introduce them.
Comment 8 Martin Steigerwald 2015-01-21 10:12:11 UTC
Now, enabling disconnected IMAP is really quite useful with Exchange. It actually seems to make use of what is in file_db_data now.

So I think there are two needs:

1) Crappy IMAP or network: use offline IMAP and *cache* everything, or at least the last 30 days or so.

2) Good IMAP, fast network: cache less. Much less.

Icedove allows configuring caching in two ways:

1) The number of days of mail it keeps.

2) The maximum size of message it will try to download.
Comment 9 Martin Steigerwald 2015-01-23 14:38:34 UTC
With the SizeThreshold=32768 change I get a nice improvement for my work IMAP account on the laptop (the one I set to download all mails for offline use).¹

Before:

ms@merkaba:~/.local/share/akonadi> du -sch db_data/akonadi/* | sort -rh | head -10
2,8G    insgesamt
2,6G    db_data/akonadi/parttable.ibd
245M    db_data/akonadi/pimitemtable.ibd
13M     db_data/akonadi/pimitemflagrelation.ibd
248K    db_data/akonadi/collectionattributetable.ibd
200K    db_data/akonadi/collectiontable.ibd
136K    db_data/akonadi/tagtable.ibd
120K    db_data/akonadi/tagtypetable.ibd
120K    db_data/akonadi/tagremoteidresourcerelationtable.ibd
120K    db_data/akonadi/tagattributetable.ibd

ms@merkaba:~/.local/share/akonadi> find file_db_data | wc -l
524917


ms@merkaba:~/.local/share/akonadi#130> /usr/bin/time -v du -sch file_db_data
7,0G    file_db_data
7,0G    insgesamt
        Command being timed: "du -sch file_db_data"
        User time (seconds): 2.14
        System time (seconds): 95.93
        Percent of CPU this job got: 29%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 5:35.47
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 33444
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 1
        Minor (reclaiming a frame) page faults: 8079
        Voluntary context switches: 667562
        Involuntary context switches: 60715
        Swaps: 0
        File system inputs: 31509216
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

After the change and akonadictl fsck:

ms@merkaba:~/.local/share/akonadi> find file_db_data | wc -l ;  du -sch db_data/akonadi/* | sort -rh | head -10
27
7,5G    insgesamt
7,3G    db_data/akonadi/parttable.ibd
245M    db_data/akonadi/pimitemtable.ibd
13M     db_data/akonadi/pimitemflagrelation.ibd
248K    db_data/akonadi/collectionattributetable.ibd
200K    db_data/akonadi/collectiontable.ibd
136K    db_data/akonadi/tagtable.ibd
120K    db_data/akonadi/tagtypetable.ibd
120K    db_data/akonadi/tagremoteidresourcerelationtable.ibd
120K    db_data/akonadi/tagattributetable.ibd

Yep, that's 27 files instead of >500000 (just a week after the last fsck, which had reduced the count from 650000+ to about 500000 files).

After a nice vacuuming I even get:

ms@merkaba:~/.local/share/akonadi> find file_db_data | wc -l ;  du -sch db_data/akonadi/* | sort -rh | head -10
27
6,5G    insgesamt
6,2G    db_data/akonadi/parttable.ibd
245M    db_data/akonadi/pimitemtable.ibd
13M     db_data/akonadi/pimitemflagrelation.ibd
248K    db_data/akonadi/collectionattributetable.ibd
200K    db_data/akonadi/collectiontable.ibd
136K    db_data/akonadi/tagtable.ibd
120K    db_data/akonadi/tagtypetable.ibd
120K    db_data/akonadi/tagremoteidresourcerelationtable.ibd
120K    db_data/akonadi/tagattributetable.ibd

merkaba:/home/ms/.local/share/akonadi> du -sh file_db_data 
6,5M    file_db_data


I definitely prefer this over the original situation.

The original situation was a 2.8 GiB DB + 7 GiB file_db_data.

Now it is a 6.5 GiB DB + 6.5 MiB file_db_data, and more than 524000 fewer files for rsync and our enterprise backup software to consider.

Let's see whether it brings a performance improvement, but for now I like this.


[1] Re: Possible akonadi problem?
From: Dmitry Smirnov
Date: Fri, 23 Jan 2015 07:17:13 +1100
Message-id: <10777191.dsSRuLrDof@debstor>

https://lists.debian.org/debian-kde/2015/01/msg00055.html

Thanks,
Martin
Comment 10 E. Hakan Duran 2015-01-29 03:41:21 UTC
Hi all,

I don't have a technical background, so please excuse me if I cannot express myself eloquently. A few days ago, I ran out of space in my /home directory and discovered that the ~/.local/share/akonadi/file_db_data directory contained 422,359 files, which totalled 75 GiB. I ran akonadictl fsck, which found several duplicates but didn't change the size of this directory significantly. I have 7 offline IMAP accounts, and I rarely receive emails that contain attachments >= 5 MB. I compressed and moved the directory to another HDD, and Akonadi recreated it during the next launch, immediately populating it with 2.7 GiB of data. That is when I decided to convert all my offline IMAP accounts to online ones and erased the folder again. Since then, 92.3 MB of data in 15,310 files has accumulated in that directory. Of note, I haven't modified the original size threshold of 4 KB. I hope this information helps the discussion.

Thanks,

Hakan
Comment 11 Martin Steigerwald 2015-03-12 13:12:20 UTC
This still seems to work for me:

Since my workaround with SizeThreshold=32768 on my private setup:

martin@merkaba:~/.local/share/akonadi> find file_db_data | wc -l
33
martin@merkaba:~/.local/share/akonadi> du -sh file_db_data 
4,4M    file_db_data
martin@merkaba:~/.local/share/akonadi> du -sh db_data
2,7G    db_data

martin@merkaba:~/.local/share/akonadi> du -sh file_db_data/* | sort -rh | head -10
896K    file_db_data/2815963_r0
584K    file_db_data/2630687_r0
452K    file_db_data/2630658_r0
448K    file_db_data/2630655_r0
368K    file_db_data/2488539_r0
164K    file_db_data/2758167_r0
152K    file_db_data/2488220_r0
120K    file_db_data/2488152_r0
104K    file_db_data/2565672_r0
100K    file_db_data/2943194_r0


And on my laptop work setup:

ms@merkaba:~/.local/share/akonadi> find file_db_data | wc -l
2442

ms@merkaba:~/.local/share/akonadi> du -sh file_db_data 
703M    file_db_data

ms@merkaba:~/.local/share/akonadi> du -sh db_data 
8,2G    db_data

Still quite a lot, but I set the account to offline caching and I think it cached mails with larger attachments:

ms@merkaba:~/.local/share/akonadi> du -sh file_db_data/* | sort -rh | head -10
11M     file_db_data/4735240_r1
11M     file_db_data/4735240_r0
11M     file_db_data/4735239_r1
11M     file_db_data/4735239_r0
9,4M    file_db_data/4734016_r1
9,4M    file_db_data/4734016_r0
9,3M    file_db_data/4731257_r0
8,6M    file_db_data/4731853_r2
8,6M    file_db_data/4731853_r1
8,6M    file_db_data/4731853_r0

And as to the database size: Outlook Web Access reports more than 20 GiB for the mail account, so an 8.2 GiB MySQL database is not all that much. I am in the process of archiving old mails from it with archivemail, and KMail / Akonadi hasn't picked up the lower mail counts in some folders yet, so it may still have things cached that are no longer on the server.
Comment 12 Daniel Vrátil 2015-06-29 20:59:45 UTC
Git commit 9c0dc6b3f0826d32eac310b2e7ecd858ca3df681 by Dan Vrátil.
Committed on 29/06/2015 at 20:45.
Pushed by dvratil into branch '1.13'.

Don't leak old external payload files

Actually delete old payload files after we increase the payload revision or
switch from external to internal payload. This caused ~/.local/share/akonadi/file_db_data
to grow insanely for all users, leaving them with many duplicated files (just with
different revisions).

It is recommended that users run akonadictl fsck to clean up the leaked payload
files.

Note that there won't be any more releases of Akonadi 1.13 (and this has been
fixed in master already), so I strongly recommend distributions to pick this
patch into their packaging.
Related: bug 341884

M  +14   -0    server/src/storage/partstreamer.cpp
M  +13   -11   server/tests/unittest/partstreamertest.cpp

http://commits.kde.org/akonadi/9c0dc6b3f0826d32eac310b2e7ecd858ca3df681
Comment 13 Daniel Vrátil 2015-06-30 20:19:27 UTC
Hi all,

so the problem of files endlessly piling up in file_db_data should finally be fixed. Now for the original bug report: file_db_data containing too many files.

So I was thinking about how to decrease the file count. Obviously the right solution is the levelled cache, as Bastien pointed out, and I think in our case one level should be enough. The filenames of the external payload parts are based on an incremental, database-unique ID, so I am planning to use the last two digits of the ID for the folder name to ensure an even distribution of files into the cache folders.

To have some numbers here: using modulo 100 for the folder name means 100 folders in file_db_data (file_db_data/00 - 99/). With 1 million emails in Akonadi (which is a performance baseline for me) and an average external-vs-internal cache ratio of about 1:2, we get about 500 000 external files in file_db_data, which means about 5 000 files per folder. That sounds like a reasonable number to me.

With 2 levels of indirection (using the third-to-last and second-to-last digits for L1 and the last digit for L2, so file_db_data/00-99/0-9/), we would have 100 folders with 10 folders in each, so 1000 folders in total. That would give us about 500 emails per folder with 1 000 000 emails in Akonadi. For the baseline of 1 000 000 emails, two levels of indirection seem unnecessary.
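A minimal sketch of how such a one-level mapping could look (a hypothetical helper; the actual implementation may differ):

    // Hypothetical sketch of the one-level levelled cache described above:
    // the last two digits of the incremental part ID pick one of 100 buckets,
    // e.g. part 2815963, revision 0 -> file_db_data/63/2815963_r0.
    #include <QString>

    QString levelledCachePath(const QString &baseDir, qint64 partId, int revision)
    {
        const int bucket = partId % 100;   // even distribution over 00..99
        return QStringLiteral("%1/%2/%3_r%4")
                .arg(baseDir)
                .arg(bucket, 2, 10, QLatin1Char('0'))   // zero-padded folder name
                .arg(partId)
                .arg(revision);
    }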

Regarding migration: we cannot do an automatic migration on start, as that would take too much time and too many resources, so only newly created files would be moved to the cache folders. It would be possible to implement the full migration as part of akonadictl fsck, though, so users could run it manually.

What are your opinions?
Comment 14 Martin Steigerwald 2015-07-01 07:46:54 UTC
(In reply to Daniel Vrátil from comment #13)
> Hi all,
> 
> so the problem with files just endlessly piling up in file_db_data should
> finally be fixed. Now for the original bug report: the file_db_data
> containing too many files.
> 
> So I was thinking about how to decrease the file count - obviously the right
> solution is the levelled cache as Bastien pointed out. I think in our case
> one level should be enough. The filenames of the external payload parts
> consist of incremental database unique ID, so I am planning to use the last
> two digits of the ID for the folder name to ensure even distribution of
> files into the cache folders.

I don't know how the IDs are computed, so I'll take your call on which part of them to use.

> To have some numbers here, using modulo 100 for folder name means 100
> folders in file_db_data (file_db_data/00 - 99/). With 1 million emails in
> Akonadi (which is a performance baseline for me) and with average ratio of
> external vs internal cache being cca 1:2 we get cca 500 000 external files
> in file_db_data, so that means cca 5 000 files per folder. That sounds like
> a reasonable number to me. 
>
> With 2 levels of indirection (using last 3rd and 2nd digit for L1 and last
> digit for L2 (so file_db_data/00-99/0-9/) we would have 100 folders with 10
> folders in each so 1000 folders in total. That would give us cca 500 emails
> per folder with 1 000 000 emails in Akonadi. For the baseline of 1 000 000
> emails two levels of indirection seem to be unnecessary.

That sounds reasonable. 1 million mails sounds like a reasonable baseline; many users will have fewer. My accounts have a bit more, but even with 2 million mails and a ratio of 1:2 it would be just 10000 files per folder.
 
> Regarding migration, we cannot do automatic migration on start, that would
> take too much time and resources to perform during start, so only newly
> created files would be moved to the cache folders. It would be possible to
> implement the full migration as part of akonadictl fsck though, so users
> could run it manually.
> 
> What are your opinions?

I like this idea in general. I just wonder whether to combine it with a larger default threshold for storing payloads externally, because I use

[%General]
Driver=QMYSQL
SizeThreshold=32768

and except for the leaked files that you already fixed, this has worked really well for me. Maybe 32768 is a bit too aggressive, but I really wonder whether the default of 4 KiB is the best value to choose. Maybe 8 KiB would be good, as I bet many mails are just a tad larger than 4 KiB. I think I will try to dig out a file size statistics tool to measure the mail sizes in the local Maildir of my private POP3 account with more than 1 million mails.

Of course, this change can be made independently, but as it directly affects the number of files in the filesystem cache, I thought I would mention it here. Either way, if the default value is raised there will be fewer files, so one level of indirection is enough, and the change can really be made independently. That might also be better in order to test the impact of each change separately. I would also remove my custom setting in order to test this change.
Comment 15 Daniel Vrátil 2015-08-21 01:19:18 UTC
commit cb24efd05824d7f5aa7218d086b5692eee05d6c5
Author: Dan Vrátil <dvratil@redhat.com>
Date:   Fri Aug 21 02:46:21 2015 +0200

    Refactor external payload parts handling and implement levelled cache
    
    External payload files are now stored in levelled folder hierarchy. Currently we
    implement one level of indirection using modulo 100 of Part.id(). Using modulo
    100 ensures even distribution of files into the subdirectories. The migration is
    implemented in StorageJanitor, so it can be triggered manually by running
    akonadictl fsck.
    
    Handling of access to external files has been refactored to ExternalPartStorage
    class. This class implements access to the legacy flat-cache hierarchy as well
    as simple transactional system. The transactional system allows us to rollback
    or commit changes in the external files: file created in the transaction are
    deleted when the transaction is rolled back, and file deletion is delayed until
    the transaction is committed. This allows us to control the files from outside
    PartStreamer and tie it to committing of database transaction, which is more
    likely to fail than the EPS transaction. This should prevent us from losing
    cached parts when error occurs during update.
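For illustration, a minimal sketch of the transactional pattern described in the commit message (this is not the real ExternalPartStorage API; the class and method names are assumptions):

    // Hypothetical sketch: files created inside the transaction are removed on
    // rollback, and requested deletions are delayed until commit, mirroring the
    // behaviour described above.
    #include <QFile>
    #include <QStringList>

    class ExternalFileTransaction
    {
    public:
        void fileCreated(const QString &path) { mCreated.append(path); }   // new file written in this transaction
        void fileDeleted(const QString &path) { mToDelete.append(path); }  // deletion requested, deferred to commit

        void commit()
        {
            // The database transaction succeeded: now it is safe to remove the old files.
            for (const QString &path : qAsConst(mToDelete))
                QFile::remove(path);
            mCreated.clear();
            mToDelete.clear();
        }

        void rollback()
        {
            // The database transaction failed: drop the files written during the
            // transaction and keep the old ones, so no cached part is lost.
            for (const QString &path : qAsConst(mCreated))
                QFile::remove(path);
            mCreated.clear();
            mToDelete.clear();
        }

    private:
        QStringList mCreated;
        QStringList mToDelete;
    };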