While doing some tests for bug ids https://bugs.kde.org/show_bug.cgi?id=320666 and https://bugs.kde.org/show_bug.cgi?id=374191 , I moved some image to trash to make sure that the similarities to this image are deleted. While the similarities to this image were deleted, the similarities of the image to other images were not deleted. Moving to trash thus obviously does not delete the image properties for the image moved to trash. Then, I deleted the image from trash but the image properties were still there. After inspecting the SQLite DB (shoud be the same for MySQL), I learned that the file was deleted from the hard disk but the image info in the database was not. The image is still in database with Album set to NULL and status set to hidden. Thus, the trigger which removes the image properties was not activated. This does not make any sense. Deleting an image from trash or directly should also remove the entry from the images table. Step to reproduce: 1) get the image id from some arbitrary image from the DB 2) add some image property for this image in the ImageProperties table 3) move the image to trash and delete it The image id should still be in the DB I inspected the source code. The imageviewutilities trigger the deletion of the file via DIO but the deleteItem function of the core db class is not triggered.
There is another thing: When I move the image to trash and use the recover button, a new entry is created in the images table. But the deleted image is still there.
(In reply to Mario Frank from comment #1) > There is another thing: > When I move the image to trash and use the recover button, a new entry is > created in the images table. But the deleted image is still there. I assume this is done since the former album id is not stored. Shouldn't it be possible to get the former album by fetching the path from the dtrash info? Also, moving pictures from one album to another leads to new Images table entries. This wastes many image ids. Couldn't this be solved by just setting the new album id?
Mario, you is right. In fact, we need a garbage collector, as it's explained in this file : https://bugs.kde.org/show_bug.cgi?id=317210 ...even if the DB transactions must be improved when action are processed. Gilles
(In reply to caulier.gilles from comment #3) > Mario, you is right. > > In fact, we need a garbage collector, as it's explained in this file : > > https://bugs.kde.org/show_bug.cgi?id=317210 > > ...even if the DB transactions must be improved when action are processed. > > Gilles Yes, I see. A garbage collection would be helpful. But some cleanup could be already done without garbage collection. There are some spots I wanted to improve anyway. But I will leave this for after the release. Did you see the patch (https://bugs.kde.org/attachment.cgi?id=103203) for https://bugs.kde.org/show_bug.cgi?id=374191 (context menu in tags and people sidebar? It also improves the way similarity information is handled in table view, image view and for sorting. The new approach is much more stable and the functionality is now tested quite excessively. Also, the DB handling seems to work as expected apart from the problems described in this bug here. I would like to commit the patch. Cheers, Mario
Oh, yes. The complete patch description: https://bugs.kde.org/show_bug.cgi?id=320666#c17
Created attachment 103766 [details] This patch introduces garbage collection as maintenance stage and reduces the amount of generated garbage. This patch introduces garbage collection as maintenance stage and reduces the amount of generated garbage. The stage in which the garbage collector is run is before the rebuilding of thumbnails. Description of problems and approach. The garbage that bloated my databases was quite annoying. I want to draw a sketch about where the garbage comes from: 1) Move image to trash. Everytime I delete some image, the Images table entry is set to status Removed (3) and the original album id is removed from the entry. If I restore the image from trash, a new Images entry is generated. 2) Deleting images/albums directly or from trash. The image files are deleted from hard drive but the Images entries are not removed (and also ImageTagProperties and so on. 3) Moving/renaming images creates a new Images entry. The old one is set to status Removed. 4) Deleting images does not remove the thumbnails. (by path/ uniqueHash) 5) Deleting face regions does not remove the region thumbnails (custom identifier) 6) Removing tag does not remove identities from recognition db (every identity should have the same faceEngineUUID as a tag) Here is a description about what the patch does: 1) Creating less junk: I introduced a new item status "Obsolete" and renamed the status "Removed" to "Trashed". Items are set to status Trashed if they are moved to trash. If items are deleted directly/permanently, they get the status "Obsolete". If an image is restored from trash, i search for an item entry that has status Trashed and has the same properties as the new one. If i find such one, I use this entry and set the new/old album and the status to visible. If an image is renamed/moved, I use the moveItem method of the core DB to set the new album/name of the image. This way, the ImageScanner does not think that this is a new image. The old entry is reused. This could solve the grouping problem. I cannot solve points 4 to 6 in the same easy way, explicitely not the thumbnails problem since thumbnails can be referenced by image path, image uniqueHash/file size and image path/face region (custom identifier) Thus, I made some clean-up routine for our databases. 2) Collecting junk: I implemented the DbCleaner Maintenance module. It runs at every start of digikam (if configured so in setup->misc) and removes all stale Images entries (detectable by status Obsolete). This does not take much time. But the DbCleaner can do more. In Maintenance dialog, I added a stage Database Cleanup, which can be triggered. The stage can also clean the thumbs db and recognition db. But this must be explicitely selected in the menu as this can take more time. Also, the thumbs and recognition db are never cleaned at the start of digikam. I do not want our users to wait minutes until they can work. Now to what the DbCleaner does. As already said, it removes the stale images. But let's take a step back. Getting the stale images is just one call to core db. Getting the stale thumbnails is more complicated. Getting the stale face identities is less complicated. In first phase of the DbCleaner I analyse the databases (thumbs and recognition only if enabled). Identities are stale if there is no tag in core db that has the same faceEngineUUID as the identity. Thumbnails are stale if the following holds: 1) There is no image in core DB whose file path leads to that thumbnail (FilePaths table) 2) There is no image in core DB whose uniqueHash and file size leads to that thumbnail (UniqueHashes table) 3) There is no face region of an image whose custom identifier (image file path + region) leads to that thumbnail (CustomIdentifiers table) So I first get all thumbnail ids from thumbs db into a list A and all image ids into another list B. Then I get the thumbnail ids for every image by their file path, uniqueHash/file size and also the thumbnails for the face regions and remove those thumbnail ids from my list A. The remainder in the list is thus neither connected to an image/video nor to face regions. I know that this is no really efficient way. But if, let's say a face region is deleted, I cannot delete the thumbnail since it could still be used from some other image by file path for example. When I am done with that, I first clean the core db (stale images) After that, I clean the thumbs db and after that the recognition db. So far for the main process. The progress is shown to the user and I show, what currently is done (analyse, clean core DB, clean thumbs DB, clean recognition DB). Then I tested my implementation with my database. I have got about 40000 images and my thumbnail db contains 96000 FilePath entries, 205000 UniqueHashes entries and 180000 CustomIdentifier entries. The Thumbnails entries are about 255000. File size of the database (SQLite) file is 2.9 GB About 200000 of the thumbnails are recognised as stale. Removing the thumbnails one by one per thread as it is done for thumbnail generation for example took exorbitant amount of time. To be frank (pun intended), I gave up after 2 hours. The context swithing while multi-threaded access to the thumbs db does not seem to work (well) is a pain. Then I adopted my cleaner threads to work in chunks, with chunk size modifiable (currently hard-coded). This worked better but still too much threads that wait for IO. Against all expectations, I found out that completely sequential cleaning is the fastest. If i let one worker thread remove all stale thumbnails, the process of cleaning my complete databases takes only about 8 mins on my 3 years old core i5 (with 16 GB RAM and no SSD). After vacuuming, my thumbs db has only size of 650 MB.
To all who test this patch: Please backup your databases before you test. Though I made tests without any problems, your databases differ from mine. And I do not want you to lose data. The core database is only slightly touched, i.e. really removed image entries are deleted. The most work is done in thumbnails db.
Created attachment 103773 [details] Patch including new files.
do you have by any chance a branch on a git repo to check out from? I use an external mysql server and could test it with that configuration When you delete a tagged photo and restart digikam, the tags are gone forever, right? It wouldn't be possible to restore the photo including tags from the trash. What about a retention period (e.g 30 days), or if the deletion date is not saved, just clean up once every month instead of every launch of digikam(just as an idea)
Kusi, The branch is already created : https://cgit.kde.org/digikam.git/log/?h=development/garbagecollection Anyway, if you wait a little bit, i currently re-build the AppImage bundle with this branch. Gilles Caulier
AppImage is done. Upploading to GDrive under progress. It will be online in few minutes at usual place : https://drive.google.com/drive/folders/0BzeiVr-byqt5Y0tIRWVWelRJenM New database Garbage Collector options are there : https://www.flickr.com/photos/digikam/32549923912/in/dateposted-public/ https://www.flickr.com/photos/digikam/32549923632/in/dateposted-public/ Gilles Caulier
(In reply to Kusi from comment #9) > do you have by any chance a branch on a git repo to check out from? I use an > external mysql server and could test it with that configuration > > When you delete a tagged photo and restart digikam, the tags are gone > forever, right? It wouldn't be possible to restore the photo including tags > from the trash. What about a retention period (e.g 30 days), or if the > deletion date is not saved, just clean up once every month instead of every > launch of digikam(just as an idea) Kusi, I never would automatically delete items from trash. I think this is bad usability. I only delete the items that were either removed directly (not moved to trash) or removed from trash. So, as long as you do not delete the items from trash, I will not do this either.
What about the garbage collection branch. When will we merge it to master? I identified some potential stale data, i.e. table entries that reference a non-existent image id, in core DB that can also be collected. I will implement that when I have the time.
Hi Mario, Well, merge back will be done when you decide. If more time is necessary to hack, we can postbone release date for 5.5.0. In other word, we don't have any restriction here (:=)))... Gilles
(In reply to caulier.gilles from comment #14) > Hi Mario, > > Well, merge back will be done when you decide. If more time is necessary to > hack, we can postbone release date for 5.5.0. > > In other word, we don't have any restriction here (:=)))... > > Gilles Hey Gilles, the stale data (in ImageTags) seems to be the result of some queries I made in the DB with my DB browser. I did not find more stale data in core DB that can be cleaned. I just resolved the last thing I disliked. Before 5.5, items were marked as Removed (now Trashed) and deleted after a retention period. So users may have items with that status. But the garbage collection only removes Obsolete items. I adopted the CollectionsScanner to marke trashed items as obsolete after a retention period. This way, users will also get rid of old trashed items. Also, there are no more tables in thumbs and recognition db that are not handled by the garbage collection. To conclude: in functionality, we are finished for now. I am in fact working with the garbage collection since we have the branch and did not encounter problems. I would merge the branch into the master today. Then, more people can test. No need to postpone 5.5 - except users raid us with bug requests.
Merge is done. I will close this file now
Git commit a1f67531b3269941df5ff531baa3e487cf58f1fc by Mario Frank. Committed on 08/02/2017 at 14:09. Pushed by mfrank into branch 'master'. Merged the garbage collection into master. The garbage collector is a maintenance stage that runs before the rebuild of thumbnails and must be triggered explicitely. It removes stale image entries in core db and if enabled also stale thumbnails and face identities from thumbnails and recognition DB. If configured so, the core DB part of the garbage collector removes stale image entries in core db during the start of digiKam. Note that cleaning the databases does not necessarily make them smaller as no auto-vacuum is done on the databases. The vacuuming proces differs highly between the three supported database variants (SQLite, internal MySQL and external MySQL). Thus, currently there is no automatism. Related: bug 323718 FIXED-IN: 5.5.0 M +3 -1 NEWS https://commits.kde.org/digikam/a1f67531b3269941df5ff531baa3e487cf58f1fc