Bug 374591

Summary: Deleting image only removes the file and sets the status to hidden but does not delete the image from DB [patch]
Product: [Applications] digikam Reporter: Mario Frank <mario.frank>
Component: Database-TrashAssignee: Digikam Developers <digikam-bugs-null>
Status: RESOLVED FIXED    
Severity: normal CC: caulier.gilles, kusi, mario.frank
Priority: NOR    
Version: 5.5.0   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
Latest Commit: Version Fixed In: 5.5.0
Sentry Crash Report:
Attachments: This patch introduces garbage collection as maintenance stage and reduces the amount of generated garbage.
Patch including new files.

Description Mario Frank 2017-01-05 12:52:46 UTC
While doing some tests for bug ids https://bugs.kde.org/show_bug.cgi?id=320666 and https://bugs.kde.org/show_bug.cgi?id=374191 , I moved some image to trash to make sure that the similarities to this image are deleted. 
While the similarities to this image were deleted, the similarities of the image to other images were not deleted.

Moving to trash thus obviously does not delete the image properties for the image moved to trash.

Then, I deleted the image from trash but the image properties were still there.
After inspecting the SQLite DB (shoud be the same for MySQL), I learned that the file was deleted from the hard disk but the image info in the database was not. The image is still in database with Album set to NULL and status set to hidden. 

Thus, the trigger which removes the image properties was not activated.

This does not make any sense. Deleting an image from trash or directly should also remove the entry from the images table.

Step to reproduce:
1) get the image id from some arbitrary image from the DB
2) add some image property for this image in the ImageProperties table
3) move the image to trash and delete it
The image id should still be in the DB

I inspected the source code. The imageviewutilities trigger the deletion of the file via DIO but the deleteItem function of the core db class is not triggered.
Comment 1 Mario Frank 2017-01-05 13:01:55 UTC
There is another thing:
When I move the image to trash and use the recover button, a new entry is created in the images table. But the deleted image is still there.
Comment 2 Mario Frank 2017-01-05 15:04:21 UTC
(In reply to Mario Frank from comment #1)
> There is another thing:
> When I move the image to trash and use the recover button, a new entry is
> created in the images table. But the deleted image is still there.

I assume this is done since the former album id is not stored. Shouldn't it be possible to get the former album by fetching the path from the dtrash info?

Also, moving pictures from one album to another leads to new Images table entries. This wastes many image ids. Couldn't this be solved by just setting the new album id?
Comment 3 caulier.gilles 2017-01-05 18:16:28 UTC
Mario, you is right.

In fact, we need a garbage collector, as it's explained in this file :

https://bugs.kde.org/show_bug.cgi?id=317210

...even if the DB transactions must be improved when action are processed.

Gilles
Comment 4 Mario Frank 2017-01-05 22:19:50 UTC
(In reply to caulier.gilles from comment #3)
> Mario, you is right.
> 
> In fact, we need a garbage collector, as it's explained in this file :
> 
> https://bugs.kde.org/show_bug.cgi?id=317210
> 
> ...even if the DB transactions must be improved when action are processed.
> 
> Gilles

Yes, I see. A garbage collection would be helpful. But some cleanup could be already done without garbage collection.

There are some spots I wanted to improve anyway. But I will leave this for after the release. 

Did you see the patch (https://bugs.kde.org/attachment.cgi?id=103203) for https://bugs.kde.org/show_bug.cgi?id=374191 (context menu in tags and people sidebar?

It also improves the way similarity information is handled in table view, image view and for sorting. The new approach is much more stable and the functionality is now tested quite excessively. Also, the DB handling seems to work as expected apart from the problems described in this bug here. 
I would like to commit the patch.

Cheers,
Mario
Comment 5 Mario Frank 2017-01-05 22:21:42 UTC
Oh, yes. The complete patch description: https://bugs.kde.org/show_bug.cgi?id=320666#c17
Comment 6 Mario Frank 2017-02-02 08:52:21 UTC
Created attachment 103766 [details]
This patch introduces garbage collection as maintenance stage and reduces the amount of generated garbage.

This patch introduces garbage collection as maintenance stage and reduces the amount of generated garbage.
The stage in which the garbage collector is run is before the rebuilding of thumbnails.

Description of problems and approach.

The garbage that bloated my databases was quite annoying. I want to draw
a sketch about where the garbage comes from:

1) Move image to trash. Everytime I delete some image, the Images table
entry is set to status Removed (3) and
the original album id is removed from the entry.
If I restore the image from trash, a new Images entry is generated.

2) Deleting images/albums directly or from trash. The image files are
deleted from hard drive but the Images entries
are not removed (and also ImageTagProperties and so on.

3) Moving/renaming images creates a new Images entry. The old one is set
to status Removed.

4) Deleting images does not remove the thumbnails. (by path/ uniqueHash)

5) Deleting face regions does not remove the region thumbnails (custom
identifier)

6) Removing tag does not remove identities from recognition db (every
identity should have the same faceEngineUUID as a tag)

Here is a description about what the patch does:
1) Creating less junk:
I introduced a new item status "Obsolete" and renamed the status
"Removed" to "Trashed".
Items are set to status Trashed if they are moved to trash. If items are
deleted directly/permanently,
they get the status "Obsolete".

If an image is restored from trash, i search for an item entry that has
status Trashed and has the same properties
as the new one. If i find such one, I use this entry and set the new/old
album and the status to visible.

If an image is renamed/moved, I use the moveItem method of the core DB
to set the new album/name of the image.
This way, the ImageScanner does not think that this is a new image. The
old entry is reused.
This could solve the grouping problem.

I cannot solve points 4 to 6 in the same easy way, explicitely not the
thumbnails problem since thumbnails can
be referenced by image path, image uniqueHash/file size and image
path/face region (custom identifier)

Thus, I made some clean-up routine for our databases.

2) Collecting junk:
I implemented the DbCleaner Maintenance module. It runs at every start
of digikam (if configured so in setup->misc) and removes
all stale Images entries (detectable by status Obsolete). This does not
take much time. But the DbCleaner can do more.
In Maintenance dialog, I added a stage Database Cleanup, which can be
triggered. The stage can also clean the thumbs db
and recognition db. But this must be explicitely selected in the menu as
this can take more time. Also, the thumbs and
recognition db are never cleaned at the start of digikam. I do not want
our users to wait minutes until they can work.

Now to what the DbCleaner does. As already said, it removes the stale
images. But let's take a step back. Getting the
stale images is just one call to core db. Getting the stale thumbnails
is more complicated. Getting the stale face identities
is less complicated. In first phase of the DbCleaner I analyse the
databases (thumbs and recognition only if enabled).
Identities are stale if there is no tag in core db that has the same
faceEngineUUID as the identity.
Thumbnails are stale if the following holds:
    1) There is no image in core DB whose file path leads to that
thumbnail (FilePaths table)
    2) There is no image in core DB whose uniqueHash and file size leads
to that thumbnail (UniqueHashes table)
    3) There is no face region of an image whose custom identifier
(image file path + region) leads to that thumbnail (CustomIdentifiers table)

So I first get all thumbnail ids from thumbs db into a list A and all
image ids into another list B.
Then I get the thumbnail ids for every image by their file path,
uniqueHash/file size and also
the thumbnails for the face regions and remove those thumbnail ids from
my list A.
The remainder in the list is thus neither connected to an image/video
nor to face regions.
I know that this is no really efficient way. But if, let's say a face
region is deleted, I cannot delete the thumbnail since it could
still be used from some other image by file path for example.

When I am done with that, I first clean the core db (stale images) After
that, I clean the thumbs db and after that the
recognition db. So far for the main process. The progress is shown to
the user and I show, what currently is done (analyse,
clean core DB, clean thumbs DB, clean recognition DB).

Then I tested my implementation with my database. I have got about 40000
images and my thumbnail db contains
96000 FilePath entries, 205000 UniqueHashes entries and 180000
CustomIdentifier entries. The Thumbnails entries are
about 255000. File size of the database (SQLite) file is 2.9 GB

About 200000 of the thumbnails are recognised as stale. Removing the
thumbnails one by one per thread as it is done for thumbnail
generation for example took exorbitant amount of time. To be frank (pun
intended), I gave up after 2 hours. The context swithing while
multi-threaded access to the thumbs db does not seem to work (well) is a
pain. Then I adopted my cleaner threads to work in chunks,
with chunk size modifiable (currently hard-coded). This worked better
but still too much threads that wait for IO. Against all expectations,
I found out that completely sequential cleaning is the fastest. If i let
one worker thread remove all stale thumbnails, the process of cleaning
my complete databases takes only about 8 mins on my 3 years old core i5
(with 16 GB RAM and no SSD). After vacuuming, my thumbs db has only size 
of 650 MB.
Comment 7 Mario Frank 2017-02-02 08:54:56 UTC
To all who test this patch: Please backup your databases before you test. Though I made tests without any problems, your databases differ from mine. And I do not want you to lose data. The core database is only slightly touched, i.e. really removed image entries are deleted. The most work is done in thumbnails db.
Comment 8 Mario Frank 2017-02-02 12:16:09 UTC
Created attachment 103773 [details]
Patch including new files.
Comment 9 Kusi 2017-02-04 11:37:50 UTC
do you have by any chance a branch on a git repo to check out from? I use an external mysql server and could test it with that configuration

When you delete a tagged photo and restart digikam, the tags are gone forever, right? It wouldn't be possible to restore the photo including tags from the trash. What about a retention period (e.g 30 days), or if the deletion date is not saved, just clean up once every month instead of every launch of digikam(just as an idea)
Comment 10 caulier.gilles 2017-02-04 12:31:55 UTC
Kusi,

The branch is already created :

https://cgit.kde.org/digikam.git/log/?h=development/garbagecollection

Anyway, if you wait a little bit, i currently re-build the AppImage bundle with this branch.

Gilles Caulier
Comment 11 caulier.gilles 2017-02-04 14:03:54 UTC
AppImage is done.
Upploading to GDrive under progress. It will be online in few minutes at usual place :

https://drive.google.com/drive/folders/0BzeiVr-byqt5Y0tIRWVWelRJenM

New database Garbage Collector options are there :

https://www.flickr.com/photos/digikam/32549923912/in/dateposted-public/
https://www.flickr.com/photos/digikam/32549923632/in/dateposted-public/

Gilles Caulier
Comment 12 Mario Frank 2017-02-04 15:37:23 UTC
(In reply to Kusi from comment #9)
> do you have by any chance a branch on a git repo to check out from? I use an
> external mysql server and could test it with that configuration
> 
> When you delete a tagged photo and restart digikam, the tags are gone
> forever, right? It wouldn't be possible to restore the photo including tags
> from the trash. What about a retention period (e.g 30 days), or if the
> deletion date is not saved, just clean up once every month instead of every
> launch of digikam(just as an idea)

Kusi,
I never would automatically delete items from trash. I think this is bad usability. I only delete the items that were either removed directly (not moved to trash) or removed from trash. So, as long as you do not delete the items from trash, I will not do this either.
Comment 13 Mario Frank 2017-02-08 08:23:44 UTC
What about the garbage collection branch. When will we merge it to master?
I identified some potential stale data, i.e. table entries that reference a non-existent image id, in core DB that can also be collected.
I will implement that when I have the time.
Comment 14 caulier.gilles 2017-02-08 11:15:11 UTC
Hi Mario,

Well, merge back will be done when you decide. If more time is necessary to hack, we can postbone release date for 5.5.0.

In other word, we don't have any restriction here (:=)))...

Gilles
Comment 15 Mario Frank 2017-02-08 13:28:26 UTC
(In reply to caulier.gilles from comment #14)
> Hi Mario,
> 
> Well, merge back will be done when you decide. If more time is necessary to
> hack, we can postbone release date for 5.5.0.
> 
> In other word, we don't have any restriction here (:=)))...
> 
> Gilles

Hey Gilles,
the stale data (in ImageTags) seems to be the result of some queries I made in the DB with my DB browser.
I did not find more stale data in core DB that can be cleaned.
I just resolved the last thing I disliked. Before 5.5, items were marked as Removed (now Trashed) and deleted after a retention period. So users may have items with that status. But the garbage collection only removes Obsolete items. 
I adopted the CollectionsScanner to marke trashed items as obsolete after a retention period. This way, users will also get rid of old trashed items.

Also, there are no more tables in thumbs and recognition db that are not handled by the garbage collection.

To conclude: in functionality, we are finished for now. I am in fact working with the garbage collection since we have the branch and did not encounter problems. I would merge the branch into the master today. Then, more people can test.

No need to postpone 5.5 - except users raid us with bug requests.
Comment 16 Mario Frank 2017-02-08 14:05:54 UTC
Merge is done. I will close this file now
Comment 17 Mario Frank 2017-02-08 14:17:51 UTC
Git commit a1f67531b3269941df5ff531baa3e487cf58f1fc by Mario Frank.
Committed on 08/02/2017 at 14:09.
Pushed by mfrank into branch 'master'.

Merged the garbage collection into master. The garbage collector is a maintenance stage that runs before the rebuild of thumbnails and must be triggered explicitely.
It removes stale image entries in core db and if enabled also stale thumbnails and face identities from thumbnails and recognition DB.
If configured so, the core DB part of the garbage collector removes stale image entries in core db during the start of digiKam.
Note that cleaning the databases does not necessarily make them smaller as no auto-vacuum is done on the databases. The vacuuming proces differs highly between the three
supported database variants (SQLite, internal MySQL and external MySQL). Thus, currently there is no automatism.
Related: bug 323718
FIXED-IN: 5.5.0

M  +3    -1    NEWS

https://commits.kde.org/digikam/a1f67531b3269941df5ff531baa3e487cf58f1fc