Bug 375573 - Don't reset/destroy context after deleting one image among a set of duplicates
Summary: Don't reset/destroy context after deleting one image among a set of duplicates
Status: RESOLVED FIXED
Alias: None
Product: digikam
Classification: Applications
Component: Searches-Similarity
Version: 5.5.0
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Digikam Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-26 05:12 UTC by Dan Dascalescu
Modified: 2017-02-04 15:48 UTC

See Also:
Latest Commit:
Version Fixed In: 5.5.0


Description Dan Dascalescu 2017-01-26 05:12:09 UTC
I have a collection with many similar images (~20 takes per shot) and the goal is to quickly go through duplicates and delete all but one of the images.

After I click Find Duplicates, I click on "Ref. images", then double click on the first thumbnail to go into the Preview. As I'm navigating with left/right arrow through the duplicates, I press Shift+Delete to delete one of them that's clearly worse than the ones I've seen so far. I would like to repeat this process until I end up with only one image.

The problem is that after I delete an image, digiKam unhelpfully displays "Failed to load image" and kicks me out of that duplicate set. The keyboard focus is also lost.

What would make a lot more sense is to let me navigate with the arrows through the other images in the set of duplicates, and keep deleting them.

I've tried Alt+3 to set flags, and while this is a workaround, it's unnecessary, I think. I don't see a good reason to take the user out of the flow of deleting duplicates in a set after they've deleted the first one.
Comment 1 Mario Frank 2017-01-26 09:17:19 UTC
Hey Dan,

There was a bug before 5.4 with quite a long discussion ( https://bugs.kde.org/show_bug.cgi?id=261417 ). To make it short:
When an image from a duplicates album is deleted, the count of duplicates for this album has to be adjusted; otherwise we show wrong information. The deleted image may also be a member of other duplicates albums, so those have to be adjusted, too. Some of the albums may even vanish if this was the only duplicate of the reference image.

Following this, I took the most performant approach: all duplicates albums that contained the image are rescanned for duplicates and then refreshed. This may take some time, depending on the images involved. During the rescan, the image view loses its connection to the duplicates album, since the album does not exist during the rescan but only afterwards.
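
To illustrate why a single deletion can touch several duplicates albums, here is a minimal Python sketch. It is not digiKam's C++ code: the dictionary layout, the image ids, and the name `affected_clusters` are hypothetical, and the sketch assumes each album's member set includes its reference image.

```python
from typing import Dict, List, Set

# Hypothetical data: each duplicates album is keyed by its reference image id
# and holds the ids of all images in that album (reference image included).
clusters: Dict[int, Set[int]] = {
    1: {1, 10, 11},   # reference image 1 with duplicates 10 and 11
    2: {2, 10},       # image 10 is also a duplicate of reference image 2
    3: {3, 12, 13},
}


def affected_clusters(clusters: Dict[int, Set[int]], deleted_id: int) -> List[int]:
    """Return the reference ids of all albums that contain the deleted image."""
    return sorted(ref for ref, members in clusters.items() if deleted_id in members)


# Deleting image 10 makes two albums dirty. Album 2 would be left with only
# its reference image, so it would vanish once the albums are refreshed.
dirty = affected_clusters(clusters, 10)
```

In this toy layout, every album returned by `affected_clusters` is one that the approach described above would rescan.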

So, what you experience is the lost connection.

I agree that the workflow is interrupted in this case. If only one duplicates album needs to be adjusted, just decrementing the image count would be feasible. But as soon as another duplicates album becomes dirty through the deletion, a rescan should definitely be done, I think.
Delaying the rescan would be technically possible. The problem is that we cannot estimate how much time a user typically needs before the rescan should start.
If a duplicates album has 100 items and you delete one image per second, the delay is okay. But a 10-second delay, for example, may again interrupt the workflow of users.
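
The delayed rescan can be pictured as a debounce: every deletion restarts a waiting period, and the rescan only becomes due once the user has paused for that long. The following Python sketch is an illustration under assumptions, not digiKam code; the class `DelayedRescan`, its method names, and the 10-second delay are all hypothetical. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from typing import Optional


class DelayedRescan:
    """Debounce: a rescan is due only after `delay` seconds without a new deletion."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self.last_deletion: Optional[float] = None  # time of the latest deletion

    def notify_deletion(self, now: float) -> None:
        # Every deletion restarts the waiting period.
        self.last_deletion = now

    def rescan_due(self, now: float) -> bool:
        if self.last_deletion is None:
            return False  # nothing was deleted since the last rescan
        return now - self.last_deletion >= self.delay

    def mark_rescanned(self) -> None:
        self.last_deletion = None


timer = DelayedRescan(delay=10.0)
for t in (0.0, 1.0, 2.0):          # one deletion per second
    timer.notify_deletion(t)

early = timer.rescan_due(5.0)      # only 3 s since the last deletion
late = timer.rescan_due(12.0)      # 10 s of inactivity have passed
```

The open question raised above remains: the sketch does not say which `delay` value suits real users, only how such a delay would behave.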

Any comments/opinions on this?
Comment 2 Dan Dascalescu 2017-01-27 04:55:57 UTC
Hey Mario,

Thank you for the explanation. I understand the tradeoff - accuracy in reporting the number of dupes, vs. speedy processing. The solution I propose revolves around lazy calculation - does the user care more about a precise number shown next to the album *when they get to see it*, or to be able to move on to examine the other duplicates in the cluster?

I mentioned "when they get to see it" because after the user deletes one of the duplicates, the list of duplicate clusters in the left pane always scrolls to the top (IMO this could be improved to try to keep the scroll position, but digiKam probably just re-sorts the list), so if they were working on a duplicate cluster below the fold (i.e. if they have scrolled down at all), the number of duplicates in that album won't be visible anyway. In fact, when you deal with many clusters of duplicates, only those items at the top, according to the sort order (Ref. images filename, # of items, or Avg. similarity) will be visible.

Not sure what you meant by "one duplicates album" (needs to be adjusted) - did you mean a cluster (in DUFF terminology, http://duff.dreda.org/) of duplicates (which may be spread across different albums), or an album that contains duplicates, so the count of items in the album needs to be adjusted? In the latter case, that count is even farther from the user's attention, because the user is in the Fuzzy tab, vs. in the Albums tab. Could the recalculation of counts be done only once, when the user leaves the Fuzzy tab?

Also, there are two different scenarios I see when it comes to deleting duplicates:

1) Deleting images in duplicate clusters one by one, while the user looks at the picture in Preview Mode, to examine it in as large of a size as possible. In this case, only one image is deleted at a time. Would counts be easier to decrement in this case?

2) Staying in Thumbnails or Table, selecting multiple images, and deleting them at once.

Finally, a question about "the deleted image may be member of other duplicates albums" (this relates to the cluster vs. album distinction) - is the duplicate relationship transitive? I mean, if images A and B are dupes within the similarity range, and B is part of another cluster of duplicates, A should be part of that cluster too, which means only two counts need to be updated: the number of dupes in that cluster, and the number of items in the album the image belongs to.
Comment 3 Mario Frank 2017-01-27 07:03:39 UTC
Hey Dan,

I will answer inline since some things came to my mind.

(In reply to Dan Dascalescu from comment #2)
> Hey Mario,
> 
> Thank you for the explanation. I understand the tradeoff - accuracy in
> reporting the number of dupes, vs. speedy processing. The solution I propose
> revolves around lazy calculation - does the user care more about a precise
> number shown next to the album *when they get to see it*, or to be able to
> move on to examine the other duplicates in the cluster?

I would expect the latter to be more important than the accuracy. Thus,
delaying is an option for me.

> 
> I mentioned "when they get to see it" because after the user deletes one of
> the duplicates, the list of duplicate clusters in the left pane always
> scrolls to the top (IMO this could be improved to try to keep the scroll
> position, but digiKam probably just re-sorts the list), so if they were
> working on a duplicate cluster below the fold (i.e. if they have scrolled
> down at all), the number of duplicates in that album won't be visible
> anyway. In fact, when you deal with many clusters of duplicates, only those
> items at the top, according to the sort order (Ref. images filename, # of
> items, or Avg. similarity) will be visible.

Okay, let's switch to your terminology. With duplicates albums, we refer to
what you call duplicates clusters (internally called search albums), i.e.
the entries in the left table - one duplicates album is one entry here.
Scrolling to the top is really annoying. This could be resolved,
but I will come to that later.

> 
> Not sure what you meant by "one duplicates album" (needs to be adjusted) -
> did you mean a cluster (in DUFF terminology, http://duff.dreda.org/) of
> duplicates (which may be spread across different albums), or an album that
> contains duplicates, so the count of items in the album needs to be
> adjusted? In the latter case, that count is even farther from the user's
> attention, because the user is in the Fuzzy tab, vs. in the Albums tab.
> Could the recalculation of counts be done only once, when the user leaves
> the Fuzzy tab?
> 
> Also, there are two different scenarios I see when it comes to deleting
> duplicates:
> 
> 1) Deleting images in duplicate clusters one by one, while the user looks at
> the picture in Preview Mode, to examine it in as large of a size as
> possible. In this case, only one image is deleted at a time. Would counts be
> easier to decrement in this case?

Yes, this was my first approach when I tried to fix the referenced bug.
But since the image should also vanish from other duplicates clusters,
I would have had to decrement the counts there, too. However, the image count of an
internal search album is defined as the number of image ids it holds,
and the cluster list does not know how many of those images still exist.
Nevertheless, it is technically possible to let the cluster list know which images
still exist and which do not. But then the average similarity is no longer correct,
as it is calculated over the complete set of images.
This could also be solved, because I introduced the similarities between images
into the database shortly before the release of 5.4.

> 
> 2) Staying in Thumbnails or Table, selecting multiple images, and deleting
> them at once.
> 
> Finally, question about "the deleted image may be member of other duplicates
> albums" (this relates to the cluster vs. album distinction) - is the
> duplicate relationship transitive? I mean, if images A and B are dupes
> within the similarity range, and B is part of another cluster of duplicates,
> A should be part of that cluster too, which means only two counts need to be
> updated: the number of dupes in that cluster, and the number of items in the
> album the image belongs to.

Theoretically, you are right. If image A is a duplicate of reference images
B and C, the images B and C have *some* similarity, too. But as in audio streams -
if stream a is part of stream b and c, the latter streams have *some* similarity
in *some* position. Perhaps the similar parts are only 2 %. Depending on the given
similarity range, this similarity is ignored. We cannot use transitive closures here.
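
A small worked example may make the point concrete. The Python sketch below uses Jaccard similarity over made-up feature sets as a stand-in measure; this is an assumption for illustration only and not the measure digiKam uses. The sets `a`, `b`, `c`, the threshold, and the function names are hypothetical.

```python
from typing import Set


def jaccard(x: Set[int], y: Set[int]) -> float:
    """Share of features the two sets have in common."""
    return len(x & y) / len(x | y)


def is_duplicate(x: Set[int], y: Set[int], threshold: float = 0.5) -> bool:
    return jaccard(x, y) >= threshold


# Image A shares half of its features with reference image B and the
# other half with reference image C.
a = {1, 2, 3, 4, 5, 6}
b = {1, 2, 3, 4, 7, 8}
c = {3, 4, 5, 6, 9, 10}

a_dup_of_b = is_duplicate(a, b)   # similarity 4/8 = 0.5
a_dup_of_c = is_duplicate(a, c)   # similarity 4/8 = 0.5
b_dup_of_c = is_duplicate(b, c)   # similarity 2/10 = 0.2, below the threshold
```

A is within the similarity range of both B and C, yet B and C are not within range of each other, so the two clusters cannot simply be merged.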

So, to sum up:
If we have duplicates cluster A and we delete some image that is also part of duplicates
cluster B, we need to update both clusters in some way: rescanning or decrementing counts.
If we delete the reference image of cluster A itself, the cluster would currently vanish.
As a consequence, the internal search album is removed and you lose context. This is a problem
which was not addressed in the referenced bug. And this is a real disturbance in the workflow.

I would thus propose the following: the removal of an image from some duplicates album should
signal the list of duplicates clusters to update. The count of images in each cluster is recalculated
by checking which images still exist. At the same time, the new average similarity
is calculated from the similarities of the remaining images to the reference image.
All duplicates clusters which contain only one image are removed from the list, as they are not relevant
anymore. All of this should be technically quite easy to implement before the release of 5.5.
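
The proposed update can be sketched in a few lines of Python. This is a sketch under assumptions rather than the actual implementation: the function `refresh_cluster`, the data layout, and the similarity values are hypothetical, the reference image is assumed to still exist and to be counted as part of the cluster, and the stored similarities are assumed to exclude the reference image itself.

```python
from typing import Dict, Optional, Set, Tuple


def refresh_cluster(
    similarities: Dict[int, float],
    existing: Set[int],
) -> Optional[Tuple[int, float]]:
    """Recompute item count and average similarity for one duplicates cluster.

    `similarities` maps each duplicate's image id to its stored similarity
    to the reference image; `existing` holds the ids of images that still exist.
    Returns None when the cluster should be removed from the list.
    """
    remaining = {img: sim for img, sim in similarities.items() if img in existing}
    count = len(remaining) + 1          # remaining duplicates plus the reference
    if count <= 1:
        return None                     # only the reference is left: hide the cluster
    average = sum(remaining.values()) / len(remaining)
    return count, average


stored = {10: 90.0, 11: 80.0, 12: 70.0}               # similarity in percent
after_one_delete = refresh_cluster(stored, {1, 10, 11})   # image 12 was deleted
after_all_deleted = refresh_cluster(stored, {1})          # every duplicate deleted
```

Because the stored similarities are kept after a deletion, the average can be recomputed from the remaining images alone, without a rescan.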

What do the other devs think?

If this is confirmed, I would do that after I am finished with my small garbage collection project.
Comment 4 caulier.gilles 2017-02-04 14:10:46 UTC
Mario,

I read your proposal from comment #3 and it sounds fine to me.

Gilles
Comment 5 Mario Frank 2017-02-04 15:48:34 UTC
Git commit 7ceca1f172828e48b47c5088b61b2452b7820e52 by Mario Frank.
Committed on 04/02/2017 at 15:47.
Pushed by mfrank into branch 'master'.

We do not rescan for duplicates if an image is deleted any more.
Instead, all duplicates albums in left pane are updated, i.e. the items count,
and average similarity are recalculated. If only one duplicate is left,
the duplicates album is hidden. This solves the problem of losing context
due to the rebuild of the SAlbums. I see no other good technical
possibility of preserving the context since the SAlbums are deleted automatically.
Also, the similarities to images are not deleted any more. Otherwise the calculation
of the average similarity would be wrong. We will take care of the similarity values
in garbage collection branch.
FIXED-IN: 5.5.0

M  +2    -1    NEWS
M  +9    -20   libs/album/albummanager.cpp
M  +1    -1    libs/album/albummanager.h
M  +17   -0    libs/database/item/imageinfo.cpp
M  +5    -0    libs/database/item/imageinfo.h
M  +18   -0    utilities/fuzzysearch/findduplicatesalbum.cpp
M  +4    -0    utilities/fuzzysearch/findduplicatesalbum.h
M  +65   -16   utilities/fuzzysearch/findduplicatesalbumitem.cpp
M  +10   -0    utilities/fuzzysearch/findduplicatesalbumitem.h
M  +4    -13   utilities/fuzzysearch/findduplicatesview.cpp
M  +1    -1    utilities/fuzzysearch/findduplicatesview.h

https://commits.kde.org/digikam/7ceca1f172828e48b47c5088b61b2452b7820e52