Bug 369051 - Too low similarity threshold in fuzzy/duplicate search bloats the results with potentially unwished high-similarity results [patch]
Status: RESOLVED FIXED
Alias: None
Product: digikam
Classification: Applications
Component: Searches-Similarity
Version: 5.1.0
Platform: Arch Linux Linux
Importance: NOR wishlist
Target Milestone: ---
Assignee: Digikam Developers
Reported: 2016-09-19 14:26 UTC by Mario Frank
Modified: 2016-11-17 11:34 UTC
CC List: 2 users

Version Fixed In: 5.4.0

Attachments
The patch for introducing a similarity interval (35.73 KB, patch)
2016-09-19 14:28 UTC, Mario Frank

Description Mario Frank 2016-09-19 14:26:27 UTC
With many pictures, including variants of one picture in different quality (e.g. due to resizing, conversion, or collage creation), the lower-quality pictures may only be found with a low similarity threshold (e.g. 45 %). But the result set will then contain all pictures with a similarity between 45 % and 100 %. This can make the search for low-quality variants frustrating. Being able to specify a maximum similarity may solve the problem.
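
The requested behaviour can be sketched as a simple interval filter. This is only an illustration in plain C++, not digiKam code: the `Match` struct and the `filterByRange` function are hypothetical names, and similarities are assumed to be percentages.

```cpp
#include <cassert>
#include <vector>

// Hypothetical result record: an image id plus its similarity to the
// reference image, in percent (assumed range 0-100).
struct Match
{
    long long imageId;
    double    similarity;
};

// Keep only the matches whose similarity lies in [minPercent, maxPercent].
// With maxPercent = 100 this degenerates to the current threshold-only search.
std::vector<Match> filterByRange(const std::vector<Match>& matches,
                                 double minPercent, double maxPercent)
{
    assert(minPercent <= maxPercent);

    std::vector<Match> kept;

    for (const Match& m : matches)
    {
        if (m.similarity >= minPercent && m.similarity <= maxPercent)
        {
            kept.push_back(m);
        }
    }

    return kept;
}
```

For example, filtering with a range of 45 to 60 would drop the near-identical images (97 %, 100 %) and keep only the low-quality variants.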

Reproducible: Always

Steps to Reproduce:
1. Have many series of pictures you want to keep and some lower-quality variants you want to get rid of.
2. Start a duplicate search with a threshold of, let's say, 40 %

Actual Results:  
You will get all pictures with a similarity above 40 %

Expected Results:  
It is designed to do that. But having an option to specify a maximum similarity could be more convenient.

I implemented and tested that. Also, I can provide a patch file against the master branch.
Here is the local commit message describing the implementation:
"Extended the findduplicatesview and fuzzysearchview with an
 additional QSpinBox which denotes the maximum similarity. The new QSpinBox
 has a minimum value that is the current value of the minimal similarity
 threshold. When the minimum threshold is altered, the range of the new
 QSpinBox is updated. If the minimum threshold is increased beyond the current
 value of the new QSpinBox, the value of the new QSpinBox is increased
 automatically. In the fuzzysearchview, altering the maximum similarity also
 triggers the rebuild of the similar images album. The extension can be highly
 valuable if you knowingly want to ignore almost identical images but want to
 find images that have a similarity of, let's say 50-60%, due to resizing,
 cropping or something similar, without bloating your image pane."
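
The coupling between the two spin boxes described in the commit message can be modelled roughly as follows. This is a sketch in plain C++ without Qt, not the code from the patch: `SimilarityRange` and its members are hypothetical names, and the 0-100 percent bounds and the default values are assumptions.

```cpp
#include <algorithm>
#include <cassert>

// Plain stand-in for the two coupled spin boxes (values in percent).
struct SimilarityRange
{
    int minValue      = 40;   // value of the minimum-similarity box (assumed default)
    int maxValue      = 100;  // value of the maximum-similarity box (assumed default)
    int maxLowerBound = 40;   // lowest value the maximum box currently accepts

    // Called when the user changes the minimum threshold.
    void setMinimum(int value)
    {
        minValue      = std::max(0, std::min(value, 100));
        maxLowerBound = minValue;                           // update the range of the max box
        maxValue      = std::max(maxValue, maxLowerBound);  // raise max automatically if needed
        assert(minValue <= maxValue);
    }

    // Called when the user changes the maximum similarity.
    void setMaximum(int value)
    {
        maxValue = std::max(maxLowerBound, std::min(value, 100));
        assert(minValue <= maxValue);
    }
};
```

In this model, raising the minimum to 70 while the maximum is 60 pulls the maximum up to 70, and a later attempt to set the maximum to 50 is rejected by the lower bound.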
Comment 1 Mario Frank 2016-09-19 14:28:27 UTC
Created attachment 101176
The patch for introducing a similarity interval
Comment 2 caulier.gilles 2016-11-04 17:15:45 UTC
Mario,

The patch is very interesting and well implemented.

I plan to introduce your code after 5.3.0.

Q: currently, the icon view of fuzzy search results is not sorted by average similarity. All items found are mixed. It could be a good idea to sort the items in this view; this would increase usability. Your viewpoint?

Best

Gilles Caulier
Comment 3 Mario Frank 2016-11-07 09:00:23 UTC
Hey Gilles,
that is good news. I agree with you that usability would improve by ordering the list of results which, as I understand it, is the one in the left pane where the reference image and the count of similar images are shown.
But introducing an order here means changing the signatures of the functions in haariface. Since QMap is automatically sorted by its keys, we could use this to introduce an order into the result set. One quite easy way would be to wrap the QMap<qlonglong,QList<qlonglong>> as the value of an avg-similarity map. This would surely increase memory consumption during the search. But the automatic ordering by similarity would avoid a significant increase in runtime.
After a quick look at the source code with grep, I found no possible conflicts with other files concerning the definition of the result set. Changing the return value types in haariface should most likely be safe. Should I propose another patch for this issue?
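
A rough sketch of the nested-map idea, using `std::map` as a stand-in for QMap (both keep their keys sorted in ascending order). This is not the haariface code: the type aliases and `addGroup` are hypothetical, and the average is assumed to be taken over the per-duplicate similarity values.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using ImageId      = long long;
// Current shape of the result: reference image -> list of similar images.
using DuplicateMap = std::map<ImageId, std::vector<ImageId>>;
// Proposed wrapper: average similarity -> groups with that average.
// The inner map keeps groups apart that happen to share the same average.
using OrderedDuplicates = std::map<double, DuplicateMap>;

// Insert one group under its average similarity; the map orders it on insert.
void addGroup(OrderedDuplicates& out, ImageId reference,
              const std::vector<ImageId>& duplicates,
              const std::vector<double>& similarities)
{
    assert(!similarities.empty() && similarities.size() == duplicates.size());

    double sum = 0.0;

    for (double s : similarities)
    {
        sum += s;
    }

    const double average = sum / static_cast<double>(similarities.size());
    out[average][reference] = duplicates;
}
```

Iterating the outer map from `rbegin()` would then visit the groups with the highest average similarity first, without a separate sorting pass.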
Comment 4 caulier.gilles 2016-11-07 10:43:06 UTC
Yes, another patch in a separate report, please.
Thanks in advance

Gilles
Comment 5 caulier.gilles 2016-11-10 04:52:39 UTC
Git commit afe577f0b297a343ab412ce95c1f75303edfb18b by Gilles Caulier.
Committed on 10/11/2016 at 04:48.
Pushed by cgilles into branch 'master'.

Apply big patch #101176 from Mario Frank

This one extended the findduplicatesview and fuzzysearchview with an
additional QSpinBox which denotes the maximum similarity. The new QSpinBox
has a minimum value that is the current value of the minimal similarity
threshold. When the minimum threshold is altered, the range of the new
QSpinBox is updated. If the minimum threshold is increased beyond the current
value of the new QSpinBox, the value of the new QSpinBox is increased
automatically. In the fuzzysearchview, altering the maximum similarity also
triggers the rebuild of the similar images album. The extension can be highly
valuable if you knowingly want to ignore almost identical images but want to
find images that have a similarity of, let's say 50-60%, due to resizing,
cropping or something similar, without bloating your image pane.
FIXED-IN: 5.4.0
CCMAIL: frank@uni-potsdam.de

M  +2    -0    app/utils/searchmodificationhelper.cpp
M  +1    -0    app/utils/searchmodificationhelper.h
M  +4    -3    libs/database/dbjobs/dbjob.cpp
M  +16   -5    libs/database/dbjobs/dbjobinfo.cpp
M  +7    -3    libs/database/dbjobs/dbjobinfo.h
M  +27   -16   libs/database/haar/haariface.cpp
M  +9    -8    libs/database/haar/haariface.h
M  +9    -2    libs/database/item/imagelister.cpp
M  +53   -25   utilities/fuzzysearch/findduplicatesview.cpp
M  +1    -0    utilities/fuzzysearch/findduplicatesview.h
M  +58   -11   utilities/fuzzysearch/fuzzysearchview.cpp
M  +2    -1    utilities/fuzzysearch/fuzzysearchview.h
M  +16   -10   utilities/maintenance/duplicatesfinder.cpp
M  +2    -2    utilities/maintenance/duplicatesfinder.h

http://commits.kde.org/digikam/afe577f0b297a343ab412ce95c1f75303edfb18b
Comment 6 caulier.gilles 2016-11-10 05:08:48 UTC
Mario,

Your patch is now applied to the current implementation and will be available in the next release, 5.4.0.

The next step for me is to review your new patch from bug #372217. Note that your next patch should certainly close bug #302923 (please confirm).

In parallel, can you check what can be done to further improve the duplicate search tool with:

- bug #261417 : the searches album counter is not updated.
- bug #353331 : typically this one can be certainly closed as we can limit search to a specific physical or virtual album. Please just review to confirm.
- bug #207188 : as I remember, the algorithm that processes fingerprints over images takes the color content into account (otherwise it would make no sense...). So I'm not sure if this report is valid...
- bug #274360 : I cannot figure out why some kinds of image types are ignored. All image formats supported by digiKam should be processed during fingerprint computation and searches.

Again, thanks for your contributions. I appreciate the quality of your patches, which are a pleasure to review.
Comment 7 caulier.gilles 2016-11-10 05:15:12 UTC
>The next step for me is to review your new patch from bug #372217. Note that your
>next patch should certainly close bug #302923 (please confirm).

I'll answer myself:

your patch from bug #372217 cannot solve bug #302923, because the patch is dedicated to sorting the search albums in the left sidebar, not the icon view in the center.

I would appreciate a patch for the icon-view model/view to be able to sort by similarity level. Thanks in advance

Gilles Caulier
Comment 8 Mario Frank 2016-11-10 09:41:55 UTC
Hey Gilles,
Many thanks for the judgement about the quality of my patches.
I will try to fix what I can. Some of the "bugs" do not seem hard to fix. Others could be more complex.
Comment 9 Mario Frank 2016-11-10 09:43:12 UTC
By the way: the CCMAIL is incorrect. The correct one is mario.frank@uni-potsdam.de. If the dot should be a problem, just use mafrank@uni-potsdam.de.
Comment 10 Barbara Scheffner 2016-11-16 05:56:51 UTC
Before I update the doc accordingly: shouldn't the labeling be changed now to "Similarity range" or at least "Thresholds"?
Comment 11 Mario Frank 2016-11-17 11:34:02 UTC
I agree, Wolfgang. Similarity range is a better description here.
Moreover, I just realised that it is not possible to set a range in the maintenance dialog. I will open a new report for both parts and submit a patch.