496238 – Similarity search engine will not find effectively identical images if minor variance exists (i.e. contrast)

Status:	REPORTED

Alias:	None

Product:	digikam
Classification:	Applications
Component:	Searches-Similarity
Version First Reported In:	8.5.0
Platform:	Microsoft Windows Microsoft Windows

Importance:	NOR normal
Target Milestone:	---
Assignee:	Digikam Developers

URL:
Keywords:

Depends on:
Blocks:

Reported:	2024-11-13 21:57 UTC by Roland
Modified:	2025-04-11 18:13 UTC
CC List:	2 users

See Also:
Latest Commit:
Version Fixed/Implemented In:
Sentry Crash Report:

Description Roland 2024-11-13 21:57:12 UTC

SUMMARY
In many cases, I have multiple instances of a photo in an album, where perhaps the content was merged with one set preprocessed (e.g. converted to another format, or auto-contrast adjusted), and the similarity detection engine misses these, even with the range set to an absurdly low value (50%).

STEPS TO REPRODUCE
1. As above, load several variants of an image, e.g. with a few pixels of crop difference or a contrast change. I am not sure which changes the engine is most sensitive to.
2. Select one, right-click, and choose 'find similar'.


OBSERVED RESULT
The system may return a subset of the 'duplicate' variants, or all, or none, depending on whichever variable the engine is most sensitive to. Photos that are very easily detected by tools such as Image Dedup are missed.

EXPECTED RESULT
While I don't expect the similarity engine to detect duplicates that are mirrored or rotated, I would hope it could detect duplicate images where, for example, one is in its original color and another is geometrically identical but uses a different compression type. And I would certainly expect it to catch much more minor variations, such as a small percentage change in contrast or brightness. It does seem to catch some, but there are instances where I have 3-4 variants of the same image, which appear near identical to a human, that the system misses and that conventional dedup software like AllDup finds rapidly. AllDup doesn't even have the advantage of starting with a known source: it does an all-to-all comparison, which is CPU intensive enough that checking via multiple hash routines might not be feasible there. For a simple 'find similar' routine on a single photo, it should be.

ADDITIONAL INFORMATION
I understand that the position of the devs here may be that this is for true duplicate removal, where the photos are identical but have different names, or where one is raw and another is in a lossless compression conversion. But the reality is that any type of merge of legacy content may bring in stuff that is fundamentally identical to the eye. If the point of a similarity range is to permit variance in color or crop or contrast etc, it seems to not be nearly as effective as it should be. 

I wonder if it might make sense to use several hash types for the initial fingerprinting, since open source modules for pHash, dHash, aHash, etc. are available. The search area could perhaps also be user selectable, full image vs. center only, so that cropping would have less impact. Since this is a one-to-many test rather than a many-to-many test, it would remain quick for the user, but it would be much more capable (read: miss far fewer of what most of us would consider duplicate images).
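
To make the suggestion above concrete, here is a minimal sketch of two of the hash types mentioned (aHash and dHash) plus an optional center-only region. It is an illustration under stated assumptions, not digiKam's fingerprint code: the function names (`resize_mean`, `center_region`, `ahash`, `dhash`, `hamming`) and the `keep` parameter are hypothetical, the image is a plain 2D list of grayscale values, and the input is assumed to be at least as large as the hash grid.

```python
def resize_mean(pixels, out_w, out_h):
    """Downscale a 2D grayscale grid by averaging rectangular blocks.

    Assumes the input is at least out_w x out_h so no block is empty.
    """
    in_h, in_w = len(pixels), len(pixels[0])
    out = []
    for oy in range(out_h):
        y0, y1 = oy * in_h // out_h, (oy + 1) * in_h // out_h
        row = []
        for ox in range(out_w):
            x0, x1 = ox * in_w // out_w, (ox + 1) * in_w // out_w
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def center_region(pixels, keep=0.8):
    """Return the central part of the grid (hypothetical 'center' search area)."""
    h, w = len(pixels), len(pixels[0])
    dy, dx = int(h * (1 - keep) / 2), int(w * (1 - keep) / 2)
    return [row[dx:w - dx] for row in pixels[dy:h - dy]]


def ahash(pixels, size=8):
    """Average hash: one bit per cell, set when the cell is brighter than the mean."""
    small = resize_mean(pixels, size, size)
    flat = [v for row in small for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v > mean else 0 for v in flat]


def dhash(pixels, size=8):
    """Difference hash: one bit per cell, set when the cell is brighter than its right neighbor."""
    small = resize_mean(pixels, size + 1, size)
    return [1 if row[x] > row[x + 1] else 0 for row in small for x in range(size)]


def hamming(a, b):
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))


# Synthetic 64x64 test image and a contrast/brightness-adjusted copy.
image = [[(x * x + 3 * y * y) % 120 for x in range(64)] for y in range(64)]
adjusted = [[2 * v + 10 for v in row] for row in image]

print("dHash distance:", hamming(dhash(image), dhash(adjusted)))
print("aHash distance:", hamming(ahash(image), ahash(adjusted)))
print("center dHash bits:", len(dhash(center_region(image))))
```

Both hashes only compare values against each other or against their mean, so a uniform gain and offset (with no clipping) leaves every bit unchanged and both distances come out as 0. Real contrast tools clip and apply curves, so small non-zero distances would be expected in practice.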

Comment 1 Maik Qualmann 2024-11-14 06:53:36 UTC

I can't reproduce the problem. Are you aware that when you save a modified image, a similarity fingerprint is not created immediately?
And you have to update the fingerprints?

Maik

Comment 2 Roland 2024-11-14 09:41:42 UTC

Maik-
Yes, I'm aware of both of these conditions.

I believe it is possible that my database has invalid flags. That is, the rebuild thumbnails request seems to skip files as already having fingerprints, including the problematic ones (per the progress dialog, folders with known duplicates were skipped). I copied a folder with this problem to another folder, renamed it, and ran a rescan, and now the copied files respond to a deduplication and/or similarity check.

I'll do some additional cross-checking.

Best
Rob

Comment 3 Maik Qualmann 2024-11-14 19:17:36 UTC

Can you provide 2 sample images where no similarity can be detected even at 50%? If not public, send it to my private email.

Maik

Comment 4 Maik Qualmann 2024-11-16 17:00:43 UTC

Thanks for the sample images. The main problem is that the two images are cropped differently. Fingerprints are not an AI function that recognizes the content of the image. Yes, you have to go down to about 45%, depending on which image you are comparing. But even in my collection of 50,000 images, only these two images are shown as similar. I don't see any bug in this.
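
As a rough illustration of why a crop difference can lower a grid-based fingerprint match so much, here is a toy one-dimensional sketch. It is not digiKam's fingerprint algorithm, and the percentage it prints is not how digiKam computes its similarity range; `block_means`, `fingerprint`, and `similarity_percent` are hypothetical names used only for this example.

```python
def block_means(row, n):
    """Average a 1D row of pixel values into n roughly equal blocks."""
    w = len(row)
    means = []
    for i in range(n):
        start, end = i * w // n, (i + 1) * w // n
        means.append(sum(row[start:end]) / (end - start))
    return means


def fingerprint(row, n=8):
    """One bit per block: set when the block is brighter than the average block."""
    means = block_means(row, n)
    avg = sum(means) / n
    return [1 if m > avg else 0 for m in means]


def similarity_percent(a, b):
    """Share of matching bits between two fingerprints, as a percentage."""
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


# A row of 8-pixel-wide light and dark stripes, and the same row with
# the leftmost 8 pixels cropped away.
stripes = [255 if (x // 8) % 2 == 0 else 0 for x in range(64)]
cropped = stripes[8:]

print(similarity_percent(fingerprint(stripes), fingerprint(stripes)))  # 100.0
print(similarity_percent(fingerprint(stripes), fingerprint(cropped)))  # 50.0
```

The crop shifts and rescales the block grid, so the blocks no longer line up with the stripes and half of the bits flip, even though a person would call the two rows the same picture.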

Maik

Comment 5 caulier.gilles 2025-04-11 18:13:40 UTC

Hi,

The 8.7.0 pre-release Windows installer from today has been rebuilt from
scratch with Qt 6.8.3, KDE 6.12, OpenCV 4.11 + CUDA support, Exiv2 0.28.5, ExifTool 13.27, ffmpeg 7, and all image codecs updated to their latest versions (jxl, avif, heif, aom, etc.).

Please try this version to see if your problem is still reproducible.

https://files.kde.org/digikam/

Thanks in advance
Best regards

Gilles Caulier