Bug 430975 - Add basic de-duplication like dupeGuru
Status: RESOLVED DUPLICATE of bug 261831
Alias: None
Product: digikam
Classification: Unclassified
Component: Searches-Similarity
Version: 7.2.0
Platform: Other All
Importance: NOR wishlist
Target Milestone: ---
Assignee: Digikam Developers
Reported: 2020-12-30 11:30 UTC by stievenard.david
Modified: 2021-03-30 03:18 UTC

Attachments
dupeguru 1 (16.85 KB, image/png)
2020-12-31 05:12 UTC, stievenard.david
dupeguru 2 (16.81 KB, image/png)
2020-12-31 05:13 UTC, stievenard.david

Description stievenard.david 2020-12-30 11:30:12 UTC
SUMMARY

I often end up with a lot of duplicates:
- I have to do "manual" dumps of photos from several iPhone/Android devices onto several laptops when I can, and I copy all of that to a NAS.
- I reduced the problem on Android with 'syncthing', but chat apps like WeChat, WhatsApp... naturally create duplicates (and some of them wipe any metadata in the process...).
- I never remember which date I should start from, so, to be safe, I always make those dumps cover too wide a time range.
- I'm definitely not alone in having this problem. I can't count how many times I have recommended dupeGuru to people who had the exact same problem and are now solving it successfully on Windows.


For now I solve this problem with dupeGuru, which has a simple feature: set up a master directory (i.e. my main "to_sort" directory), then search for and delete full duplicates in other (configurable) directories. It doesn't show me anything; it just deletes the duplicates in those other directories so I can move all the remaining pictures into the main "to_sort" directory and start my sorting workflow.
https://dupeguru.voltaicideas.net/
https://github.com/arsenetar/dupeguru
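To make the request concrete, here is a minimal Python sketch of the behaviour described above. It is not dupeGuru's or digiKam's actual implementation: the function names `file_digest` and `purge_duplicates` are hypothetical, and "full duplicate" is assumed to mean byte-identical content, compared by SHA-256. Files whose metadata was stripped by a chat app would therefore not match.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def purge_duplicates(master: Path, others: Iterable[Path],
                     dry_run: bool = True) -> List[Path]:
    """Find files under `others` whose content also exists under `master`.

    Files under `master` are never deleted. Matching files are deleted
    only when dry_run is False. Returns the list of matching paths.
    """
    master = master.resolve()
    # Index the content of the master directory once.
    master_digests = {file_digest(p) for p in master.rglob("*") if p.is_file()}

    matched: List[Path] = []
    for root in others:
        for p in sorted(root.rglob("*")):
            if not p.is_file():
                continue
            # Safety net: never touch anything located inside master.
            if master in p.resolve().parents:
                continue
            if file_digest(p) in master_digests:
                matched.append(p)
                if not dry_run:
                    p.unlink()
    return matched
```

Calling `purge_duplicates(Path("to_sort"), [Path("dump_phone1")])` only lists the matches; passing `dry_run=False` deletes them. Hashing every file is the simplest approach; with tens of thousands of pictures, comparing file sizes first would avoid most of the hashing.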


Why do I think this feature should be in digiKam?
- Ubuntu packages must follow each version of the OS, and the maintainer of dupeGuru cannot keep up: it was not easy to install on 19.10 and now I can't get it to work on 20.04. One could say that I should go back to Windows to avoid that, but I would rather not leave my dear Kubuntu and all the wonders I discovered there.
- I prefer using one piece of software to handle all picture management needs, and digiKam is THE one that controls them all, and it's just wonderful.
- The current deduplication function in digiKam is more sophisticated and does not match my need. It seems to be designed to go through your 'final' collection, find photos that are identical or very similar, and let you choose. What I need is something to help me erase thousands of duplicates before I gather all of them in my main directory, rename them by date and start my classification workflow.


Don't hesitate to contact me if you need any additional explanations, screenshots, videos!
Comment 1 Maik Qualmann 2020-12-30 12:56:42 UTC
I don't see what dupeGuru can help us with. It is also written in Python. The duplicate search in digiKam can solve this in the same way. How would I proceed? Create a scrap album and copy all images into it from all devices. Update the fingerprints in the duplicate search, select the scrap album as the only album. Set a range of perhaps 90/98-100% and let it search. Check and press the Remove duplicate button - done. You could add the reference albums and do a search again.

Maik
Comment 2 stievenard.david 2020-12-31 05:11:48 UTC
Hi, thank you for taking the time to read me and explain.

- "I don't see what dupeGuru can help us with. It is also written in Python"

I used dupeGuru as an example of a design that is simple and works for this need (even for inexperienced users). I'll attach screenshots of it that might be clearer than my explanations.


- "Create a scrap album and copy all images into it // Update // select the scrap album as the only album // Check and press the Remove duplicate"

I don't have duplicates within my scrap album alone; I have duplicates in my scrap album compared to my existing collection of already cleaned albums. I need to compare what's in my scrap album against my collection and delete any duplicates in the scrap directory in one shot, no check, no regrets.
In other words, I need to be able to run a search, designate some 'master' albums that should not be touched, and delete anything that is a duplicate and not in the master albums.


- "You could add the reference albums and do a search again."

I've checked again: there is indeed a dropdown menu 'restriction' with the choices 'none', 'restricted to reference album' and 'exclude reference album'. It could actually be the answer to this need, but I checked the documentation and digiKam itself and still can't figure out:
- What does it do? Restrict what?
- How do I designate one reference album or a group of reference albums?
Comment 3 stievenard.david 2020-12-31 05:12:36 UTC
Created attachment 134411
dupeguru 1
Comment 4 stievenard.david 2020-12-31 05:13:39 UTC
Created attachment 134412
dupeguru 2
Comment 5 Maik Qualmann 2020-12-31 07:22:15 UTC
The setting of reference albums is not implemented. We already have a bug report for this. The current selection then refers to the reference item found and restricts it to its album.
But you should also be able to solve your task with the current duplicate search in digiKam, because you may have an image in your new folder that is better than the one in your reference albums.

Maik
Comment 6 stievenard.david 2021-01-04 10:09:34 UTC
My scrap album has 25000+ pictures and is growing.
I can use digiKam to do surgical deduplication, but with this many pictures I just need to purge the scrap directory automatically.

Is there anything I can do to help raise interest in this feature request?
Comment 7 stievenard.david 2021-01-07 08:39:11 UTC
I solved my problem for now with a Docker container of dupeGuru. That is definitely not end-user friendly, but it works.

https://github.com/jlesage/docker-dupeguru
Comment 8 stievenard.david 2021-03-30 03:18:00 UTC

*** This bug has been marked as a duplicate of bug 261831 ***