Bug 319001

Summary: Smart detection of whether a file has already been downloaded
Product: [Applications] digikam Reporter: Cristian Klein <cristiklein>
Component: Import-Gphoto2 Assignee: Digikam Developers <digikam-bugs-null>
Status: REPORTED ---    
Severity: wishlist CC: caulier.gilles, kde_org, konrad.kostecki, nicofo, tpr
Priority: NOR    
Version: 7.3.0   
Target Milestone: ---   
Platform: unspecified   
OS: All   
Latest Commit: Version Fixed In:
Sentry Crash Report:

Description Cristian Klein 2013-04-27 23:33:13 UTC
From using digiKam 2.5.0 on Ubuntu 12.10 and reading the latest Git source code, I deduce that, when importing photos from a UMS device, digiKam uses the download history to display whether a file has already been downloaded or not. This is inconvenient in several cases:

(1) A user just started using digiKam: all photos are shown as "new".
(2) A user changed the way she downloads pictures, e.g., she switched from PTP to using a card reader.
(3) A user might import photos through several ways, e.g., sometimes directly from the camera, sometimes from a "holiday" laptop.

Wouldn't it be possible for digiKam to more intelligently detect if photos have already been downloaded?

I was thinking of the following solution. First, for each photo (local or to-be-imported) compute a unique ID by reading the EXIF data. Newer cameras already add a "unique photo ID" EXIF tag. For older cameras, one may compute a unique picture ID using a combination of camera make, camera model and file name. If I understood correctly, this would also conform to digiKam's philosophy of using its database only to accelerate operations, without storing any data that could not be found in the files themselves.
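The ID scheme proposed above could be sketched roughly as follows. This is only an illustration, not digiKam code: `ExifMap` is a hypothetical stand-in for whatever metadata reader is used, the key names follow the Exiv2 naming convention, and the `uid:` / `mmn:` prefixes are made up here to keep the two kinds of ID apart.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a metadata reader: Exif key -> string value.
using ExifMap = std::map<std::string, std::string>;

// Returns the value stored under key, or an empty string if the tag is absent.
static std::string tagOrEmpty(const ExifMap& exif, const std::string& key)
{
    const auto it = exif.find(key);
    return it == exif.end() ? std::string() : it->second;
}

// Derives an ID that does not depend on how the photo reached the computer.
// Prefers the camera-written unique ID; otherwise falls back to
// make + model + file name, which is much weaker (it breaks on renames).
std::string photoId(const ExifMap& exif, const std::string& fileName)
{
    const std::string unique = tagOrEmpty(exif, "Exif.Photo.ImageUniqueID");
    if (!unique.empty())
    {
        return "uid:" + unique;
    }

    const std::string make  = tagOrEmpty(exif, "Exif.Image.Make");
    const std::string model = tagOrEmpty(exif, "Exif.Image.Model");
    return "mmn:" + make + "|" + model + "|" + fileName;
}
```

With the unique tag present, the ID survives a rename; with the fallback, it does not, which is the main limitation of the older-camera path.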

Any thoughts on this? If this makes sense, I could try dedicating some time to develop the feature myself.

Reproducible: Always

Steps to Reproduce:
(For example)
1. Start digiKam
2. Import some photos from a USB mass storage device
3. Exit digiKam
4. Delete its database (but not the photos it has just imported).
5. Start digiKam
6. Open the import window for the same USB mass storage device
Actual Results:  
All photos are marked as new

Expected Results:  
The photos that have been downloaded at step 2 should be detected as already downloaded.
Comment 1 Marcel Wiesweg 2013-04-28 16:46:36 UTC
The problem you encounter sooner or later (with gphoto cameras sooner than with UMS cameras) is that the time you need to compute the hash, by accessing the Exif data, will be disproportional to the gained functionality. Regarding the use of make, model and name, let's have a look at the DownloadHistory database header file:
    /**
     * Queries the status of a download item that is uniquely described by the four parameters.
     * The identifier is recommended to be an MD5 hash of properties describing the camera,
     * if available, and the directory path (though you are free to use all four parameters as you want)
     */
    static Status status(const QString& identifier, const QString& name,
                         qlonglong fileSize, const QDateTime& date);

For me all points are very minor problems. Yes, we could make wild guesses that pictures on the camera were already downloaded based on some parameters, yet the file name is not useful as there can be renames, the file size is not useful as metadata may have been edited, and the date alone is by far too weak.
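To make the exact-match behaviour of such a four-parameter lookup concrete, here is a minimal in-memory sketch. It is not the digiKam implementation: `InMemoryHistory`, `markDownloaded`, the `Status` enum and the use of epoch seconds for the date are all assumptions made for illustration, using only the standard library.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <tuple>

// Illustrative status values; not digiKam's actual enum.
enum class Status { NotDownloaded, Downloaded };

// Hypothetical in-memory history keyed on the four parameters:
// identifier, name, file size, date (here: seconds since the epoch).
class InMemoryHistory
{
public:
    void markDownloaded(const std::string& identifier, const std::string& name,
                        std::int64_t fileSize, std::int64_t date)
    {
        m_seen.emplace(identifier, name, fileSize, date);
    }

    // A hit requires all four values to match exactly.
    Status status(const std::string& identifier, const std::string& name,
                  std::int64_t fileSize, std::int64_t date) const
    {
        const Key key{identifier, name, fileSize, date};
        return m_seen.count(key) ? Status::Downloaded : Status::NotDownloaded;
    }

private:
    using Key = std::tuple<std::string, std::string, std::int64_t, std::int64_t>;
    std::set<Key> m_seen;
};
```

In this sketch, changing any single value (a different name, a different size, or a different source identifier) turns a previously recorded item into a miss.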
Comment 2 Cristian Klein 2013-04-28 18:35:13 UTC
Hi Marcel,

Let me address your comments inline.

On 2013-04-28 18:46, Marcel Wiesweg wrote:
> The problem you encounter sooner or later (with gphoto cameras sooner than with
> UMS cameras) is that the time you need to compute the hash, by accessing the
> Exif data, will be disproportional to the gained functionality.

I'm not sure I agree with this. When importing photos through UMS, the
user is presented with a preview of each photo, so very likely the EXIF
tag is already read in by digiKam. Even if the EXIF tag is for some
reason not read by digiKam (e.g., using seek), the kernel will cache
whole disk blocks (usually 4KB in size), therefore, reading the EXIF tag
would have a minimal performance impact. I have already presented
several use-cases when smart "already-downloaded" detection would help,
so I don't find the cost disproportional.
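As a rough illustration of why this can be cheap for JPEG files on UMS: the Exif data normally lives in an APP1 segment near the start of the file, so only the head of the file needs to be read. The sketch below is not digiKam code; the function names and the 64 KiB limit are assumptions, and it only locates the Exif payload without parsing any tags.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <optional>
#include <string>
#include <vector>

// Reads at most maxBytes from the start of a file (assumed limit: 64 KiB).
std::vector<unsigned char> readHead(const std::string& path,
                                    std::size_t maxBytes = 64 * 1024)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> buf(maxBytes);
    in.read(reinterpret_cast<char*>(buf.data()),
            static_cast<std::streamsize>(buf.size()));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// Walks the JPEG segment list in `head` and returns the offset of the
// TIFF header that follows the "Exif\0\0" signature, if one is found.
std::optional<std::size_t> findExifPayload(const std::vector<unsigned char>& head)
{
    static const unsigned char kExifSig[6] = {'E', 'x', 'i', 'f', 0, 0};

    if (head.size() < 4 || head[0] != 0xFF || head[1] != 0xD8)
        return std::nullopt;                      // no JPEG start-of-image marker

    std::size_t pos = 2;
    while (pos + 4 <= head.size())
    {
        if (head[pos] != 0xFF)
            return std::nullopt;                  // not at a marker: give up

        const unsigned char marker = head[pos + 1];
        if (marker == 0xDA || marker == 0xD9)
            return std::nullopt;                  // image data or end of image

        // Segment length is big-endian and includes its own two bytes.
        const std::size_t len = (std::size_t(head[pos + 2]) << 8) | head[pos + 3];
        if (len < 2)
            return std::nullopt;                  // malformed segment

        if (marker == 0xE1 && len >= 8 && pos + 10 <= head.size() &&
            std::memcmp(&head[pos + 4], kExifSig, 6) == 0)
        {
            return pos + 10;                      // first byte after the signature
        }

        pos += 2 + len;                           // skip to the next segment
    }
    return std::nullopt;
}
```

A caller would pass `readHead(path)` to `findExifPayload` and hand the resulting region to a real Exif parser; for gphoto cameras the cost picture may be different, since the data has to come over PTP.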

I'm not sure what would be the performance impact for gphoto cameras.
Isn't the EXIF metadata read in anyway as part of image preview?

> Regarding the
> use of make, model and name, let's have a look at the DownloadHistory database
> header file:
>     /**
>      * Queries the status of a download item that is uniquely described by the
> four parameters.
>      * The identifier is recommended to be an MD5 hash of properties describing
> the camera,
>      * if available, and the directory path (though you are free to use all
> four parameters as you want)
>      */
>     static Status status(const QString& identifier, const QString& name,
>                          qlonglong fileSize, const QDateTime& date);

For UMS, "identifier" depends on the media ID and not on the photo
metadata. Therefore, if I receive the same photo through two sources,
DownloadHistory will mark the photo incorrectly as
not-previously-downloaded. For me, this is cumbersome.

> For me all points are very minor problems. Yes, we could make wild guesses that
> pictures on the camera were already downloaded based on some parameters, yet
> the file name is not useful as there can be renames, the file size is not useful as
> metadata may have been edited, and the date alone is by far too weak.

I agree that for legacy cameras, this might be difficult. However, like
I wrote, newer cameras include a "unique photo ID" (something like a
UUID) in the EXIF tags of each photo. Users might already have access to
such cameras (I do), why not take advantage of it?
Comment 3 Teemu Rytilahti 2013-12-30 00:03:45 UTC
I think EXIF is already read for all the photos, at least partially at some point, so this could be possible. If there's wide support for this, we could use that as a hash and fall back to our current calculation. However, I wasn't able to find any photos in my collection with anything this unique. Do you have some samples?
Comment 4 Maik Qualmann 2019-10-15 20:31:11 UTC
*** Bug 412999 has been marked as a duplicate of this bug. ***
Comment 5 caulier.gilles 2020-08-04 16:42:07 UTC
digiKam 7.0.0 stable release is now published:

https://www.digikam.org/news/2020-07-19-7.0.0_release_announcement/

We need fresh feedback on this file using this version.

Best Regards

Gilles Caulier
Comment 6 Maik Qualmann 2021-04-13 11:02:33 UTC
*** Bug 435680 has been marked as a duplicate of this bug. ***
Comment 7 Kokos 2021-04-13 11:10:51 UTC
(In reply to caulier.gilles from comment #5)
> We need fresh feedback on this file using this version.

I'd vote that this is still very nice to have. The problem described by Cristian in 2013 is still valid.

I'll also allow myself to copy my input from the duplicate ticket:

> There is already a setting in the camera/import behaviour section to skip/replace/create_copy files in case they already exist in the target location. Can we make this more reliable, so that a file is classified as already existing based not only on its filename but also on other attributes, at least file size?

> I can imagine a situation where you reset digiKam, clean your home directory or reinstall the operating system, and with a fresh digiKam instance you import an SD card which contains files already downloaded. The fingerprint history is empty, but the files are already in the target location. By default, digiKam import just creates duplicates with a different name. If, just before downloading, it recognized that the filenames and sizes are identical, it could suggest skipping those files, provided there were an option for that. I think it would be handy. I'm not saying it should be the default setting, definitely not, but in some cases it could be very helpful.
Comment 8 kde_org@reepie.nl 2025-01-02 01:35:16 UTC
My take on this: the `import images` dialog indicates that `This item has never been downloaded` while the specified image file is known (e.g. in an album).
I have noticed that the `DownloadHistory` table only contains a fraction (approx. 900) of the number of actual images (approx. 120K). This album was added as a collection from removable media. 
The download option is therefore confusing: after processing, all duplicate images are still indicated as `.. never been downloaded`. As the feedback is also vague (the progress window is quickly removed), it is unclear what has actually been done.

I would expect all those images to be known as `downloaded` at the start, but I would certainly expect them to be indicated as such _after_ processing. The files _are_ identical in name, size, and date, so I see no reason why they could not be added to the `DownloadHistory`.
Because of this, the option to `Download new` will always try to download everything when duplicates are offered, which can be time-consuming and wasteful.

v8.5.0 on macOS 15.2