Bug 262452 - duplicate uniqueHash (image hash) in database, wrong thumb on images
Summary: duplicate uniqueHash (image hash) in database, wrong thumb on images
Status: RESOLVED FIXED
Alias: None
Product: digikam
Classification: Applications
Component: Database-Thumbs (show other bugs)
Version: 1.7.0
Platform: Ubuntu Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Digikam Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-01-07 21:29 UTC by Elle Stone
Modified: 2022-01-12 13:16 UTC (History)
2 users (show)

See Also:
Latest Commit:
Version Fixed In: 7.5.0


Description Elle Stone 2011-01-07 21:29:32 UTC
Version:           1.7.0 (using KDE 4.4.5) 
OS:                Linux

One raw file was processed multiple times by ufraw and output as tifs with different names. The resulting images are visually very different renditions, and they all have different md5sums when checked with md5sum at the command line.

In the digikam database, most of the renditions have the wrong thumb. So I created a test database with only 8 images: 2 raw files, one tiff from one of the raw files, several tiffs (visually very different from each other) from the other raw file, and one jpeg from the raw file (probably not produced by ufraw). In digikam4.db there are 8 entries in the Images table, 5 of which have the same uniqueHash. In thumbnails-digikam.db there are only 4 thumbs.

Right-clicking on the thumbs and selecting "edit" does open the correct image file, as does opening the preview.

So I used ufraw to produce 3 tifs and 2 jpegs from the other raw file. The jpegs got different uniqueHashes, the tifs all share the same uniqueHash, giving me 13 images in the database, and only 7 uniqueHashes.

Reproducible: Always

Steps to Reproduce:
Put a raw file into a directory. Open the raw file with ufraw and produce a tif. Do this a couple more times, making the images look wildly different so there is no question that they differ. Save each one under a different name. Then open digikam and rescan the directory (or import a new collection if it is a different root).

Actual Results:  
Use SQLite database browser to inspect the digikam data and thumbs databases. You'll see an entry in the Images table for each tif, but they'll all share the same uniqueHash. Initially the images may or may not have different thumbs, but after some browsing the thumbs will collapse, so that all the images with the same uniqueHash end up with the same thumb.
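The duplicates described above can also be found with a single GROUP BY query against the Images table. Here is a minimal, hypothetical Python sketch: the table name and uniqueHash column come from the report, but the mock schema and file names are invented for illustration and are far simpler than digikam4.db's real schema.

```python
import sqlite3

# Build a tiny in-memory stand-in for digikam4.db's Images table.
# The real table has many more columns; only the two relevant ones appear here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Images (name TEXT, uniqueHash TEXT)")
con.executemany("INSERT INTO Images VALUES (?, ?)", [
    ("render-a.tif", "aaaa"),   # three tifs sharing one hash,
    ("render-b.tif", "aaaa"),   # as in the reported behaviour
    ("render-c.tif", "aaaa"),
    ("photo.jpg", "bbbb"),      # a jpeg with its own hash
])

# List every hash that occurs more than once, with its count.
dupes = con.execute(
    "SELECT uniqueHash, COUNT(*) FROM Images "
    "GROUP BY uniqueHash HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [('aaaa', 3)]
```

The same query can be run as-is in SQLite database browser against a real digikam4.db to count colliding hashes.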

Expected Results:  
I'd expect each tif rendition/version of the original raw file, saved under a different name, to have a truly unique uniqueHash and its own correct thumb.

jpegs from ufraw don't seem to have this problem. I haven't checked other tif-producing software (but I will). Using exiftool to inspect a couple of the ufraw-produced tifs, it looks like ufraw 0.16 copies all the raw file data over to the tiff, so all the metadata in the two images looks (at a quick glance) identical. If uniqueHash depends on the metadata, that could be the source of the problem.

As MD5 itself is subject to hash collisions, it seems to me that in a large image database, using only part of the image to calculate MD5 hashes is not such a good idea, even apart from the current issue. As already stated, the actual md5 hashes of the images, as calculated by md5sum at the command line, are all different. (A move to sha1 over the whole image would probably be overkill, and probably I don't know enough about hashes to even make these statements.)
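The failure mode described here is not an MD5 collision in the cryptographic sense but a consequence of hashing only part of the file. The following minimal Python sketch (not digiKam's actual algorithm; the 8 KiB prefix and the size mixing are assumptions for illustration) shows how two files with identical leading bytes but different image data collide under a partial-file hash while full-file hashes still differ:

```python
import hashlib

# Two mock "files" with identical metadata headers but different image data,
# like two ufraw tifs that carry the same copied-over raw metadata.
header = b"IDENTICAL-METADATA-BLOCK" * 512          # ~12 KiB of shared header
file_a = header + b"image-data-rendition-A" * 1000
file_b = header + b"image-data-rendition-B" * 1000

def partial_hash(data, n=8192):
    # Hash only the first n bytes plus the file size -- a scheme similar
    # in spirit to a partial-file hash, chosen here purely for illustration.
    md5 = hashlib.md5()
    md5.update(data[:n])
    md5.update(str(len(data)).encode())
    return md5.hexdigest()

def full_hash(data):
    # Hash the entire file contents, as md5sum does at the command line.
    return hashlib.md5(data).hexdigest()

print(partial_hash(file_a) == partial_hash(file_b))  # True: prefixes match
print(full_hash(file_a) == full_hash(file_b))        # False: contents differ
```

This is why stripping the (shared) metadata restores unique hashes: the hashed prefix then starts with bytes that actually differ between renditions.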
Comment 1 Marcel Wiesweg 2011-01-07 23:39:33 UTC
Thanks a lot for your research, indeed this is a problem, known and solved (for the future).

1) This usually happens with TIFF images without metadata. The header of such files contains several kilobytes of (pretty useless) line offsets. I have not seen a JPEG that is affected.

2) Computing the hash over the whole file is a major performance problem - scanning would take much longer. The old hash covered 99.9% of cases; we'll see what the new algorithm brings.

3) Some other problems in the context of renaming are probably unrelated.

*** This bug has been marked as a duplicate of bug 210353 ***
Comment 2 Elle Stone 2011-01-08 00:03:42 UTC
Hi Marcel,

Regarding "this is a problem, known and solved (for the future). 1)
This happens usually with TIFF images without metadata":

In fact the affected images, tiffs output by UFRaw 0.16 and 0.17, have
a LOT of metadata: all the metadata that was in the raw file (.cr2).
If one were to use exiftool to add e.g. copyright information, keywords,
contact information, location, etc. to one's raw files (which I do, in
fact), there could be a whole lot of metadata in a raw file.

Suspecting that a wealth of metadata could be the problem, I used
exiftool to strip out all the metadata in the UFRaw-produced tiffs,
and when I added the stripped tiffs to the digikam database, the
stripped tiffs all had unique hashes and proper thumbs.

Is the version of digikam with the future fix available somewhere?

Elle Stone

Comment 3 caulier.gilles 2011-01-08 10:16:59 UTC
Elle, 

Because Marcel is currently working on the Google Summer of Code 2010 branch, I think it's fixed in 2.0.0.

Gilles
Comment 4 Elle Stone 2011-01-08 13:21:35 UTC
Gilles, thanks. Can 2.0.0 be run alongside rather than in place of
current digikam?

Elle

Comment 5 Marcel Wiesweg 2011-01-08 15:42:53 UTC
1.x does not know the new hash, so it will not open the database once you have converted it to use the new hash with 2.0. For this reason you need to convert explicitly: there is an Update button at the bottom of the Database panel in the Settings dialog. Without this conversion, both versions can operate on the same db, but your problem is not fixed.
Comment 6 caulier.gilles 2022-01-12 13:16:05 UTC
Fixed with https://bugs.kde.org/show_bug.cgi?id=210353