Bug 159477

Summary: "Robust metadata support", "Exception resistance", "Versioning", "Outside Program Interference"
Product: [Applications] digikam Reporter: Sherwood Botsford <sgbotsford>
Component: Metadata-Versioning    Assignee: Digikam Developers <digikam-bugs-null>
Status: RESOLVED FIXED    
Severity: wishlist CC: caulier.gilles
Priority: NOR    
Version: 0.8.2   
Target Milestone: ---   
Platform: Debian stable   
OS: Linux   
Latest Commit: Version Fixed In: 7.1.0

Description Sherwood Botsford 2008-03-17 16:50:09 UTC
Version:           0.8.2 (using KDE 4.0.0)
Installed from:    Debian stable Packages
OS:                Linux

While digikam has lots of features, I find myself using it with some fear.
As far as I can tell from the docs, dk stores all of its data in a sqlite database.
Given the ease with which a database can be turned into a random number generator, I want belt and
suspenders.

My wishlist for a  photo database:

0.  I want to store up to a million images.

1.  I want to maintain a folder hierarchy.  I want the app to be able to manipulate both
the physical directory structure of the collection (folders) and the virtual structure (albums).
(The inability to manipulate the folder structure led to my discarding digikam the first time I tried it.)

2.  I want to be able to manipulate the folder hierarchy outside of the application.  

3.  I want a view of an album or a folder to optionally include views of subfolders/sub-albums.  E.g.
I have an album called forests, with sub-albums called birch forests, aspen forests, alder forests, spruce forests.
When I click forests, I want to be able to see either just the stuff in the top-level album, or all the
stuff in all the sub-albums.

4.  I want to be able to filter on any combination of date, time of day, description, keywords, categories, who's in the pic,
using and/or/not, and agrep-style near misses.

5.  I want to be able to pick a photo, edit it in some other program, drop a copy in the folder tree
of the database, and have the database realize that it's a copy and that most of the metadata is the same.

6.  I want to be able to set the database a task that will spend the night groveling through
an existing collection and, with reasonable accuracy, tell me that picture B is a cropped, resized, colour-adjusted version of picture A.

7.  I want the database to survive corruption.

Whew!

How can this be done?

A.  All data is written at least twice: once in the database, once in the metadata fields of the
picture itself, and once in a dot file in the directory where the picture lives.  Not all file formats support
writable metadata (e.g. most raw formats, many simple raster formats).
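
To illustrate point A (just a rough sketch using Exiv2, the metadata library digikam already relies on; the exact pointer type returned by ImageFactory::open varies between Exiv2 versions, and the caption text is made up):

    #include <exiv2/exiv2.hpp>
    #include <iostream>

    int main(int argc, char** argv)
    {
        if (argc < 2)
        {
            std::cerr << "usage: " << argv[0] << " <image>\n";
            return 1;
        }

        try
        {
            // Open the image and load whatever metadata it already carries.
            auto image = Exiv2::ImageFactory::open(argv[1]);
            image->readMetadata();

            // Write the same caption into both the Exif and XMP blocks,
            // so other programs can see it too.
            image->exifData()["Exif.Image.ImageDescription"] = "Aspen stand, north quarter";
            image->xmpData()["Xmp.dc.description"]           = "Aspen stand, north quarter";

            image->writeMetadata();   // push the changes back into the file
        }
        catch (const Exiv2::Error& e)
        {
            // Formats with no writable metadata (many raw files) land here;
            // this is exactly where the dot-file fallback would take over.
            std::cerr << "metadata write failed: " << e.what() << '\n';
            return 1;
        }
        return 0;
    }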

B.  Part of the metadata written is a unique ID for the image.  For image formats that support metadata
this allows images to move around the file tree.  Digikam should be able to catch this by monitoring ctime
changes.  For non-writable formats, a hash value of the file can be stored both in the database and in the
directory.  Simple raster formats are almost certainly derivative.  Not sure how best to deal with these.
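
Something like this would do for the hash part (sketch only, written Qt-style since digikam is a Qt application; the function name is made up):

    #include <QCryptographicHash>
    #include <QFile>
    #include <QString>

    // Content fingerprint for a file.  For formats whose metadata cannot be
    // written, this value can be stored in the database and in the
    // per-directory dot file, so a renamed or moved file can still be
    // recognised by its contents.
    QString contentId(const QString& path)
    {
        QFile f(path);
        if (!f.open(QIODevice::ReadOnly))
            return QString();

        QCryptographicHash hash(QCryptographicHash::Sha1);
        hash.addData(&f);                                    // streams the whole file
        return QString::fromLatin1(hash.result().toHex());
    }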

C.  When an image is modified, a suffix is added to the ID.  One of the metadata fields can state how it was
modified if the change was done from within digikam.  If it was done by an external program, digikam prompts for information about
a photo or directory of photos when it discovers them.

D.  I've heard rumours of invariants that, when run on an image, can show similarity/difference.  Essentially the
opposite of a hash function: where the slightest difference gives an entirely different string, invariants give
similar strings for similar but not identical images.  I suspect that creating an invariant for scaling would be
fairly easy, and one for colour transforms wouldn't be too hard.  Ones for cropping would be a lot harder.  I'm pretty sure
that there is no single invariant that works all the time; it would take a bunch to be sure.  This would
need to be part of the housekeeping function.  Where digikam knows they are derivative works, it's easy, and indeed
this can be used as a test bed for finding derivatives produced by outside programs.
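
As a toy example of such an invariant, here is a crude average hash sketched with Qt (the Haar-based approach mentioned later in this report for digikam 0.10 is more sophisticated than this):

    #include <QImage>

    // 8x8 average hash.  Scaling the image down throws away detail, so a
    // resized or mildly re-compressed copy produces the same (or nearly the
    // same) 64-bit fingerprint, while an ordinary file hash would change
    // completely.
    quint64 averageHash(const QImage& src)
    {
        const QImage img = src.convertToFormat(QImage::Format_Grayscale8)
                              .scaled(8, 8, Qt::IgnoreAspectRatio, Qt::SmoothTransformation);

        // Mean brightness of the 64 pixels.
        int sum = 0;
        for (int y = 0; y < 8; ++y)
            for (int x = 0; x < 8; ++x)
                sum += qGray(img.pixel(x, y));
        const int mean = sum / 64;

        // One bit per pixel: brighter than the mean or not.
        quint64 bits = 0;
        for (int y = 0; y < 8; ++y)
            for (int x = 0; x < 8; ++x)
                if (qGray(img.pixel(x, y)) > mean)
                    bits |= quint64(1) << (y * 8 + x);
        return bits;
    }

    // Similar images differ in only a few bits (small Hamming distance).
    int hammingDistance(quint64 a, quint64 b)
    {
        int n = 0;
        for (quint64 d = a ^ b; d != 0; d >>= 1)
            n += int(d & 1);
        return n;
    }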

E.  The multiple locations of metadata give robustness.  If the database is completely corrupt, much of it
can be rebuilt by scanning the images and directories.  If a picture vanishes from a folder, both the metadata and the folder
data show what used to be there.  This could be matched against images that suddenly appear elsewhere in the directory tree.
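
The rebuild pass might look roughly like this (sketch only; the part that parses the dot files and re-inserts rows into the database is left out):

    #include <QDirIterator>
    #include <QString>
    #include <exiv2/exiv2.hpp>

    // Walk the whole collection and read back whatever metadata survived in
    // the image files themselves, in order to repopulate a fresh database.
    void rescanCollection(const QString& rootPath)
    {
        QDirIterator it(rootPath,
                        QStringList() << "*.jpg" << "*.png" << "*.tif",
                        QDir::Files, QDirIterator::Subdirectories);
        while (it.hasNext())
        {
            const QString file = it.next();
            try
            {
                auto image = Exiv2::ImageFactory::open(file.toStdString());
                image->readMetadata();
                // ... re-insert path, caption, keywords, unique ID, hash,
                //     etc. from image->exifData() / image->xmpData() ...
            }
            catch (const Exiv2::Error&)
            {
                // Unreadable or metadata-free file: fall back to the dot file.
            }
        }
    }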

F.  Maintaining an internal directory tree of hard links to images can also help track down outside file moves.
If the move was a copy and delete, then the link count goes to 1.  Digikam can do comparisons among images that
have link counts of 1 to establish a file's new location (a minimal sketch of this link-count test follows the list below).  If it was a move, then the internal hard-link directory points
to the new location, which disagrees with the database.  Digikam updates the database and the folder data.
Additional robustness for metadata-capable files could be had by storing the file's current location in the tree in the file itself.
When digikam moves a file, this is automatically updated.  If an outside program moves the file, it still points to the
old location.  The housekeeper looks for inconsistencies.  If a file has been moved off the file system it gets trickier.
I would propose that digikam keep a list of one-link files.  If digikam can't find an orphaned file, then the single
remaining link is considered 'trash'.  Characteristics of the trash are user-selectable but should include:
* Don't empty the trash until I tell you to.
* Keep everything in the trash for N days.
* Keep the last X GB of images in the trash.
* Fuss at me if the trash is getting too full.
Trash appears as a folder in the database.
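
The link-count test mentioned above is cheap to do (sketch; the private link path is hypothetical):

    #include <sys/stat.h>
    #include <string>

    // If digikam kept a private directory of hard links to every image, an
    // outside "copy and delete" would leave that private link as the file's
    // only remaining link.
    bool isOrphanedLink(const std::string& privateLinkPath)
    {
        struct stat st;
        if (::stat(privateLinkPath.c_str(), &st) != 0)
            return false;                 // the private link itself is gone

        // st_nlink counts how many directory entries point at this inode.
        // 1 means only our private link is left: the original was deleted or
        // moved off the filesystem, so this entry becomes a trash candidate.
        return st.st_nlink == 1;
    }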

This is all quite compute-intensive.  But I'll point out that anyone who is serious about photography has at least
two cores, and maybe as many as 8 cores, working for him.  Keep those other cores busy.  I suspect that, to do all this,
digikam needs to be separated into three programs: a front end, a database daemon, and a housekeeping daemon.
Doing it as three programs allows the housekeeper to be reniced to some non-obnoxious value, so that even on a
single-core machine it doesn't slow to a crawl.
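
For example, the housekeeping daemon could lower its own priority at startup (sketch, POSIX only):

    #include <sys/resource.h>
    #include <cstdio>

    int main()
    {
        // Drop to the lowest scheduling priority before doing any heavy
        // hashing or similarity passes, so the front end stays responsive
        // even on a single-core machine.
        if (setpriority(PRIO_PROCESS, 0, 19) != 0)
            std::perror("setpriority");

        // ... run the long housekeeping passes here ...
        return 0;
    }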


Examples of projects I've done:
I go on a trip.  I come back with 500 images.
1.  Edit with photoshop.  Produce an edited psd for each image.
2.  Export from photoshop to jpeg.
3.  Create 12 sizes of each image ranging from 2024x3032 down to 64x96.  These become
the web page images.

So 500 images have just become 7500 images.  Step 3 would probably happen outside the photo library.
Since it's script-driven, recreating that information is easy.

I go out on the tree farm and take my spring snaps.
1.  I bring back 60 raw format images.
2.  Open each one in photoshop.
3.  Produce one to three cropped images.
4.  Save each cropped photoshop file separately
5.  Batch process the photoshop directory to create a jpeg directory of
all the images in jpg format.
6.  Produce different sizes of jpeg files for use on a web page.

So for each image:
Raw format.
Adjusted full-size PS format.
1-3 cropped PSD files.
2-4 full-resolution JPEGs derived from the PSDs.
Some number of resized images from the JPEGs.
Comment 1 Arnd Baecker 2008-03-17 20:32:48 UTC
Hi Sherwood,

before going into the details: 0.8.2 is outdated by now;
0.9.3 is the current stable version and 0.9.4 will be out soon
(see http://www.digikam.org/?q=about/releaseplan).
Therefore many of the issues you raise are fortunately already solved.

0.) There should be no problem with that.
1.) Digikam's albums directly correspond to directories 
    on the hard disk.
    Virtual albums can be created by using tags.
    (A particular case of virtual albums are the results of
     user-defined searches or the dates view.)
2.) See point 3) in https://bugs.kde.org/show_bug.cgi?id=125736
    Please add any comments there.
3.) Is possible with the current version
4.) Is possible with the current version
5.) Could you please file a separate wish for this one?
6.) Searching for similar pictures will be done for digikam 0.10.
    This is to some extent also related to 
    https://bugs.kde.org/show_bug.cgi?id=125387
7.) A.) With the current digikam version, all metadata can
    be stored both in the database and in the image
    file (if supported).
    B.) hash values will be used for digikam 0.10
    D.) For digikam 0.10 the Haar measure will be used
        to detect similar images.

I would suggest closing this bug because
too many things are already solved.
Maybe you could try the current version and then, 
if you miss a feature, file separate wishes for each 
of them, because otherwise it is difficult to keep track of 
the corresponding discussions and patches.

Thanks a lot, Arnd
Comment 2 caulier.gilles 2008-03-24 10:55:31 UTC
Sherwood,
 
I'm sorry, but I cannot accept a bug report about 0.8.2. This version is too old. If you want a review of the new features implemented in the 0.9.x series, please take a look at this page: http://www.digikam.org/?q=changelog

... you will see that a huge list of improvements has been made (:=)))

Also, do not post more than one major subject in the same report: it's infernal to manage in bugzilla.

Gilles Caulier

Please update to the last stable release
Comment 3 caulier.gilles 2020-08-10 10:14:28 UTC
Since 6.0.0, the Metadata Engine used in the digiKam core is fully wrapped with C++ exceptions, and all Exiv2 calls are protected with a thread lock. I wrote a unit test to check the performance and stability of reading/writing metadata with Exiv2 in a multicore/multithreaded environment. My huge collection of photos (RAW/JPEG/PNG/TIFF, etc.) passes without dysfunction.
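
For illustration, the pattern looks roughly like this (simplified sketch, not the actual digiKam code; the key looked up is just an example):

    #include <exiv2/exiv2.hpp>
    #include <mutex>
    #include <string>

    // Every Exiv2 call goes through a lock and a try/catch, so a corrupt
    // file or concurrent access from several threads cannot take the
    // application down.
    std::string readCaptionSafely(const std::string& path)
    {
        static std::mutex exivMutex;                 // serialises Exiv2 access
        std::lock_guard<std::mutex> lock(exivMutex);

        try
        {
            auto image = Exiv2::ImageFactory::open(path);
            image->readMetadata();
            Exiv2::ExifData& exif = image->exifData();
            auto it = exif.findKey(Exiv2::ExifKey("Exif.Image.ImageDescription"));
            return (it != exif.end()) ? it->toString() : std::string();
        }
        catch (const Exiv2::Error&)
        {
            return std::string();                    // damaged file: fail gracefully
        }
    }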

Gilles Caulier