Bug 283013

Summary: Accelerating writing metadata back to image files
Product: [Applications] digikam Reporter: Gerhard Kulzer <gerhardk>
Component: Metadata-Engine Assignee: Digikam Developers <digikam-bugs-null>
Status: RESOLVED UPSTREAM    
Severity: wishlist CC: ahuggel, althio.forum, aspotashev, axel.krebs, caulier.gilles, toddrme2178
Priority: NOR    
Version: 2.1.1   
Target Milestone: ---   
Platform: Compiled Sources   
OS: Linux   
Latest Commit: Version Fixed In: 7.5.0
Sentry Crash Report:
Bug Depends on: 188925    
Bug Blocks:    

Description Gerhard Kulzer 2011-09-29 07:15:57 UTC
Version:           unspecified (using KDE 4.7.0) 
OS:                Linux

I'm not sure how the files are updated when writing metadata back to images (probably a kioslave task). My suspicion is that the whole file is written back and not just the metadata part.
When I back up my images with rsync (using the Luckybackup GUI) after I've modified metadata with digikam, the update seems much quicker than in digikam. And with rsync it is clear that only a small part of every file is being rewritten
(example log:    32.77K   1%   71.75kB/s    0:01:31)

If my understanding is correct, it should be relatively easy to do the same with digikam, just using rsync?

  


Reproducible: Always

Steps to Reproduce:
Update some metadata in digikam, then backup the same data using rsync.

Actual Results:  
rsync update is much faster.

Expected Results:  
accelerate writing metadata back to image files
Comment 1 Marcel Wiesweg 2011-09-29 18:20:10 UTC
This is entirely a matter for Exiv2. I don't even know the details of metadata writing.
Comment 2 caulier.gilles 2011-09-29 20:48:27 UTC
Andreas Huggel from Exiv2 is in CC and can give more details.

Gilles Caulier
Comment 3 Andreas Huggel 2011-09-30 06:13:52 UTC
The Exiv2 write logic is optimized based on the image format and the kind of changes made to the metadata. There are two classes of image formats: TIFF-like images, where the metadata is not in a specific portion of the image but potentially spread over the entire image (image == metadata), and images which keep the metadata in a specific portion of the file (e.g., JPEG, PNG). Changes are classified as either "intrusive" or "non-intrusive". If any metadata tag is added or deleted, or an existing metadata field is extended, the change is intrusive and requires Exiv2 to re-serialize the entire metadata structure. If an existing field is changed and its size is not extended (it can shrink), then Exiv2 makes the change in-place, without rewriting the entire metadata. This has the considerable advantage that the TIFF structure stays intact, even if Exiv2 can't parse it. A typical example of a non-intrusive change is changing the Exif date/time of an image.

Writing works as follows:

                 intrusive    non-intrusive
                 ------------ -------------
TIFF-like      : copy         mmap
Metadata block : copy         copy*

"copy" means the file is re-written and re-named (its size changes)
"mmap" means the file is changed in-place (the file size remains the same)

* In this case, the metadata structure is changed in-place but the file is copied and in the process, the new metadata block is inserted.
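
As a rough illustration of the table above, the decision could be sketched as follows. This is not Exiv2 code: the names FormatClass, is_intrusive and write_strategy, and the idea of describing metadata as a mapping from tag name to value size, are assumptions made only for this example.

```python
from enum import Enum


class FormatClass(Enum):
    TIFF_LIKE = "TIFF-like"            # metadata may be spread over the whole file
    METADATA_BLOCK = "Metadata block"  # e.g. JPEG, PNG


def is_intrusive(old_sizes: dict, new_sizes: dict) -> bool:
    """Return True if the change adds, deletes, or extends a field.

    Both arguments map a tag name to the byte size of its value
    (a hypothetical representation chosen for this sketch).
    """
    if old_sizes.keys() != new_sizes.keys():
        return True  # a tag was added or deleted
    # A field may keep its size or shrink, but it must not grow.
    return any(new_sizes[tag] > old_sizes[tag] for tag in old_sizes)


def write_strategy(fmt: FormatClass, intrusive: bool) -> str:
    """Look up the write mode from the table in comment 3."""
    if fmt is FormatClass.TIFF_LIKE and not intrusive:
        return "mmap"   # file changed in-place, size stays the same
    if fmt is FormatClass.METADATA_BLOCK and not intrusive:
        return "copy*"  # metadata changed in-place, file still copied
    return "copy"       # file re-written and re-named
```

For example, changing an Exif date/time string to another one of the same length in a TIFF-like file gives is_intrusive(...) == False and therefore "mmap", while adding a new tag always ends in "copy".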

The only further optimization I can see is that in the case of images with a metadata block and non-intrusive changes, it would be possible to change the entire file in-place rather than only the metadata block.
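
To make the idea of an in-place change concrete, here is a minimal sketch using Python's mmap module. It is not how Exiv2 implements it. The function name overwrite_in_place, the assumption that the caller already knows the field's offset and length, and the NUL padding of a shorter value are all hypothetical choices for this example; real formats may require different padding.

```python
import mmap


def overwrite_in_place(path: str, offset: int, old_len: int, new_value: bytes) -> None:
    """Overwrite a field of old_len bytes at offset; the file size is unchanged."""
    if len(new_value) > old_len:
        # A growing field would be an intrusive change and cannot be patched.
        raise ValueError("new value is larger than the existing field")
    # Assumption: a shorter value is padded with NUL bytes to keep the field size.
    padded = new_value.ljust(old_len, b"\x00")
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            if offset < 0 or offset + old_len > len(mm):
                raise ValueError("field lies outside the file")
            mm[offset:offset + old_len] = padded  # same length, so no resize
            mm.flush()
```

Only the pages touched by the slice assignment need to be written back, which is why this kind of update should cost much less than copying the whole image.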

For additional considerations (memory related), see http://dev.exiv2.org/issues/617

How does rsync work? Does it really operate on portions of a file (not only modified files + compression)?

-ahu.
Comment 4 Gerhard Kulzer 2011-09-30 08:05:03 UTC
First, thank you very much Andreas for this detailed explanation, it's good to memorize this one.

Concerning the rsync mechanisms, I found this description on the Wikipedia site of rsync:


"The rsync utility uses an algorithm invented by the Australian computer programmer Andrew Tridgell for efficiently transmitting a structure (such as a file) across a communications link when the receiving computer already has a similar, but not identical, version of the same structure.

The recipient splits its copy of the file into fixed-size non-overlapping chunks and computes two checksums for each chunk: the MD4 hash, and a weaker 'rolling checksum'. (Version 30 of the protocol, released with rsync version 3.0.0, now uses MD5 hashes rather than MD4.) It sends these checksums to the sender.

The sender computes the rolling checksum for every chunk of size S in its own version of the file, even overlapping chunks. This can be calculated efficiently because of a special property of the rolling checksum: if the rolling checksum of bytes n through n + S − 1 is R, the rolling checksum of bytes n + 1 through n + S can be computed from R, byte n, and byte n + S without having to examine the intervening bytes. Thus, if one had already calculated the rolling checksum of bytes 1–25, one could calculate the rolling checksum of bytes 2–26 solely from the previous checksum, and from bytes 1 and 26.

The rolling checksum used in rsync is based on Mark Adler's adler-32 checksum, which is used in zlib, and is itself based on Fletcher's checksum.

The sender then compares its rolling checksums with the set sent by the recipient to determine if any matches exist. If they do, it verifies the match by computing the hash for the matching block and by comparing it with the hash for that block sent by the recipient.

The sender then sends the recipient those parts of its file that did not match the recipient's blocks, along with information on where to merge these blocks into the recipient's version. This makes the copies identical."
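
The rolling property described in the quote can be tried out with a small sketch. It follows the general shape of an Adler/Fletcher-style checksum with two 16-bit sums; it is not guaranteed to match rsync's actual checksum bit for bit, and the names weak_checksum and roll are made up for this example.

```python
MOD = 1 << 16


def weak_checksum(block: bytes):
    """Compute the two running sums (a, b) over a block from scratch."""
    size = len(block)
    a = sum(block) % MOD
    # Each byte is weighted by its distance from the end of the block.
    b = sum((size - i) * x for i, x in enumerate(block)) % MOD
    return a, b


def roll(a: int, b: int, size: int, out_byte: int, in_byte: int):
    """Slide the window by one byte using only the old sums and two bytes."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - size * out_byte + a) % MOD
    return a, b


data = bytes(range(1, 27))          # 26 bytes, as in the 1-25 / 2-26 example
a, b = weak_checksum(data[0:25])    # checksum of the first 25 bytes
rolled = roll(a, b, 25, data[0], data[25])
assert rolled == weak_checksum(data[1:26])
```

The final assert checks that the checksum of the window shifted by one byte, derived only from the previous sums and the bytes leaving and entering the window, equals the checksum computed from scratch.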

There is a longish but nice interview with Andrew Tridgell, the creator of rsync here: http://oceanpark.com/webmuseum/rsync.html

So it works on blocks, which seem to be chunks of 500-1000 bytes (as I read in various sources). Anyway, judging from the logs I get from rsyncing, the change size is usually less than 1% of an image, and that may of course span several blocks.
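
To get a feeling for how little data such a block comparison has to transfer, here is a simplified sketch of the matching step. It is not rsync's implementation: the function literal_bytes, the 700-byte default block size (picked from the 500-1000 byte range mentioned above) and the recomputation of the weak checksum at every position (instead of rolling it) are simplifications assumed for this example.

```python
import hashlib

MOD = 1 << 16


def weak(block: bytes) -> int:
    """Cheap 32-bit checksum built from two 16-bit sums."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return (b << 16) | a


def literal_bytes(old: bytes, new: bytes, block_size: int = 700) -> int:
    """Count the bytes of `new` that cannot be rebuilt from blocks of `old`."""
    # Receiver side: index every aligned block of the old file.
    table = {}
    for off in range(0, len(old) - block_size + 1, block_size):
        blk = old[off:off + block_size]
        table.setdefault(weak(blk), set()).add(hashlib.md5(blk).digest())

    # Sender side: slide a window over the new file and look for known blocks.
    literal = 0
    pos = 0
    while pos + block_size <= len(new):
        window = new[pos:pos + block_size]
        strong = table.get(weak(window))
        if strong and hashlib.md5(window).digest() in strong:
            pos += block_size   # block already exists on the receiver
        else:
            literal += 1        # this byte has to be sent as-is
            pos += 1
    return literal + (len(new) - pos)  # trailing bytes shorter than a block
```

With a same-size change of a few bytes inside one block, only that block's worth of data is counted as literal, which would fit the small percentages seen in the rsync logs.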
Comment 5 caulier.gilles 2011-12-17 10:34:00 UTC
Note: this report depends on #188925 for a few points...

Gilles Caulier
Comment 6 caulier.gilles 2013-11-06 17:05:24 UTC
Note: writing metadata from the Maintenance tool now uses parallelized threads if you have a multi-core CPU. This will slightly increase the speed of writing metadata to files.

But the main problem here, if I'm not mistaken, still lies in the Exiv2 shared library...

Gilles Caulier
Comment 7 caulier.gilles 2013-12-01 23:57:01 UTC
*** Bug 252494 has been marked as a duplicate of this bug. ***
Comment 8 caulier.gilles 2014-08-28 13:12:06 UTC
This report is definitely an UPSTREAM entry which must be reported to the Exiv2 bugzilla, as the low-level writing of metadata to files is handled in the background by Exiv2.

Gilles Caulier
Comment 9 caulier.gilles 2014-08-28 15:49:12 UTC
*** Bug 269467 has been marked as a duplicate of this bug. ***