Bug 451524 - synchronizing metadata from database to files should not modify files whose metadata have not changed
Status: RESOLVED DUPLICATE of bug 411244
Alias: None
Product: digikam
Classification: Applications
Component: Maintenance-Metadata
Version First Reported In: 7.4.0
Platform: Other Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Digikam Developers
 
Reported: 2022-03-15 12:11 UTC by Jonathan Kamens
Modified: 2023-05-02 06:09 UTC

Description Jonathan Kamens 2022-03-15 12:11:21 UTC
When you tell the maintenance tool to synchronize metadata from the database to the image files, it writes semantically meaningless changes to the metadata in some cases, e.g., changing the order of names in the "Subject" field, and updates the timestamp on every single image file even if the file is 100% identical in content, i.e., no changes were made to the metadata.

This is terrible behavior, for three reasons: (1) it takes MUCH longer to write every single file when most aren't actually modified in any meaningful way; (2) every time you write to a file there's a chance of corruption, so unnecessary writes should be avoided; (3) it causes every single file in the collection to be backed up again, consuming bandwidth and backup space and creating cruft in the backup for those of us who keep multiple revisions of backed-up files.

Since I assume that digiKam is writing the metadata into a temporary file and then replacing the original file with the temporary one (if not, it should definitely be doing that!), it should check while writing the metadata whether anything meaningful was actually changed (keeping in mind that some metadata should be compared without regard to sort order), and if not, just delete the temporary file rather than replacing the original with it.
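
The "write to a temporary file, then replace only if something changed" idea can be sketched as follows. This is a minimal illustration, not how digiKam or Exiv2 actually work: `replace_if_changed`, `write_new`, and `same` are hypothetical names, and the default comparison is a plain byte-for-byte check standing in for a semantic metadata comparison.

```python
import filecmp
import os
import tempfile


def replace_if_changed(path, write_new, same=None):
    """Replace path with a freshly written copy only if the copy differs.

    write_new(tmp_path) is a hypothetical callback that writes the updated
    file to tmp_path. same(old_path, new_path) decides whether the two files
    are equivalent; by default this is a byte-for-byte comparison.
    Returns True if the original was replaced, False if it was left untouched.
    """
    if same is None:
        same = lambda old, new: filecmp.cmp(old, new, shallow=False)
    # Create the temporary file next to the original so that the final
    # rename stays on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        write_new(tmp)
        if same(path, tmp):
            return False  # original untouched, so its timestamp is preserved
        os.replace(tmp, path)
        return True
    finally:
        # Remove the temporary file if it was not renamed over the original.
        if os.path.exists(tmp):
            os.unlink(tmp)
```

A real implementation would pass a `same` function that ignores ordering in list-valued fields, and would also need to carry over the original file's permissions, which this sketch does not do.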
Comment 1 caulier.gilles 2022-03-15 12:17:05 UTC
Hi,

All metadata operations (read/write) are done by the libexiv2 shared library, not digiKam directly.

And yes, Exiv2 creates a temporary file when changes are made to image files.

Can you compare the metadata of an image file where nothing was changed with that of the same file after it was touched? You can use ExifTool for this task. This will identify which information is modified in this case.

Best

Gilles Caulier
Comment 2 Maik Qualmann 2022-03-15 12:30:34 UTC
More or less a duplicate bug report of Bug 411244.

Maik
Comment 3 Jonathan Kamens 2022-03-15 12:32:27 UTC
(In reply to caulier.gilles from comment #1)
> All metadata operations (read/write) are done by the libexiv2 shared library,
> not digiKam directly.

I mean, yes, I get that, but isn't digiKam telling libexiv2 what data to write into the file? Could it not use libexiv2 to read what's already in the file, compare it to what is in the database, and only modify the file if there are actual differences?
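
A rough sketch of that "read, compare, write only on real differences" check, under stated assumptions: metadata is represented here as plain dicts rather than through the libexiv2 API, `needs_write` and `normalize` are hypothetical names, and the set of order-insensitive fields is only an example.

```python
from collections import Counter

# Fields whose values are treated as unordered. This set is an assumption
# for illustration, based on the reordered "Subject" names in this report.
UNORDERED_FIELDS = frozenset({'Subject', 'Keywords'})


def normalize(field, value):
    # Compare unordered list fields as multisets so that reordering alone
    # does not count as a change.
    if field in UNORDERED_FIELDS and isinstance(value, (list, tuple)):
        return Counter(value)
    return value


def needs_write(db_meta, file_meta):
    """Return True if the file's metadata differs meaningfully from the DB."""
    if db_meta.keys() != file_meta.keys():
        return True
    return any(normalize(k, v) != normalize(k, file_meta[k])
               for k, v in db_meta.items())
```

With this, `needs_write({'Subject': ['a', 'b']}, {'Subject': ['b', 'a']})` is false, so the file would be skipped, while any added, removed, or changed value still triggers a write.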
 
> Can you compare the metadata of an image file where nothing was changed with
> that of the same file after it was touched? You can use ExifTool for this task.
> This will identify which information is modified in this case.

Yes, I could do that, but I think any solution to this problem that is implemented needs to be more comprehensive than just "handle the fields that some guy on the internet listed in a bug ticket."

To fix the 6,000 files that were modified that didn't need to be in my collection last night, I wrote this Python script to figure out which files were actually substantively different. This will show you at the very least the fields that were unnecessarily modified in _my_ case (note, in particular, the `strip_ignored` and `fix_values` functions), but I can't claim that this covers everything:

```
#!/usr/bin/env python3

# Calls exiftool on two files. Reads the results and does a semantic
# comparison. Displays any differences and exits with non-zero status if there
# are differences. Ignores exiftool output lines that I've empirically
# determined are not reflective of substantive changes.

import copy
import pprint
import re
import subprocess
import sys
import xml.etree.ElementTree as ET


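def exiftool_get(path):
    # Run exiftool and parse its "Key : value" output into a dict. A key that
    # appears more than once is collected into a list of values.
    # Metadata may contain non-ASCII text, so decode as UTF-8 and replace
    # undecodable bytes rather than failing.
    result = subprocess.run(('exiftool', path), encoding='utf-8',
                            errors='replace', capture_output=True, check=True)
    values = {}
    for line in result.stdout.strip().split('\n'):
        parts = re.split(r'\s*:\s*', line, maxsplit=1)
        if len(parts) != 2:
            continue  # not a "Key : value" line
        key, value = parts
        if (not value) or (value == '(none)'):
            continue
        if key in values:
            if isinstance(values[key], list):
                values[key].append(value)
            else:
                values[key] = [values[key], value]
        else:
            values[key] = value
    return values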


def strip_ignored(exif):
    exif = {k: v for k, v in exif.items()
            if k not in ('Directory',
                         'File Modification Date/Time',
                         'File Access Date/Time',
                         'File Inode Change Date/Time',
                         'File Permissions', 'File Size',
                         'Region Applied To Dimensions H',
                         'Region Applied To Dimensions Unit',
                         'Region Applied To Dimensions W',
                         'Current IPTC Digest')}
    return exif


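def fix_values(exif):
    # Normalize order-insensitive fields so that reordering alone is not
    # reported as a difference.
    exif = copy.deepcopy(exif)
    for k in ('Tags List', 'Subject', 'Catalog Sets', 'Last Keyword XMP',
              'Keywords', 'Hierarchical Subject'):
        if k in exif:
            # A tag repeated in the exiftool output is stored as a list.
            values = exif[k] if isinstance(exif[k], list) else [exif[k]]
            exif[k] = tuple(sorted(item for value in values
                                   for item in re.split(r'\s*,\s*', value)))
    if 'Categories' in exif:
        root = ET.fromstring(exif['Categories'])
        for category in root:
            # Elements without text sort as the empty string.
            category[:] = sorted(category,
                                 key=lambda child: child.text or '')
        exif['Categories'] = ET.tostring(root)
    return exif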


def main():
    file1 = sys.argv[1]
    file2 = sys.argv[2]
    exif1 = exiftool_get(file1)
    exif2 = exiftool_get(file2)
    exif1 = strip_ignored(exif1)
    exif2 = strip_ignored(exif2)
    exif1 = fix_values(exif1)
    exif2 = fix_values(exif2)
    only1 = {}
    only2 = {}
    different = {}
    for k, v1 in exif1.items():
        if k not in exif2:
            only1[k] = v1
        elif v1 != exif2[k]:
            different[k] = (v1, exif2[k])
    for k, v2 in exif2.items():
        if k not in exif1:
            only2[k] = v2
    if not (only1 or only2 or different):
        return 0
    if only1:
        print(f'Only in {file1}:')
        pprint.pprint(only1)
    if only2:
        print(f'Only in {file2}:')
        pprint.pprint(only2)
    if different:
        print('Different:')
        pprint.pprint(different)
    return 1


if __name__ == '__main__':
    sys.exit(main())
```
Comment 4 caulier.gilles 2023-05-02 06:09:26 UTC

*** This bug has been marked as a duplicate of bug 411244 ***