Bug 159220

Summary: Non printable characters in IPTC keyword set by digiKam (UTF-8 support with IPTC metadata) [patch]
Product: [Applications] digikam Reporter: Jean-Marc Liotier <jm>
Component: Metadata-Iptc Assignee: Digikam Developers <digikam-bugs-null>
Status: RESOLVED FIXED    
Severity: normal CC: caulier.gilles, colin, ebayard63-projet, kde_bugs, msylwester
Priority: NOR    
Version: 0.9.2   
Target Milestone: ---   
Platform: Ubuntu   
OS: Linux   
Latest Commit: Version Fixed In: 4.0.0
Sentry Crash Report:
Attachments: Added/changed IPTC keywords conversion to/from utf8
My test case
Use UTF-8 when reading/writing IPTC, set charset to UTF-8.

Description Jean-Marc Liotier 2008-03-13 01:02:53 UTC
Version:           0.9.2-final (using KDE 3.5.8)
Installed from:    Ubuntu Packages
OS:                Linux

Example at http://gallery.ruwenzori.net/main.php/v/travel/SenegalPaulineVelo2008/20080228_113839_1454_SenegalPaulineVelo.jpg.html :

Look at the "IPTC: Keywords" field in the "Photo Properties" details. You'll notice that all tags look like "Transport/bicycle�;", with the "white question mark on a black diamond" character just before the semicolon. All those tags were generated using Digikam.

Counter example at http://gallery.ruwenzori.net/main.php/v/misc/debugging/20080301_113020_1640_SenegalPaulineVelo-bis.jpg.html :

Look at the same field. The ones with the same problem were also generated using Digikam. The clean ones were added using exiv2 from the command line.

Several hypotheses are credible:
- Digikam produces perfectly good tags but Gallery 2 somehow misinterprets them. This seems to be invalidated by the clean tags produced by exiv2, which is a reasonably trusted implementation.
- Digikam produces tags with an illegal non-printable character.

Note that this is not a browser character set problem : whether I force UTF-8 or ISO-8859-1 as the character encoding for rendering the page, the special character is always there.
Comment 1 caulier.gilles 2008-03-13 06:25:39 UTC
digiKam always uses the ASCII character set to generate IPTC text fields.

Gilles Caulier
Comment 2 Jean-Marc Liotier 2008-03-13 11:23:37 UTC
So you think that this is rather a rendering problem on the Gallery 2 side ?
Comment 3 caulier.gilles 2008-03-13 11:26:14 UTC
Yes...

Gilles
Comment 4 caulier.gilles 2008-03-13 11:34:32 UTC
Or, if you used the GalleryExport kipi-plugin, it may be a problem in that tool...

Colin, what do you think about this?

Gilles
Comment 5 Jean-Marc Liotier 2008-03-13 11:39:16 UTC
I did not use the GalleryExport kipi-plugin: I copied the files to the server through scp and imported them into Gallery 2 from there.
Comment 6 Jean-Marc Liotier 2008-03-13 11:39:33 UTC
I opened a ticket in the Gallery 2 project. I'll let you know how it goes.
http://sourceforge.net/tracker/index.php?func=detail&aid=1913414&group_id=7130&atid=107130
Comment 7 caulier.gilles 2008-12-04 20:54:13 UTC
Jean Marc,

Is there any news about this report?

Gilles Caulier
Comment 8 Andrea Diamantini 2008-12-09 00:29:16 UTC
Probably fixed with commit 882181. Can you please try again and let me know?
Comment 9 Ian Hubbertz 2009-01-28 23:03:20 UTC
Maybe it's a similar problem to bug 149029?
Comment 10 caulier.gilles 2009-06-09 12:17:05 UTC
Jean Marc,

digiKam > 0.9.x supports XMP. XMP replaces IPTC and supports UTF-8. IPTC has never supported UTF-8 and has several limitations on string size. XMP does not have these limitations.

The Gallery server must support XMP by default and use it instead of IPTC.

Gilles Caulier
Comment 11 Mikolaj Machowski 2009-06-09 17:10:22 UTC
True, IPTC is becoming obsolete but it is still used in $BIGNUMBER of programs and digiKam should be able to play with it nicely.
Comment 12 caulier.gilles 2009-07-20 11:17:32 UTC
Jean Marc,

In the Gallery Export plugin coming with kipi-plugins 0.4.0, I have fixed the code to preserve the metadata of exported and resized images.

Can you test with this version ?

Gilles Caulier
Comment 13 caulier.gilles 2011-12-16 16:34:30 UTC
Jean Marc,

Did you see my comment #12?

Gilles Caulier
Comment 14 Eric Bayard 2014-01-06 14:17:12 UTC
(In reply to comment #10)
> Jean Marc,
> 
> digiKam > 0.9.x support XMP. XMP replace IPTC and support UTF-8. IPTC has
> never supported UTF-8 and have several limitation over strings size. XMP do
> not have these limitation. 
> 
> Gallery server must support XMP by default and use it as well instead IPTC.
> 
> Gilles Caulier

Hi Gilles,
Actually this is wrong. UTF-8 has officially been part of the IPTC standard since 1997 (XMP was first used in 2001 by Adobe in Acrobat).

You can find a lot of publications on this subject, but the best approach is to check the standard directly on the IPTC website. Note that the latest IPTC standards are based on the XMP implementation, but this is not what we are discussing here.

example: 
http://www.gwww.wan-ifra.org%2Fsystem%2Ffiles%2Ffield_ifra_mag_file%2FF_tp980258.pdf&ei=BrXKUq_HEYGshQeBmoHwBg&usg=AFQjCNHAaCBNHuKLvObVXCLL-ZlWs4TrTQ&sig2=JdQvFsrXcaKJOOz-ieanoA&bvm=bv.58187178,d.ZG4
 
or quoted from: http://www.iptc.org/std/IIM/4.1/specification/IIMV4.1.pdf  (year 1999)

"Coded Character Set

Optional, not repeatable, up to 32 octets, consisting of one or more control functions used for the announcement, invocation or designation of coded character sets. The control functions follow the ISO 2022 standard and may consist of the escape control character and one or more graphic characters. For more details see Appendix C, the IPTC-NAA Code Library.

The control functions apply to character oriented DataSets in records 2-6. They also apply to record 8, unless the objectdata explicitly, or the File Format implicitly, defines character sets otherwise.

If this DataSet contains the designation function for Unicode in UTF-8 then no other announcement, designation or invocation functions are permitted in this DataSet or in records 2-6.

For all other character sets, one or more escape sequences are used...."

or from the Metadata Working Group, which sets the standards and of course uses them:
www.metadataworkinggroup.com/pdf/mwg_guidance.pdf, page 28 (note that the whole section is very relevant to digiKam, as it covers metadata reconciliation guidance)

"IPTC-IIM SHOULD be written using the Coded Character Set (1:90) as UTF-8 (see “Section 1.6 Coded Character Set” in the IIM specification).

If the IPTC-IIM has not been written in UTF-8 before, a robust Changer SHOULD convert all properties to UTF-8 and write the corresponding identifier for UTF-8 to the 1:90 DataSet..."

In a word, digiKam is not standard-compliant.
This makes it incompatible with all other image management software, such as Adobe's.
It also corrupts our metadata, which really is a shame because it is still a really good piece of software.

Regards
Eric
Comment 15 caulier.gilles 2014-01-06 14:24:39 UTC
Well, don't forget that digiKam does not write IPTC metadata to the file itself. Everything is delegated to Exiv2. We pass a UTF-8 string to Exiv2, which chooses the best way to store the data in the right format.

If Exiv2 supports this IPTC feature, why not. I am pretty sure that Exiv2 doesn't do it.

Please report this problem to Exiv2 bugzilla first.

Gilles Caulier
Comment 16 Michal Sylwester 2014-02-28 05:07:13 UTC
I had a similar problem to the original one: after a metadata sync from images to the DB, I was getting a lot of tags with all non-ASCII characters converted to "?", in addition to the correct ones. exiftool helped me trace this to the IPTC tags, hence here I am.

I tried to find out where exactly the characters are replaced by "?", and it seems to be KExiv2::setIptcKeywords, which uses toLatin1(), which works exactly this way. Changing it to use UTF-8 made my test case pass. When I made digikam use the modified lib it wrote the tags correctly, but they were still displayed garbled in the tags panel.
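
A minimal sketch of the lossy conversion described in this comment, written in Python rather than the Qt C++ used by libkexiv2. Python's "replace" error handler is only a stand-in for toLatin1(), and both the keyword and the function name are hypothetical.

```python
def to_latin1_lossy(text: str) -> bytes:
    """Stand-in for a Latin-1 conversion that cannot fail.

    Characters outside Latin-1 are replaced by "?", which is the
    behaviour reported for toLatin1() in this comment.
    """
    return text.encode("latin-1", errors="replace")


# Hypothetical keyword mixing Latin-1 and non-Latin-1 characters.
keyword = "Podróże/Łódź"

lossy = to_latin1_lossy(keyword)
lossless = keyword.encode("utf-8")

# "ó" survives because it exists in Latin-1; "ż", "Ł" and "ź" do not.
print(lossy.decode("latin-1"))
# The UTF-8 bytes decode back to the original keyword.
print(lossless.decode("utf-8"))
```

The information lost in the first conversion cannot be recovered when the file is read back, which would be consistent with the "?" tags appearing after a sync.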
Comment 17 Michal Sylwester 2014-02-28 05:18:28 UTC
Created attachment 85355 [details]
Added/changed IPTC keywords conversion to/from utf8
Comment 18 Michal Sylwester 2014-02-28 05:21:48 UTC
Created attachment 85356 [details]
My test case

With modified library input/output is consistent with using exiftool from command line.
Comment 19 Michal Sylwester 2014-03-03 06:17:02 UTC
Created attachment 85394 [details]
Use UTF-8 when reading/writing IPTC, set charset to UTF-8.

I did a little more digging, and it seems that to store UTF-8 in IPTC it is necessary to set IPTC:CodedCharacterSet to "\33%G" (that is, ESC % G), otherwise it's still ASCII. I've updated my patch to always read and save as UTF-8. The ASCII->UTF-8 conversion when reading should be safe. Every time a field is changed the charset is set to UTF-8, so writing should be fine as well. I think it's overkill, but checking all the characters every time is perhaps even worse.

Apparently, other charsets could also be supported, but I couldn't find any details...
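
For illustration, a rough Python sketch of what the raw IIM bytes could look like when the 1:90 DataSet announces UTF-8 ahead of the keywords. It assumes the standard short DataSet layout (tag marker 0x1C, record number, DataSet number, two-byte big-endian length) and uses 2:25 for keywords. The helper names are hypothetical, and this is not the code from the attached patch, which goes through Exiv2.

```python
import struct

# ESC % G: the ISO 2022 designation for UTF-8, i.e. the "\33%G" value above.
UTF8_ESCAPE = b"\x1b%G"


def iim_dataset(record: int, dataset: int, value: bytes) -> bytes:
    """Encode one standard (short-form) IIM DataSet."""
    if len(value) > 0x7FFF:
        # Longer values need the extended DataSet form, not covered here.
        raise ValueError("value too long for a standard DataSet")
    header = struct.pack(">BBBH", 0x1C, record, dataset, len(value))
    return header + value


def utf8_keywords_block(keywords):
    """Emit 1:90 (Coded Character Set) followed by one 2:25 per keyword."""
    out = iim_dataset(1, 90, UTF8_ESCAPE)
    for kw in keywords:
        out += iim_dataset(2, 25, kw.encode("utf-8"))
    return out


# Hypothetical keyword, printed as hex for inspection.
print(utf8_keywords_block(["Łódź"]).hex(" "))
```

The sketch does not enforce per-field length limits or handle extended DataSets; it only shows where the character set marker sits relative to the keyword data.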
Comment 20 caulier.gilles 2014-03-03 08:50:12 UTC
Git commit e4cfee882303b50f17e6301a9fa7e00ab821336b by Gilles Caulier.
Committed on 03/03/2014 at 08:47.
Pushed by cgilles into branch 'master'.

Review and apply patch #85394 from Michal Sylwester about to support UTF-8 encoding/decoding with IPTC metadata.
Tested with non UTF8 IPTC image. Char still decoded as ASCII if Iptc.Envelope.CharacterSet is not present.
FIXED-IN: 4.0.0

M  +72   -37   libkexiv2/kexiv2iptc.cpp

http://commits.kde.org/libkexiv2/e4cfee882303b50f17e6301a9fa7e00ab821336b
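
The read-side behaviour noted in the commit message (UTF-8 when the character set is announced, ASCII otherwise) can be sketched roughly as follows in Python. The function name is hypothetical, and replacing undecodable bytes with U+FFFD is an assumption of this sketch, not something the commit message describes.

```python
from typing import Optional

# ESC % G: the value of Iptc.Envelope.CharacterSet that announces UTF-8.
UTF8_ESCAPE = b"\x1b%G"


def decode_iptc_text(raw: bytes, character_set: Optional[bytes]) -> str:
    """Decode an IPTC text field according to the envelope character set."""
    if character_set == UTF8_ESCAPE:
        return raw.decode("utf-8", errors="replace")
    # No (or unrecognised) character set: fall back to ASCII.
    # Assumption: undecodable bytes become U+FFFD instead of raising.
    return raw.decode("ascii", errors="replace")


# Hypothetical field values.
print(decode_iptc_text("Łódź".encode("utf-8"), UTF8_ESCAPE))
print(decode_iptc_text(b"Transport/bicycle", None))
```

Under these assumptions, UTF-8 bytes read without the 1:90 marker would come out as replacement characters rather than as the intended text.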