Bug 195508

Summary: HUB : Syncing IPTC with UTF-8 characters from XMP after conversion to printable ASCII
Product: [Applications] digikam Reporter: Milan Knížek <knizek>
Component: Metadata-Hub Assignee: Digikam Developers <digikam-bugs-null>
Status: RESOLVED FIXED    
Severity: wishlist CC: ahuggel, alan.pater, aspotashev, caulier.gilles, dani, ebayard63-projet, jhaugex, michal, timid3000
Priority: NOR    
Version: 0.10.0   
Target Milestone: ---   
Platform: Ubuntu   
OS: Linux   
Latest Commit: Version Fixed In: 7.1.0
Sentry Crash Report:

Description Milan Knížek 2009-06-06 23:18:21 UTC
Version:           0.10.0 (using KDE 4.2.2)
OS:                Linux
Installed from:    Ubuntu Packages

The original IPTC standard allows only printable ASCII characters.

When non-ASCII (UTF-8) characters are used in Digikam (e.g. in author, copyright, keywords), they are synced to IPTC incorrectly: most unsupported characters are replaced by a question mark, while some characters still survive (I assume those defined in the ISO-8859-1 / Latin1 set).

I would assume that non-ASCII text should be transliterated to its ASCII equivalent, if possible.

See the screenshot here:
http://www.milan-knizek.net/files/tmp/digikam_01.png

It shows both UTF-8 console and Digikam output and also the iconv command for transliteration.

(Ignore the repeated keyword "Kašpárek" in IPTC displayed by Digikam, this seems to be another bug reported by someone else earlier.)
Comment 1 caulier.gilles 2009-06-07 10:21:57 UTC
Milan,

This is the code :

http://lxr.kde.org/source/KDE/kdegraphics/libs/libkexiv2/libkexiv2/kexiv2iptc.cpp#357

The constraint is below : 

QString::toAscii() : http://doc.trolltech.com/4.5/qstring.html#toAscii

It is part of the Qt4 API.

Gilles Caulier
Comment 2 Mikolaj Machowski 2009-06-07 12:55:35 UTC
According to Metadata Working Group guidelines data should be written back to IPTC in UTF-8.

http://www.metadataworkinggroup.org/pdf/mwg_guidance.pdf
page 28:
 If the IPTC-IIM has not been written in UTF-8 before, a robust Changer SHOULD
convert all properties to UTF-8 and write the corresponding identifier for UTF-8 to the 1:90 DataSet.
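
A minimal sketch of what that guidance amounts to at the byte level, assuming the standard IIM dataset layout (tag marker 0x1C, record number, dataset number, 2-byte big-endian length, then the value). The helper names `iim_dataset` and `keywords_block` are hypothetical and are not part of digiKam, libkexiv2 or Exiv2; a real IPTC block carries more datasets than shown here.

```python
import struct

UTF8_ESCAPE = b"\x1b%G"  # ESC % G, the ISO 2022 sequence that announces UTF-8


def iim_dataset(record: int, dataset: int, value: bytes) -> bytes:
    """Serialize one standard IIM dataset (hypothetical helper)."""
    if len(value) > 0x7FFF:
        raise ValueError("extended datasets are not handled in this sketch")
    # 0x1C tag marker, record number, dataset number, 2-byte big-endian length
    return struct.pack(">BBBH", 0x1C, record, dataset, len(value)) + value


def keywords_block(keywords: list[str]) -> bytes:
    """Write 1:90 (Coded Character Set), then each keyword as UTF-8 in 2:25."""
    out = iim_dataset(1, 90, UTF8_ESCAPE)
    for kw in keywords:
        out += iim_dataset(2, 25, kw.encode("utf-8"))
    return out


block = keywords_block(["Kašpárek"])
print(block.hex(" "))
```

The point of the sketch is only that the keyword bytes stay UTF-8 and the 1:90 dataset tells a reader so; nothing is converted to Latin1 or ASCII.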
Comment 3 Milan Knížek 2009-06-07 21:47:42 UTC
Gilles,

thanks for the explanation.

Not being a programmer, I assume that it would be easier to change Digikam to use UTF-8 for IPTC as proposed by Mikolaj, than to change the Qt4 API.

In the meantime, I stick with pure ASCII text in XMP, since I want to have it synced with IPTC, at least for the foreseeable future.
Comment 4 Michal Thoma 2009-06-13 23:03:55 UTC
I am not a programmer, but in the linked Qt4 documentation I read:

--
If a codec has been set using QTextCodec::setCodecForCStrings(), it is used to convert Unicode to 8-bit char; otherwise this function does the same as toLatin1().
--

This seems to be what is happening: instead of converting to ASCII, it converts characters to Latin1 (and thus leaves some characters that are illegal in IPTC).
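
The effect can be illustrated outside Qt. The sketch below uses Python's own Latin-1 encoder as a stand-in for the toLatin1() fallback; that it mirrors Qt's behaviour exactly is an assumption, but it reproduces the symptom described in this report (some characters become "?", others survive as 8-bit bytes).

```python
text = "Kašpárek"

# Latin-1 has no "š" (U+0161), so it is replaced by "?";
# "á" (U+00E1) exists in Latin-1 and survives as the single byte 0xE1.
latin1 = text.encode("latin-1", errors="replace")
print(latin1)

# Any byte above 0x7F is outside 7-bit ASCII, so the result is not pure ASCII.
non_ascii = [b for b in latin1 if b > 0x7F]
print(non_ascii)
```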

My RAW converter RawTherapee crashes because of illegal chars present in IPTC fields... Hope to see this resolved somehow.
Comment 5 Jostein Hauge 2010-03-04 22:16:16 UTC
I can add that this issue creates problems when exporting pictures from Digikam to Gallery (http://gallery.menalto.com/). Tags containing non-English characters become corrupted.

Is this bug causing the problem, or should Gallery accept UTF-8 encoded IPTC?
Comment 6 caulier.gilles 2010-03-05 08:29:47 UTC
Definitely, IPTC does not accept UTF-8. Use XMP instead, which supports it.

Gilles Caulier
Comment 7 Milan Knížek 2010-03-05 21:50:59 UTC
The trouble is that the UTF-8 strings are converted to Latin1 and some characters are corrupted. This does not seem to be a bug in Qt4; it is a feature of the above-mentioned function.

Is it possible to use some other convert-to-7bit-ASCII function that takes care of transliteration, like iconv?
Comment 8 Jostein Hauge 2010-03-06 01:21:45 UTC
While waiting for a real solution, is there an easy way to write a script that converts the strings to ASCII without losing the non-English characters?
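
One possible shape for such a script, as a sketch only: it decomposes each character with Unicode normalization and drops the accents, which is not the same algorithm as iconv's transliteration. The function name `to_ascii` is hypothetical, and characters without a decomposition (for example "ø" or Chinese characters) still end up as the placeholder.

```python
import unicodedata


def to_ascii(text: str, placeholder: str = "?") -> str:
    """Approximate non-ASCII characters by stripping diacritics (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accent, keep the base letter
        # Anything still outside 7-bit ASCII has no simple equivalent here.
        out.append(ch if ord(ch) < 128 else placeholder)
    return "".join(out)


print(to_ascii("Kašpárek"))
print(to_ascii("Milan Knížek"))
```

The result would be applied to the XMP text before it is synced to IPTC; the XMP itself can keep the original characters.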
Comment 9 Kévin FERRARE 2011-01-25 07:02:15 UTC
IPTC can support UTF-8 with the CodedCharacterSet tag
Comment 10 caulier.gilles 2011-01-25 07:34:19 UTC
No. IPTC does not officially support UTF-8 in its specification. XMP does. It's not the same... This is why XMP was created by Adobe (it's not the only problem of course; there is also the string length limitation in IPTC).

Gilles Caulier
Comment 11 Marcel Wiesweg 2012-06-24 13:35:12 UTC
Coming back to this file, there are some questions for Andreas:
Indeed exiv2 seems to be doing some charset detection in the IPTC implementation, with detectCharset returning "UTF-8" or "ASCII".
- are the returned std::strings from the IptcData in this encoding?
- what would a return value of 0 tell us?
- writing: are the std::strings added to IPTC data expected to be in the same encoding?
- is there a way to set/convert the encoding, possibly with the Coded Character Set 1:90 tag as mentioned in the MWG guidance, or is this left to the application (read all strings, convert them, set "Iptc.Envelope.CharacterSet" to the cryptic "\033%G" value, whatever that is)?
(I believe we don't want to do that though, but rather write IPTC as 7-bit ASCII everywhere)
Comment 12 caulier.gilles 2013-11-27 13:17:36 UTC
Andreas, 

Did you see the previous comment from Marcel?

Gilles Caulier
Comment 13 caulier.gilles 2015-05-15 21:36:49 UTC
Alan,

We are missing feedback from Andreas on this entry. See Marcel's questions in comment #11.

thanks in advance

Gilles
Comment 14 Alan Pater 2015-05-15 23:47:06 UTC
I can't answer for Andreas, but my understanding is that UTF-8 is allowed and optional in IPTC-IIM. My own tests within exiv2 show that Unicode characters are preserved when syncing between XMP and IPTC. I probably missed some cases though, as I was not explicitly looking for cases where they were not. I don't think converting is needed. If Unicode exists in XMP, it can be preserved in IPTC.

This is way over my head technically, but the IPTC spec (version 3, October 1995) says:

1:90 Coded Character Set
Optional, not repeatable, up to 32 octets, consisting of the
escape control character, and graphic characters.
One or more escape sequences for the announcement of the
code extension facilities used in the data which follows, for the
initial designation of the G0, G1, G2 and G3 graphic character
sets and the initial invocation of the graphic set (7 bits) or the
left-hand and the right-hand graphic set (8 bits) and for the initial
invocation of the C0 (7 bits) or of the C0 and the C1 control
character sets (8 bits) in use for data fields in records 2-6 and 8.
Follows the ISO 2022 standard. The recognised graphic
repertoire and control function repertoire are listed in Appendix
C.
The announcement of the code extension facilities, if
transmitted, must appear in this data set. Designation and
invocation of graphic and control function sets (shifting) may be
transmitted anywhere where the escape and the other
necessary control characters are permitted. However, it is
recommended to transmit in this data set an initial designation
and invocation, i.e. to define all designations and the shift status
currently in use by transmitting the appropriate escape
sequences and locking-shift functions.
If 1:90 is omitted, the default for records 2-6 and 8 is ISO 646
IRV (7 bits) or ISO 4873 DV (8 bits). Record 1 shall always use
ISO 646 IRV or ISO 4873 DV respectively.
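
Read from the consumer side, the last paragraph suggests roughly the following decision. This is a sketch under assumptions: only the UTF-8 announcement (ESC % G) is recognised, every other case is treated as the 7-bit ISO 646 IRV default, and `decode_iim_text` is a hypothetical name, not an Exiv2 or digiKam function.

```python
from typing import Optional

UTF8_ESCAPE = b"\x1b%G"  # ESC % G, the ISO 2022 sequence that announces UTF-8


def decode_iim_text(raw: bytes, coded_character_set: Optional[bytes]) -> str:
    """Decode one record 2 text value according to the 1:90 dataset, if present."""
    if coded_character_set == UTF8_ESCAPE:
        return raw.decode("utf-8")
    # Assumption: without a recognised 1:90 value, treat the data as 7-bit
    # ISO 646 IRV (ASCII) and mark every other byte as undecodable (U+FFFD).
    return raw.decode("ascii", errors="replace")


raw = "Kašpárek".encode("utf-8")
print(decode_iim_text(raw, UTF8_ESCAPE))
print(decode_iim_text(raw, None))
```

A reader that ignores 1:90 sees the same bytes as undecodable, which is consistent with the corrupted tags reported earlier in this entry.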
Comment 15 caulier.gilles 2016-03-07 07:21:52 UTC
This entry is eligible for the GSoC 2016 project:

https://community.kde.org/GSoC/2016/Ideas#Project:_digiKam_MetadataHub_improvements
Comment 16 Eric Bayard 2019-02-23 09:39:31 UTC
As stated by Mikolaj Machowski, UTF-8 is explicitly and clearly part of the IPTC standard.

Just checking the IPTC standards group website will confirm it.

http://www.metadataworkinggroup.org/pdf/mwg_guidance.pdf

The Metadata Working Group is the group that now sets the IPTC and XMP standards.

All the photo metadata software I know of is able to deal with IPTC written in UTF-8 (Adobe, Windows, exiftool, ...).
This is a big problem for Digikam, as all Chinese, ... characters get "corrupted".
This is the only thing that keeps me locked in to Lightroom.

I would love to see this change.
regards
Comment 17 Eric Bayard 2019-02-23 10:38:01 UTC
Actually, this issue is related to other reports and has been reported several times, e.g.:
issues number 195508, 370558, 379050, 379581
Comment 18 caulier.gilles 2020-08-29 20:20:35 UTC
For digiKam 7.1.0, IPTC in digiKam will fully support UTF-8.

Bugs #379581, #379050, #370558 are now closed.

Gilles Caulier