Version: 0.10.0 (using KDE 4.2.2)
OS: Linux
Installed from: Ubuntu Packages

The original IPTC standard allows only printable ASCII characters. When UTF-8 characters are used in Digikam (e.g. author, copyright, keywords), they are synced to IPTC incorrectly: most unknown characters are replaced by a question mark, while some characters survive (I assume those defined in the ISO-8859-1 / Latin1 set). I would expect non-ASCII text to be transliterated to its ASCII equivalent where possible. See the screenshot here: http://www.milan-knizek.net/files/tmp/digikam_01.png It shows both the UTF-8 console and the Digikam output, and also the iconv command for transliteration. (Ignore the repeated keyword "Kašpárek" in the IPTC displayed by Digikam; this seems to be another bug reported by someone else earlier.)
Milan, This is the code: http://lxr.kde.org/source/KDE/kdegraphics/libs/libkexiv2/libkexiv2/kexiv2iptc.cpp#357 The constraint comes from QString::toAscii(): http://doc.trolltech.com/4.5/qstring.html#toAscii It's the Qt4 API. Gilles Caulier
According to Metadata Working Group guidelines data should be written back to IPTC in UTF-8. http://www.metadataworkinggroup.org/pdf/mwg_guidance.pdf page 28: If the IPTC-IIM has not been written in UTF-8 before, a robust Changer SHOULD convert all properties to UTF-8 and write the corresponding identifier for UTF-8 to the 1:90 DataSet.
Gilles, thanks for the explanation. Not being a programmer, I assume that it would be easier to change Digikam to use UTF-8 for IPTC as proposed by Mikolaj, than to change the Qt4 API. In the meantime, I stick with pure ASCII text in XMP, since I want to have it synced with IPTC, at least for the foreseeable future.
I am not a programmer, but in the linked Qt4 doc I read: -- If a codec has been set using QTextCodec::setCodecForCStrings(), it is used to convert Unicode to 8-bit char; otherwise this function does the same as toLatin1(). -- This seems to be what is happening: instead of converting to ASCII, it converts characters to Latin1 (and thus leaves some characters that are illegal in IPTC). My RAW converter RawTherapee crashes because of illegal characters present in IPTC fields... I hope to see this resolved somehow.
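The behaviour described in this comment can be reproduced outside Qt. The following is a minimal Python sketch, not the libkexiv2 code: it assumes the conversion behaves like a Latin-1 encoder that substitutes a question mark for every character it cannot map, which is what the reports in this thread suggest. The function name `to_latin1_lossy` is made up for illustration.

```python
def to_latin1_lossy(text: str) -> bytes:
    """Encode to ISO-8859-1, replacing unmappable characters with '?'.

    Assumption: this mimics the effect reported in this thread; it is
    not the actual Qt or libkexiv2 implementation.
    """
    return text.encode("latin-1", errors="replace")


keyword = "Kašpárek"
encoded = to_latin1_lossy(keyword)

# 'š' (U+0161) is not in Latin-1 and becomes '?';
# 'á' (U+00E1) is in Latin-1 and survives as the single byte 0xE1.
print(encoded)  # b'Ka?p\xe1rek'

# Bytes above 0x7F are the ones a strict 7-bit ASCII reader may reject.
print([hex(b) for b in encoded if b > 0x7F])  # ['0xe1']
```

The result is neither pure ASCII nor UTF-8, which would explain why some characters turn into question marks while others survive as 8-bit values.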
I can add that this issue creates problems when exporting pictures from Digikam to Gallery (http://gallery.menalto.com/). Tags containing non-English characters become corrupted. Is this bug causing the problem, or should Gallery accept UTF-8 encoded IPTC?
Definitely, IPTC does not accept UTF-8. Use XMP instead, which supports it. Gilles Caulier
The trouble is that the UTF-8 strings are converted to Latin1 and some characters are corrupted. This does not seem to be a bug in Qt4; it is a feature of the above-mentioned function. Is it possible to use some other convert-to-7-bit-ASCII function that takes care of transliteration, like iconv?
While waiting for a real solution, is there any easy way to make a script that converts the strings to ASCII without losing the non-English characters?
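One possible shape for such a script, as a rough sketch only: it uses Unicode decomposition from the Python standard library to strip accents, which is a different technique from the iconv transliteration mentioned earlier in this thread. The function name `to_ascii` is hypothetical, and the approach has a known limitation: letters that have no decomposition (for example "ø" or "ł") are dropped rather than replaced.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Approximate ASCII transliteration by stripping combining marks.

    NFKD splits an accented letter into a base letter plus combining
    marks; encoding to ASCII with errors="ignore" then discards the marks
    (and, as a limitation, any character with no ASCII base letter).
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii("Kašpárek"))  # Kasparek
```

The converted strings would still have to be written back into the metadata with a separate tool; that step is not shown here.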
IPTC can support UTF-8 with the CodedCharacterSet tag
No. IPTC does not officially support UTF-8 in its specification. XMP does. It's not the same... This is why XMP was created by Adobe (it's not the only problem, of course; there is also the string length limitation in IPTC). Gilles Caulier
Coming back to this file, there are some questions for Andreas: Indeed, exiv2 seems to be doing some charset detection in the IPTC implementation, with detectCharset returning "UTF-8" or "ASCII".
- Are the std::strings returned from IptcData in this encoding?
- What would a return value of 0 tell us?
- Writing: are the std::strings added to the IPTC data expected to be in the same encoding?
- Is there a way to set/convert the encoding, possibly with the Coded Character Set 1:90 tag as mentioned in the MWG guidance, or is this left to the application (read all strings, convert them, set "Iptc.Envelope.CharacterSet" to the cryptic "\033%G" value, whatever that is)?
(I believe we don't want to do that, though, but rather write IPTC as 7-bit ASCII everywhere.)
Andreas, Did you see the previous comment from Marcel? Gilles Caulier
Alan, We are missing feedback from Andreas in this file. See the question from Marcel in comment #11. Thanks in advance. Gilles
I can't answer for Andreas, but my understanding is that UTF-8 is allowed and optional in IPTC-IIM. My own tests within exiv2 show that Unicode characters are preserved when syncing between XMP and IPTC. I probably missed some cases though, as I was not explicitly looking for cases where it did not. I don't think converting is needed. If Unicode exists in XMP, it can be preserved in IPTC.

This is way over my head technically, but the IPTC spec (version 3, October 1995) says:

1:90 Coded Character Set

Optional, not repeatable, up to 32 octets, consisting of the escape control character, and graphic characters.

One or more escape sequences for the announcement of the code extension facilities used in the data which follows, for the initial designation of the G0, G1, G2 and G3 graphic character sets and the initial invocation of the graphic set (7 bits) or the left-hand and the right-hand graphic set (8 bits) and for the initial invocation of the C0 (7 bits) or of the C0 and the C1 control character sets (8 bits) in use for data fields in records 2-6 and 8. Follows the ISO 2022 standard. The recognised graphic repertoire and control function repertoire are listed in Appendix C.

The announcement of the code extension facilities, if transmitted, must appear in this data set. Designation and invocation of graphic and control function sets (shifting) may be transmitted anywhere where the escape and the other necessary control characters are permitted. However, it is recommended to transmit in this data set an initial designation and invocation, i.e. to define all designations and the shift status currently in use by transmitting the appropriate escape sequences and locking-shift functions.

If 1:90 is omitted, the default for records 2-6 and 8 is ISO 646 IRV (7 bits) or ISO 4873 DV (8 bits). Record 1 shall always use ISO 646 IRV or ISO 4873 DV respectively.
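To make the "\033%G" value mentioned earlier in this thread less cryptic: it is the three-byte ISO 2022 escape sequence ESC % G, which announces UTF-8. The sketch below shows how that value would look as a 1:90 dataset in the raw IIM byte stream. It is an illustration, not the exiv2 or digiKam implementation: the helper `iim_dataset` is hypothetical and only handles the standard (non-extended) dataset layout of a 0x1C tag marker, record number, dataset number and a two-byte big-endian length.

```python
import struct

# ESC % G: the ISO 2022 escape sequence that announces UTF-8.
UTF8_ESCAPE = b"\x1b%G"


def iim_dataset(record: int, dataset: int, data: bytes) -> bytes:
    """Serialize one standard IIM dataset (hypothetical helper).

    Layout: 0x1C marker, record number, dataset number,
    two-byte big-endian length, then the data itself.
    """
    if len(data) > 0x7FFF:
        raise ValueError("extended datasets are not handled in this sketch")
    return struct.pack(">BBBH", 0x1C, record, dataset, len(data)) + data


# 1:90 Coded Character Set, declaring UTF-8 for records 2-6 and 8.
charset = iim_dataset(1, 90, UTF8_ESCAPE)
print(charset.hex(" "))  # 1c 01 5a 00 03 1b 25 47

# 2:25 Keywords, stored as UTF-8 bytes rather than Latin-1 or ASCII.
keyword = iim_dataset(2, 25, "Kašpárek".encode("utf-8"))
print(len(keyword))  # 15: 5 header bytes plus 10 bytes of UTF-8 data
```

A reader that finds this 1:90 value can decode the following records as UTF-8; when 1:90 is absent, the defaults quoted from the spec above apply.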
This entry is eligible for the GSoC 2016 project: https://community.kde.org/GSoC/2016/Ideas#Project:_digiKam_MetadataHub_improvements
As stated by Mikolaj Machowski, UTF-8 is explicitly and clearly part of the IPTC standard. Just checking the IPTC standards group website will confirm it. http://www.metadataworkinggroup.org/pdf/mwg_guidance.pdf The Metadata Working Group is the group that now sets the IPTC and XMP standards. All the software I know of for manipulating photo metadata can deal with IPTC written in UTF-8 (Adobe, Windows, exiftool, ...). This is a big problem for Digikam, as all Chinese, ... characters get "corrupted". This is the only thing that keeps me locked into Lightroom. I would love to see this change. Regards
Actually, this issue is related to others and has been reported several times, e.g. issues 195508, 370558, 379050, 379581.
For digiKam 7.1.0, IPTC in digiKam will fully support UTF-8. Bugs #379581, #379050, #370558 are now closed. Gilles Caulier