Summary: | Non printable characters in IPTC keyword set by digiKam (UTF-8 support with IPTC metadata) [patch] | ||
---|---|---|---|
Product: | [Applications] digikam | Reporter: | Jean-Marc Liotier <jm> |
Component: | Metadata-Iptc | Assignee: | Digikam Developers <digikam-bugs-null> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | caulier.gilles, colin, ebayard63-projet, kde_bugs, msylwester |
Priority: | NOR | ||
Version: | 0.9.2 | ||
Target Milestone: | --- | ||
Platform: | Ubuntu | ||
OS: | Linux | ||
Latest Commit: | http://commits.kde.org/libkexiv2/e4cfee882303b50f17e6301a9fa7e00ab821336b | Version Fixed In: | 4.0.0 |
Sentry Crash Report: | |||
Attachments: |
Added/changed IPTC keywords conversion to/from utf8
My test case
Use UTF-8 when reading/writing IPTC, set charset to UTF-8. |
Description
Jean-Marc Liotier
2008-03-13 01:02:53 UTC
digiKam always uses the ASCII character set to generate IPTC text fields.

Gilles Caulier

So you think that this is rather a rendering problem on the Gallery 2 side?

Yes...

Gilles

Or, if you have used the GalleryExport kipi-plugin, a problem with that tool...

Colin, what do you think about it?

Gilles

No use of the GalleryExport kipi-plugin: I copied the files to the server through scp and imported them into Gallery 2 from there. I opened a ticket in the Gallery 2 project. I'll let you know how it goes.

http://sourceforge.net/tracker/index.php?func=detail&aid=1913414&group_id=7130&atid=107130

Jean Marc,

Any news about this report?

Gilles Caulier

Probably fixed with commit 882181. Can you please try again and let me know?

Maybe it's a similar problem to 149029?

Jean Marc,

digiKam > 0.9.x supports XMP. XMP replaces IPTC and supports UTF-8. IPTC has never supported UTF-8 and has several limitations on string size. XMP does not have these limitations.

The Gallery server should support XMP by default and use it instead of IPTC.

Gilles Caulier

True, IPTC is becoming obsolete, but it is still used in $BIGNUMBER of programs and digiKam should be able to play with it nicely.

Jean Marc,

In the Gallery Export plugin coming with kipi-plugins 0.4.0, I have fixed the code to preserve the metadata of exported and resized images. Can you test with this version?

Gilles Caulier

Jean Marc,

Did you see my comment #12?

Gilles Caulier

(In reply to comment #10)
> Jean Marc,
>
> digiKam > 0.9.x supports XMP. XMP replaces IPTC and supports UTF-8. IPTC has
> never supported UTF-8 and has several limitations on string size. XMP does
> not have these limitations.
>
> The Gallery server should support XMP by default and use it instead of IPTC.
>
> Gilles Caulier

Hi Gilles,

Actually this is wrong. UTF-8 has officially been part of the IPTC standard since 1997 (XMP was first used in 2001 by Adobe in Acrobat). You can find a lot of publications on this subject, but the best is to check the standard directly on the IPTC website.
Note that the latest IPTC standards are based on the XMP implementation, but this is not what we are discussing here.

example: http://www.gwww.wan-ifra.org%2Fsystem%2Ffiles%2Ffield_ifra_mag_file%2FF_tp980258.pdf&ei=BrXKUq_HEYGshQeBmoHwBg&usg=AFQjCNHAaCBNHuKLvObVXCLL-ZlWs4TrTQ&sig2=JdQvFsrXcaKJOOz-ieanoA&bvm=bv.58187178,d.ZG4

or quoted from: http://www.iptc.org/std/IIM/4.1/specification/IIMV4.1.pdf (year 1999)

"Coded Character Set Optional, not repeatable, up to 32 octets, consisting of one or more control functions used for the announcement, invocation or designation of coded character sets. The control functions follow the ISO 2022 standard and may consist of the escape control character and one or more graphic characters. For more details see Appendix C, the IPTC-NAA Code Library. The control functions apply to character oriented DataSets in records 2-6. They also apply to record 8, unless the objectdata explicitly, or the File Format implicitly, defines character sets otherwise. If this DataSet contains the designation function for Unicode in UTF-8 then no other announcement, designation or invocation functions are permitted in this DataSet or in records 2-6. For all other character sets, one or more escape sequences are used...."

or from the Metadata Working Group, which sets the standards and of course uses them: www.metadataworkinggroup.com/pdf/mwg_guidance.pdf page 28 (note that the whole section is very interesting for digiKam, as it covers metadata reconciliation guidance)

"IPTC-IIM SHOULD be written using the Coded Character Set (1:90) as UTF-8 (see “Section 1.6 Coded Character Set” in the IIM specification). If the IPTC-IIM has not been written in UTF-8 before, a robust Changer SHOULD convert all properties to UTF-8 and write the corresponding identifier for UTF-8 to the 1:90 DataSet..."

In a word, digiKam is not standard compliant. This makes it incompatible with all other image management software, such as Adobe's.
It also corrupts our metadata, which really is a shame because it is still a really good piece of software.

Regards
Eric

Well, don't forget that digiKam does not write IPTC metadata to files itself. Everything is delegated to Exiv2. We pass a UTF-8 string to Exiv2, which chooses the best way to store the data in the right format. If Exiv2 supports this IPTC feature, why not. I'm pretty sure that Exiv2 doesn't do it. Please report this problem to the Exiv2 bugzilla first.

Gilles Caulier

I had a similar problem to the original one: after a metadata sync from images to the DB, I was getting a lot of tags with all non-ASCII characters converted to "?" in addition to the correct ones. exiftool helped me trace this to the IPTC tags, hence here I am.

I tried to follow up on where exactly the characters are replaced by "?", and it seems to be KExiv2::setIptcKeywords, which uses toLatin1(), and that works exactly this way. Changing it to use UTF-8 made it pass my test case. When I made digiKam use the modified lib it wrote the tags correctly, but they were still displayed garbled in the tags panel.

Created attachment 85355 [details]
Added/changed IPTC keywords conversion to/from utf8
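
For readers following along, the "?" substitution and the garbled display described in this comment can be reproduced outside Qt. The sketch below uses Python's latin-1 codec with errors="replace" as a stand-in for a Latin-1 conversion such as toLatin1(). This is an assumption for illustration, not the KExiv2 code path, and the sample keyword is made up.

```python
# Hypothetical keyword containing characters outside Latin-1.
keyword = "Zażółć"

# Stand-in for a Latin-1 conversion: characters that Latin-1 cannot
# represent are replaced by "?", while "ó" survives as byte 0xF3.
lossy = keyword.encode("latin-1", errors="replace")

# UTF-8 keeps every character and round-trips without loss.
utf8 = keyword.encode("utf-8")
roundtrip = utf8.decode("utf-8")

# Reading the UTF-8 bytes back as Latin-1 does not fail, but yields
# garbled text, which is one plausible cause of a garbled display.
garbled = utf8.decode("latin-1")

print(lossy)
print(roundtrip == keyword)
print(garbled == keyword)
```

Both sides therefore have to agree on the encoding: writing UTF-8 is not enough if the reader still assumes Latin-1 or ASCII.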
Created attachment 85356 [details]
My test case
With the modified library, input/output is consistent with using exiftool from the command line.
Created attachment 85394 [details]
Use UTF-8 when reading/writing IPTC, set charset to UTF-8.
I did a little more digging, and it seems that to store UTF-8 in IPTC it is necessary to set IPTC:CodedCharacterSet to "\33%G"; otherwise it's still ASCII. I've updated my patch to always read and save as UTF-8. ASCII->UTF-8 conversion when reading should be safe. Every time a field is changed the charset is set to UTF-8, so writing should be fine as well. I think it's overkill, but checking all the characters every time is perhaps even worse.
Apparently, other charsets could also be supported, but I couldn't find any details...
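
As a rough illustration of what setting the charset marker means at the byte level, here is a minimal sketch that serializes a Coded Character Set DataSet (1:90) holding ESC % G, followed by UTF-8 keywords in 2:25. It assumes the standard (non-extended) IIM DataSet layout of a 0x1C marker, record number, DataSet number, and a 16-bit big-endian length. The helper names are hypothetical, and this is not how Exiv2 or the attached patch builds the data.

```python
import struct

# ESC % G, written "\33%G" in octal: the ISO 2022 designation for UTF-8.
ESC_UTF8 = b"\x1b%G"


def iim_dataset(record, dataset, data):
    """Serialize one standard IIM DataSet (hypothetical helper)."""
    if len(data) > 0x7FFF:
        raise ValueError("extended DataSets are not handled in this sketch")
    # 0x1C tag marker, record number, DataSet number, 16-bit length.
    return struct.pack(">BBBH", 0x1C, record, dataset, len(data)) + data


def keywords_block(keywords):
    """Emit 1:90 first, then one 2:25 (Keywords) DataSet per keyword."""
    out = iim_dataset(1, 90, ESC_UTF8)
    for kw in keywords:
        out += iim_dataset(2, 25, kw.encode("utf-8"))
    return out


block = keywords_block(["été"])
print(block.hex())
```

Writing the 1:90 marker on every change, as the comment above describes, is the simple choice: it avoids scanning each string to decide whether UTF-8 is actually needed.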
Git commit e4cfee882303b50f17e6301a9fa7e00ab821336b by Gilles Caulier.
Committed on 03/03/2014 at 08:47.
Pushed by cgilles into branch 'master'.

Review and apply patch #85394 from Michal Sylwester to support UTF-8 encoding/decoding with IPTC metadata.
Tested with a non-UTF-8 IPTC image. Characters are still decoded as ASCII if Iptc.Envelope.CharacterSet is not present.
FIXED-IN: 4.0.0

M +72 -37 libkexiv2/kexiv2iptc.cpp

http://commits.kde.org/libkexiv2/e4cfee882303b50f17e6301a9fa7e00ab821336b |
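
The read-side behaviour noted in the commit message (UTF-8 when the character set is declared, ASCII otherwise) can be sketched as follows. The function name and the handling of undecodable bytes are assumptions for illustration. The actual logic lives in libkexiv2/kexiv2iptc.cpp and may differ, in particular for character sets other than UTF-8.

```python
# ESC % G: the value of 1:90 / Iptc.Envelope.CharacterSet that announces UTF-8.
ESC_UTF8 = b"\x1b%G"


def decode_iptc_text(raw, coded_character_set=None):
    """Decode an IPTC text value (hypothetical helper, not the libkexiv2 API)."""
    if coded_character_set == ESC_UTF8:
        # The file declares UTF-8, so trust it.
        return raw.decode("utf-8", errors="replace")
    # No declaration (or another charset, which this sketch does not
    # handle): fall back to ASCII and mark undecodable bytes.
    return raw.decode("ascii", errors="replace")


print(decode_iptc_text("été".encode("utf-8"), ESC_UTF8))
print(decode_iptc_text(b"plain"))
```

Plain ASCII keywords decode identically on both branches, which is why older files without the marker keep working.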