Version: 1.0.0-beta3 (using 4.3.00 (KDE 4.3.0), Debian packages)
OS: Linux (i686) release 2.6.30-1-686
If storing to image metadata is activated, digiKam generates invalid EXIF UserComment charset identifiers.
The UserComment field is tagged as "ASCII" although the actual content seems to be encoded in my system's locale, ISO 8859-15.
I'd like to see the string encoded as UTF-8 and tagged as UNICODE if it cannot be represented as an ASCII string.
The current behaviour creates invalid UserComment fields, and subsequent metadata processing in other applications gets messed up.
Looking at the code, we are writing either Latin1 or Unicode UCS-2.
Can you give us a sample image with invalid user comment field written by digikam?
Yes, digiKam writes Latin1 or maybe Latin9 (ISO 8859-1 or -15) and tags the comment as "ASCII".
AFAIK Latin charsets are not supported for this EXIF header field. This causes problems with other applications which process the comment. As far as I understand, the solution would be to encode the comment in UTF-8 if it contains non-ASCII characters (i.e. codes > 127) and tag the comment field appropriately.
However, maybe I'm wrong and the UserComment field does somehow support Latin charsets?
The Exif standard defines the use of ASCII, JIS or Unicode. The standard does not say which Unicode variant, but apparently in practice it is UCS-2 (two bytes per character, the fixed-width subset of UTF-16).
Ok, good to know. But not Latin1/9?
From my understanding, ASCII is Latin1.
From what I understand, ASCII is only a 7-bit encoding (a byte provides 7 data bits plus space for one optional parity bit), in contrast to the Latin-X encodings, which are 8-bit. The first 128 characters of Latin1 match the ASCII character set, AFAIK, but all codes >= 128 are not defined in ASCII.
To be sure, I looked up the standardisation papers which seem to back my opinion:
The EXIF 2.2 standard (http://www.exif.org/Exif2-2.PDF) states on page 28 that the reference documentation for character code ASCII is ITU-T T.50 IA5 (ITU-T International Alphabet No. 5, now ITU-T IRA = International Reference Alphabet).
The International Reference Alphabet is a 7-bit-encoding, the ITU-T recommendation document can be found at: http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.50-199209-I!!PDF-E&type=items
So in my eyes, Latin1 strings containing characters with codes larger than 127 are not allowed in UserComment fields with an encoding type of "ASCII" (or in any EXIF header field which mandates ASCII encoding), and such strings must be recoded to Unicode and written as a UserComment field with type "Unicode". (It'd probably be good for interoperability to use "ASCII" whenever no invalid characters appear within the string.)
In the case of header fields which only allow ASCII encoding, transliteration of these invalid characters would be needed. (iconv can do that, for example, converting e.g. "ö" to "oe" and the like.)
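The transliteration approach mentioned above can be sketched with the POSIX iconv API. This is an illustration only: the //TRANSLIT suffix is a glibc extension, the actual substitutions ("ö" to "oe", "o" or "?") depend on the iconv implementation and the active locale, and the function name is mine.

```cpp
#include <iconv.h>
#include <string>

// Transliterate a UTF-8 string to 7-bit ASCII using iconv's //TRANSLIT
// extension (glibc; other iconv implementations may differ). Characters
// without a transliteration are replaced, typically with '?'.
std::string toAsciiLossy(const std::string& utf8)
{
    iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t)-1)
        return std::string();

    std::string out(utf8.size() * 4 + 16, '\0');
    char*  inPtr   = const_cast<char*>(utf8.data());
    size_t inLeft  = utf8.size();
    char*  outPtr  = &out[0];
    size_t outLeft = out.size();

    iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);

    out.resize(out.size() - outLeft);  // trim unused buffer space
    return out;
}
```

Whatever substitution the locale tables pick, the result contains only 7-bit characters and is therefore safe for an ASCII-only header field.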
Latin1 would be acceptable with an "undefined" encoding type (8 null bytes, see EXIF spec page 29), but that would not help interoperability at all...
The EXIF spec only refers to the Unicode spec in the case of a "Unicode" encoding type, so like you, I'm not sure which flavour of Unicode should be used. I'm not familiar with the Unicode spec and have not looked up any details so far, but the exact encoding of a Unicode file is determined by its first few bytes, which must carry a Byte Order Mark (BOM) in the case of UTF-16 and UTF-32, while a BOM is allowed but optional for UTF-8 files (http://en.wikipedia.org/wiki/Byte-order_mark). Maybe the encoding used for shorter Unicode sequences like the UserComment string is distinguished the same way?
In that case it would probably be preferable to use UTF-8 if the input is Latin-X, as this should result in the shortest byte sequences after recoding. The "deluxe solution" would be to dynamically pick whichever Unicode encoding produces the shortest byte sequence.
Do you know which unicode formats are allowed for the Exif UserComment? Specifically, if Utf-8 is allowed or if UTF-16 is required? Would it be a good decision to always use UNICODE there?
I found only a hidden hint that seems to point to UTF-16 for a "UNICODE" UserComment. It's in the comments for tag ImageDescription, on page 22 of the Exif specs: "When a 2-byte code is necessary, the Exif Private tag UserComment is to be used".
Exiv2 doesn't do any conversion (yet...), it leaves it to the application to do the right thing.
For comparison, Exiftool writes the UserComment tag with an Exif character code "ASCII" if the text consists of only 7-bit characters, else it uses the Exif character code "UNICODE" and encodes the text in UTF-16.
It encodes the UTF-16 string using the same byte order as the rest of the Exif/TIFF structure and without a BOM.
On read it expects a UTF-16 encoded text, has some intelligence to guess the byte order, and interprets a BOM if there is one. It doesn't seem to have any provision for UTF-8 encoded UserComment text, though.
Exiv2 should probably follow a similar logic eventually, although I'd think that there are images with UTF-8 encoded UserComment tags out there in the wild.
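The reading heuristic described for Exiftool could be sketched like this. A simplified illustration under my own naming, restricted to UCS-2 (no surrogate pairs): honour a BOM if present, otherwise guess the byte order from where the null bytes fall, since ASCII-range characters have a zero high byte.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Decode the raw bytes of a "UNICODE" UserComment assumed to be UCS-2.
// Uses the BOM if present; otherwise guesses the byte order by counting
// zero bytes at even vs. odd offsets (for mostly-ASCII text, the zero
// high bytes reveal the ordering).
std::vector<uint16_t> decodeUcs2(const std::vector<uint8_t>& b)
{
    std::size_t i = 0;
    bool littleEndian = true;
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE) { littleEndian = true;  i = 2; }
    else if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF) { littleEndian = false; i = 2; }
    else {
        std::size_t zeroEven = 0, zeroOdd = 0;
        for (std::size_t k = 0; k + 1 < b.size(); k += 2) {
            if (b[k] == 0)     ++zeroEven;
            if (b[k + 1] == 0) ++zeroOdd;
        }
        // Little-endian puts the (often zero) high byte second.
        littleEndian = zeroOdd >= zeroEven;
    }

    std::vector<uint16_t> out;
    for (; i + 1 < b.size(); i += 2) {
        uint16_t u = littleEndian
            ? static_cast<uint16_t>(b[i] | (b[i + 1] << 8))
            : static_cast<uint16_t>((b[i] << 8) | b[i + 1]);
        out.push_back(u);
    }
    return out;
}
```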
With the fix for http://dev.exiv2.org/issues/show/662 Exiv2 now stores Exif UNICODE user-comments in UCS-2 (using the byte-order of the Exif data and without a BOM). The API expects and returns Exif UNICODE user-comments in UTF-8. The behaviour of Exif ASCII, JIS and UNDEFINED user-comments remains unchanged.
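The conversion this fix implies (UTF-8 at the API boundary, UCS-2 in the Exif data, chosen byte order, no BOM) can be sketched as follows. This is not Exiv2's actual code; the function name is mine, and the sketch handles only code points up to U+FFFF (1- to 3-byte UTF-8 sequences) without validating the input.

```cpp
#include <cstdint>
#include <cstddef>
#include <string>
#include <vector>

// Convert a UTF-8 string to UCS-2 bytes in the requested byte order,
// without a BOM. Illustrative sketch: covers the Basic Multilingual
// Plane only and assumes well-formed UTF-8 input.
std::vector<uint8_t> utf8ToUcs2(const std::string& s, bool littleEndian)
{
    std::vector<uint8_t> out;
    for (std::size_t i = 0; i < s.size(); ) {
        unsigned char c = s[i];
        uint16_t cp;
        if (c < 0x80) {                        // 1-byte sequence (ASCII)
            cp = c;
            i += 1;
        } else if ((c & 0xE0) == 0xC0) {       // 2-byte sequence
            cp = static_cast<uint16_t>(((c & 0x1F) << 6) | (s[i + 1] & 0x3F));
            i += 2;
        } else {                               // 3-byte sequence
            cp = static_cast<uint16_t>(((c & 0x0F) << 12)
                                       | ((s[i + 1] & 0x3F) << 6)
                                       | (s[i + 2] & 0x3F));
            i += 3;
        }
        if (littleEndian) { out.push_back(cp & 0xFF); out.push_back(cp >> 8); }
        else              { out.push_back(cp >> 8);   out.push_back(cp & 0xFF); }
    }
    return out;
}
```

For example, "Aü" (UTF-8 bytes 41 C3 BC) becomes 41 00 FC 00 in little-endian UCS-2.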
Sounds like we need to adapt the libkexiv2 code. Right?
Probably yes. Assuming libkexiv2 has an interface like
void setUserComment(const std::string& comment);
to set an Exif user-comment, then the comment passed in should always be UTF-8 encoded now. The function can then simply set an Exif UNICODE user-comment, or, as suggested above somewhere, analyse the comment and use an Exif ASCII user-comment if the text is (7-bit) ASCII only and a UNICODE user-comment if not.
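The suggested analysis could look like this on the libkexiv2 side. A hypothetical helper, assuming the comment already arrives as UTF-8 as described; the real interface may differ.

```cpp
#include <string>

enum class ExifCharset { Ascii, Unicode };

// Decide which Exif character code to use for a UTF-8 encoded comment:
// plain "ASCII" if every byte is in the 7-bit range, otherwise "UNICODE".
// (In valid UTF-8, any non-ASCII character produces bytes >= 0x80, so a
// simple byte scan is sufficient.)
ExifCharset charsetForComment(const std::string& utf8Comment)
{
    for (unsigned char c : utf8Comment)
        if (c > 0x7F)
            return ExifCharset::Unicode;
    return ExifCharset::Ascii;
}
```

A setUserComment() implementation would then hand the comment to Exiv2 tagged accordingly, with Exiv2 >= 0.20 accepting UTF-8 directly for UNICODE comments.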
That's about Exif.Photo.UserComment, isn't it?
What about backwards compatibility: Up to now we always passed UCS-2. Should we keep this behavior for older Exiv2 versions? If yes, from exactly which version number on should we pass UTF-8?
Yes, for old versions you need to continue to pass UCS-2. The UTF-8 interface will be in version 0.20 (in case there are any 0.19.x versions these will be backward compatible and not contain this change).
SVN commit 1079054 by mwiesweg:
For libexiv2 0.20, use UTF-8 for Unicode Exif UserComments.
Needs to be tested when an Exiv2 library containing the fix is released.
M +6 -0 kexiv2_p.cpp
M +8 -0 kexiv2exif.cpp
WebSVN link: http://websvn.kde.org/?view=rev&revision=1079054
Is this bug still valid with the digiKam 2.x series?
Sorry for the delay, will check soon.
Yes, looks as if this bug is still present.
The following is a hex dump excerpt from a file with a comment that I just saved using digiKam 2.1 (Kubuntu 11.10 package).
000009f0 00 00 41 53 43 49 49 00 00 00 54 65 73 74 62 fc |..ASCII...Testbü|
00000a00 6c 64 2d 43 6f 6d 6d 65 6e 74 00 f8 2a 00 e1 00 |ld-Comment.ø*.á.|
The comment is still tagged as "ASCII" but stored with characters outside of the ASCII range.
Git commit ada6ac69a5301f380d2f3dc98ce8678c480a6309 by Marcel Wiesweg.
Committed on 03/06/2012 at 18:42.
Pushed by mwiesweg into branch 'master'.
Fix detection of true 7bit ASCII.
The Qt Latin1 codec will only tell us if the characters are in the 8bit ISO-8859-1 range.
Use the C function isascii to test if the characters are in the true 7bit ASCII range.
M +32 -12 libkexiv2/kexiv2exif.cpp
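The distinction this commit describes can be illustrated as follows: a Latin1 round-trip check accepts any 8-bit ISO-8859-1 character (such as 0xFC, 'ü', which appears in the hex dump above), whereas isascii() accepts only codes 0 to 127. Sketch only; the function name is mine, not the actual kexiv2exif.cpp code.

```cpp
#include <cctype>
#include <string>

// Check for true 7-bit ASCII as the commit describes: isascii() accepts
// only codes 0..127, so 8-bit Latin1 characters like 0xFC ('ü') are
// correctly rejected.
bool isTrueAscii(const std::string& text)
{
    for (unsigned char c : text)
        if (!isascii(c))
            return false;
    return true;
}
```

With this check in place, a comment like "Testbüld-Comment" is classified as non-ASCII and written as a UNICODE user-comment instead of being mislabelled "ASCII".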