Bug 114211

Summary: image comments encoding unreadable after moving an album
Product: [Applications] digikam
Reporter: Nadav Kavalerchik <nadavkav>
Component: Metadata-Engine
Assignee: Digikam Developers <digikam-bugs-null>
Status: RESOLVED FIXED
Severity: normal
CC: caulier.gilles
Priority: NOR
Version: 0.8.0
Target Milestone: ---
Platform: Fedora RPMs
OS: Linux
Latest Commit:
Version Fixed In: 0.9.0
Sentry Crash Report:

Description Nadav Kavalerchik 2005-10-11 08:49:07 UTC
Version:           0.8.0-beta2 (using KDE KDE 3.4.90)
Installed from:    Fedora RPMs

I moved a folder/album of images out of digikam's 'root image folder' to somewhere else and then linked it back into the 'root image folder'. When I opened digikam, most of the comments had changed to something unreadable like "ק×*מפ×*ס×~" instead of showing plain Hebrew text (I think it's an encoding conversion problem). I'm almost sure the comments were originally stored in UTF-8 (they were originally entered using digikam's dialog). Some other comments are not shown at all.

Viewing the comment in the meta-info tab of Konqueror's properties dialog shows the correct Hebrew text/encoding
(so I think the problem is only in digikam's db).
Comment 1 Tom Albers 2005-10-11 17:44:37 UTC
Have you perhaps moved the digikam3.db as well, and forgot to link it back? Is it on nfs now perhaps?

Could you install the sqlite3 command line client and do the following: sqlite3 /path/to/images/digikam3.db
Then you get a sqlite> prompt, enter "select * from settings;" (without the quotes) and paste the result here. 
Comment 2 Nadav Kavalerchik 2005-10-12 08:41:38 UTC
i have not moved the db file.
I've just moved a folder to a different physical disk on the same machine.

sqlite3 ./digikam3.db
SQLite version 3.2.7
Enter ".help" for instructions
sqlite> select * from settings;
DBVersion|1
Locale|UTF-8
UpgradedFromSqlite2|yes
Scanned|2005-10-12T08:31:10
Comment 3 Tom Albers 2005-10-12 09:31:03 UTC
Can you also paste the output of:
cat .kde/share/config/digikamrc  | grep Local

and paste the output of the command locale
Comment 4 Nadav Kavalerchik 2005-10-12 10:15:44 UTC
Locale=en_US.UTF-8
Comment 5 Tom Albers 2005-10-12 10:30:29 UTC
what does "locale" say when you type that in a konsole?
Comment 6 Nadav Kavalerchik 2005-10-12 16:00:28 UTC
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Comment 7 Marcel Wiesweg 2006-04-23 22:53:52 UTC
From your description, I assume you did the move externally and then opened digikam?
In this case, does the problem also occur (test with a single image) if you copy/move a file to a different album from inside digikam?
Then, does it happen if you right-click -> copy, then right-click -> paste (choose different filename)?
Comment 8 Nadav Kavalerchik 2006-04-24 16:50:17 UTC
OK, here is what I did.
I copied (and also tried moving) images with Hebrew comments from within digikam, and everything works just fine.
When I then open the moved/copied images in Konqueror and view the meta-info tab (properties view), the Hebrew text is displayed correctly.
Moving images with Konqueror and then checking the comments inside digikam shows unreadable characters (example:  × ×× ×××× ×).
So what I assume is that digikam is reading the Hebrew text from the Exif comment using the wrong encoding (maybe it reads it as Hebrew ISO-8859-8 and converts it to UTF-8, while it is already UTF-8 in the first place; it's only an idea!).

kindly,
nadav :-)
btw. amazing applications !!!
Comment 9 Marcel Wiesweg 2006-04-24 18:36:37 UTC
You are right, when moving inside digikam, the database entries are just copied. When the move is done outside digikam, digikam will find new images and re-read the exif comments. So it is a DMetaData/libexiv2 problem, but possibly the specifications are the actual problem (I think IPTC only knows latin encoding).
Testing this here, I set some Hebrew and Arabic text from the internet as a comment, and it works well inside the db, but I cannot read it from Konqueror or the metadata tabs (no utf8 locale?).

Anyway, in setImageComment, all comments are converted to latin1. I do not know enough about libexiv2 and the specs to go further into this.
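
A minimal sketch of one plausible mechanism for the garbage text, in standard C++ rather than Qt: if bytes that are already UTF-8 are decoded as Latin-1 and then stored as UTF-8 again, every non-ASCII byte turns into a character of its own. The helper name latin1ToUtf8 is hypothetical and is not part of digikam or libexiv2.

```cpp
#include <string>

// Hypothetical helper: treat every input byte as a Latin-1 code point
// (U+0000..U+00FF) and encode that code point as UTF-8.
std::string latin1ToUtf8(const std::string &bytes)
{
    std::string out;
    for (unsigned char c : bytes)
    {
        if (c < 0x80)
        {
            out += static_cast<char>(c);                  // ASCII stays one byte
        }
        else
        {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return out;
}
```

The Hebrew letter ק is the two bytes D7 A7 in UTF-8. Passing those bytes through latin1ToUtf8 yields C3 97 C2 A7, which displays as "×§". Every Hebrew letter in the U+05D0..U+05EA range starts with the byte D7, so each one produces a leading "×", which is consistent with the pattern in the reported examples.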
Comment 10 Nadav Kavalerchik 2006-04-24 21:42:01 UTC
can you open/move the bug report to libexiv2 project ?
Comment 11 Marcel Wiesweg 2006-04-24 22:13:21 UTC
I would rather move this bug to Gilles :-)
Maybe this is the fault of our own DMetadata, maybe libexiv2 does not touch the strings at all.
Bug 120241 may be the same problem as this one.
Comment 12 Nadav Kavalerchik 2006-04-24 22:28:57 UTC
excellent :-)

thanks.
Comment 13 caulier.gilles 2006-04-25 14:39:20 UTC
Marcel,

To clarify how comments are embedded into JPEG files, I will describe this feature. There are 3 ways to embed a comment in a JPEG:

1/ The first is the JFIF COM section.

Important: this section is outside the EXIF/IPTC/XMP sections. Contrary to what the Konqueror file properties dialog says, the comments shown there come from JFIF, not from Exif!!! (Perhaps this problem has been fixed in KDE 3.5.x.)

There is no known size limitation, but I don't know whether UTF8 is supported here. digikam has always used this method to store comments in JPEG files since the 0.7.x release.

2/ The second is the Exif UserComment tag. According to the Exiv2 documentation, this field supports UTF8. In trunk, the DMetadata class implementation needs to be fixed following this example.

//-------------------------------------------------------------
Exiv2::Image::AutoPtr image = Exiv2::ImageFactory::open(argv[1]);
assert (image.get() != 0);
image->readMetadata();
Exiv2::ExifData &exifData = image->exifData();

/*
 Exiv2 uses a CommentValue for Exif user comments. The format of the
 comment string includes an optional charset specification at the beginning:

 [charset=["]Ascii|Jis|Unicode|Undefined["] ]comment

 Undefined is used as a default if the comment doesn't start with a charset
 definition.

 Following are a few examples of valid comments. The last one is written to
 the file.
*/
exifData["Exif.Photo.UserComment"]
    = "charset=\"Unicode\" An Unicode Exif comment added with Exiv2";
exifData["Exif.Photo.UserComment"]
    = "charset=\"Undefined\" An undefined Exif comment added with Exiv2";
exifData["Exif.Photo.UserComment"]
    = "Another undefined Exif comment added with Exiv2";
exifData["Exif.Photo.UserComment"]
    = "charset=Ascii An ASCII Exif comment added with Exiv2";

std::cout << "Writing user comment '"
          << exifData["Exif.Photo.UserComment"]
          << "' back to the image\n";

image->writeMetadata();
//-------------------------------------------------------------
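
On the reading side, the comment string in the format described above has to be split into its charset name and the actual payload before any decoding can happen. The following is a rough sketch in standard C++, assuming the string arrives exactly in the "[charset=...] comment" form; the function name splitCharset is hypothetical and is not an Exiv2 or digikam API.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: split "charset=Name payload" into (Name, payload).
// Returns an empty charset name when no usable prefix is present.
std::pair<std::string, std::string> splitCharset(const std::string &comment)
{
    const std::string prefix = "charset=";
    if (comment.compare(0, prefix.size(), prefix) != 0)
        return std::make_pair(std::string(), comment);

    // The charset name ends at the first space after the prefix.
    const std::string::size_type space = comment.find(' ', prefix.size());
    if (space == std::string::npos)
        return std::make_pair(std::string(), comment);

    std::string name = comment.substr(prefix.size(), space - prefix.size());

    // The double quotes around the name are optional in this format.
    if (name.size() >= 2 && name.front() == '"' && name.back() == '"')
        name = name.substr(1, name.size() - 2);

    return std::make_pair(name, comment.substr(space + 1));
}
```

The payload is kept as a std::string on purpose: for the Unicode charset it is binary data that may contain zero bytes, so it should not be passed through functions that expect a null-terminated const char *.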

3/ The last is the IPTC caption tag: it does not support UTF8 (only ASCII/latin1) and is limited to 2000 characters.

In fact, to definitively solve the UTF8 problem with metadata in the future, we need to support XMP metadata. Exiv2 will support XMP in the future:

http://dev.robotbattle.com/bugs/view_all_bug_page.php (select the Exiv2 project at the top right of the page)

We need to be patient (:=)))...

Some URLs:

http://park2.wakwak.com/~tsuruzoh/Computer/Digicams/exif-e.html
http://www.iptc.org/IIM/4.1/specification/IIMV4.1.pdf
http://www.adobe.com/products/xmp/main.html

Gilles Caulier
Comment 14 Marcel Wiesweg 2006-05-02 18:36:24 UTC
So we can do something about 1) and 2), not 3). Most important is 1), because the JFIF comment is read first, so when digikam reads any comment written by digikam, this one is used.

1) JFIF comment: Konqueror (the Jpeg KFilePlugin from kdegraphics) reads the JFIF comment as utf8.
I could not find any spec, and I do not know how other apps do it, but reading and writing utf8 here is the easiest solution.

2) EXIF comment: We need to interpret charset information from exiv2, I have some code for this on my computer. 
One question is what we do with the undefined charset: Ascii or local8Bit? Spec is just as specific on the question ("Although the possibility of unreadable characters exists, display of these characters is left as a matter of reader implementation.")
Second question is whether we write the EXIF comment as Unicode (I think ucs2, not utf8). Other choices are Ascii or Undefined. Said KFilePlugin does not support unicode (reading ascii, but the value is not used at all I think). Which other apps could I test this with? 
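
One possible answer to the second question is to write the Unicode form only when the text does not fit into a single-byte charset. A small sketch of that policy in standard C++, assuming the comment is available as a sequence of code points; chooseExifCharset and its limit parameter are hypothetical names, and the returned labels are the charset names from the Exiv2 comment format.

```cpp
#include <string>

// Hypothetical policy helper: pick the Exif UserComment charset label.
// "limit" is the highest code point the 8-bit form may contain:
// 0xFF for a Latin-1 test, 0x7F for strict ASCII.
std::string chooseExifCharset(const std::u32string &text, char32_t limit = 0xFF)
{
    for (char32_t cp : text)
    {
        if (cp > limit)
            return "Unicode";   // at least one character needs the 16-bit form
    }
    return "Ascii";             // every character fits in one byte
}
```

With the default limit, a comment such as "café" is still labelled Ascii although é is outside 7-bit ASCII; passing 0x7F as the limit makes the check strict. Which of the two is safer for other readers of the file is exactly the kind of thing that would need testing with other apps.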
Comment 15 Marcel Wiesweg 2006-05-21 17:32:51 UTC
SVN commit 543272 by mwiesweg:

Add some autodetection magic for charset support

- DMetadata::detectEncodingAndDecode will check if a given string
  is in UTF8. If not, it will leave it to QTextCodec to decide
  if the local charset or latin1 will be used
- use detectEncodingAndDecode when reading the JFIF comment
  and for Exif comments with undefined encoding
- When writing the Exif comment, use UCS-2 only when
  necessary. Check with QTextCodec::canEncode if plain
  latin1 is enough.

I have tested this successfully with some Arabian and cyrillic characters.
But please test this with some more pictures. UTF-8 should be no problem,
but the local8Bit vs. latin1 decision may be.

CCBUGS: 120241, 114211



 M  +75 -15    dmetadata.cpp  
 M  +3 -0      dmetadata.h  


--- trunk/extragear/graphics/digikam/libs/dmetadata/dmetadata.cpp #543271:543272
@@ -33,7 +33,9 @@
 
 // KDE includes.
 
+#include <kapplication.h>
 #include <kdebug.h>
+#include <kstringhandler.h>
 #include <ktempfile.h>
 
 // Exiv2 includes.
@@ -714,7 +716,7 @@
 
         // In first we trying to get image comments, outside of Exif and IPTC.
 
-        QString comments = QString::fromUtf8(d->imageComments.c_str());
+        QString comments = detectEncodingAndDecode(d->imageComments);
 
         if (!comments.isEmpty())
             return comments;
@@ -780,18 +782,32 @@
 
         // In Second we write comments into Exif.
 
-        // Be aware that we are dealing with a UCS-2 string.
-        // Null termination means \0\0, strlen does not work,
-        // do not use any const-char*-only methods,
-        // pass a std::string and not a const char * to ExifDatum::operator=().
-        const unsigned short *ucs2 = comment.ucs2();
-        std::string exifComment("charset=\"Unicode\" ");
-        exifComment.append((const char*)ucs2, sizeof(unsigned short) * comment.length());
-        d->exifMetadata["Exif.Photo.UserComment"] = exifComment;
-        //d->exifMetadata["Exif.Photo.UserComment"] = comment.latin1();
+        // Write as Unicode only when necessary.
+        QTextCodec *latin1Codec = QTextCodec::codecForName("iso8859-1");
+        if (latin1Codec->canEncode(comment))
+        {
+            // write as ASCII
+            std::string exifComment("charset=\"Ascii\" ");
+            exifComment += comment.latin1();
+            d->exifMetadata["Exif.Photo.UserComment"] = exifComment;
+        }
+        else
+        {
+            // write as Unicode (UCS-2)
 
-        // In Third we write comments into Iptc. Note that Caption IPTC tag is limited to 2000 char.
+            // Be aware that we are dealing with a UCS-2 string.
+            // Null termination means \0\0, strlen does not work,
+            // do not use any const-char*-only methods,
+            // pass a std::string and not a const char * to ExifDatum::operator=().
+            const unsigned short *ucs2 = comment.ucs2();
+            std::string exifComment("charset=\"Unicode\" ");
+            exifComment.append((const char*)ucs2, sizeof(unsigned short) * comment.length());
+            d->exifMetadata["Exif.Photo.UserComment"] = exifComment;
+        }
 
+        // In Third we write comments into Iptc.
+        // Note that Caption IPTC tag is limited to 2000 char and ASCII charset.
+
         QString commentIptc = comment;
         commentIptc.truncate(2000);
         d->iptcMetadata["Iptc.Application2.Caption"] = commentIptc.latin1();
@@ -815,7 +831,7 @@
     {
         std::string comment = exifDatum.toString();
         std::string charset;
-    
+
         // libexiv2 will prepend "charset=\"SomeCharset\" " if charset is specified
         // Before conversion to QString, we must know the charset, so we stay with std::string for a while
         if (comment.length() > 8 && comment.substr(0, 8) == "charset=")
@@ -830,7 +846,7 @@
                 comment = comment.substr(pos+1);
             }
         }
-    
+
         if (charset == "\"Unicode\"")
         {
             // QString expects a null-terminated UCS-2 string.
@@ -849,8 +865,7 @@
         }
         else
         {
-            // or from local8bit ??
-            return QString::fromLatin1(comment.c_str());
+            return detectEncodingAndDecode(comment);
         }
     }
     catch( Exiv2::Error &e )
@@ -863,6 +878,51 @@
     return QString();
 }
 
+QString DMetadata::detectEncodingAndDecode(const std::string &value)
+{
+    // For charset autodetection, we could use sophisticated code
+    // (Mozilla chardet, KHTML's autodetection, QTextCodec::codecForContent),
+    // but that is probably too much.
+    // We check for UTF8, Local encoding and ASCII.
+
+    if (value.empty())
+        return QString();
+
+#if KDE_IS_VERSION(3,2,0)
+    if (KStringHandler::isUtf8(value.c_str()))
+    {
+        return QString::fromUtf8(value.c_str());
+    }
+#else
+    // anyone who is still running KDE 3.0 or 3.1 is missing so many features
+    // that he will have to accept this missing feature.
+    return QString::fromUtf8(value.c_str());
+#endif
+
+    // Utf8 has a pretty unique byte pattern.
+    // Thats not true for ASCII, it is not possible
+    // to reliably autodetect different ISO-8859 charsets.
+    // We try if QTextCodec can decide here, otherwise we use Latin1.
+    // Or use local8Bit as default?
+
+    // load QTextCodecs
+    QTextCodec *latin1Codec = QTextCodec::codecForName("iso8859-1");
+    //QTextCodec *utf8Codec   = QTextCodec::codecForName("utf8");
+    QTextCodec *localCodec  = QTextCodec::codecForLocale();
+
+    // make heuristic match
+    int latin1Score = latin1Codec->heuristicContentMatch(value.c_str(), value.length());
+    int localScore  = localCodec->heuristicContentMatch(value.c_str(), value.length());
+
+    // convert string:
+    // Use whatever has the larger score, local or ASCII
+    if (localScore >= 0 && localScore >= latin1Score)
+        return localCodec->toUnicode(value.c_str(), value.length());
+    else
+        return QString::fromLatin1(value.c_str());
+}
+
+
 /*
 Iptc.Application2.Urgency <==> digiKam Rating links:
 
--- trunk/extragear/graphics/digikam/libs/dmetadata/dmetadata.h #543271:543272
@@ -21,6 +21,8 @@
 #ifndef DMETADATA_H
 #define DMETADATA_H
 
+#include <string>
+
 // QT includes.
 
 #include <qcstring.h>
@@ -108,6 +110,7 @@
     PhotoInfoContainer getPhotographInformations() const;
 
     static QString convertCommentValue(const Exiv2::Exifdatum &comment);
+    static QString detectEncodingAndDecode(const std::string &value);
 
 private:
 
Comment 16 Marcel Wiesweg 2006-05-22 20:46:05 UTC
For JFIF, we now have support for reading, autodetecting and writing comments as UTF8. For Exif, we interpret all charsets that are specified and use the same autodetection for unspecified charsets. Closing this bug.
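
For readers without KDE at hand, the first step of the autodetection (is this byte string UTF-8 at all?) can be approximated in standard C++. This is a simplified stand-in, not the implementation of KStringHandler::isUtf8; the name looksLikeUtf8 is hypothetical, and the check only looks at the lead/continuation byte structure.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a UTF-8 check: returns true when "s" consists
// only of structurally well-formed UTF-8 sequences of 1 to 4 bytes.
// Simplified: overlong encodings and surrogates are not rejected.
bool looksLikeUtf8(const std::string &s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80)                extra = 0;  // 0xxxxxxx, ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 110xxxxx
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 11110xxx
        else return false;                       // stray continuation or invalid lead byte

        if (i + extra >= s.size())
            return false;                        // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char n = static_cast<unsigned char>(s[i + k]);
            if ((n & 0xC0) != 0x80)
                return false;                    // not a 10xxxxxx continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

Pure ASCII passes this check, which is harmless because ASCII decodes identically as UTF-8. Text in an ISO-8859 charset with accented or non-Latin letters almost always fails it, and that is the case where the fallback between the local 8-bit charset and latin1 comes into play.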