Bug 120241

Summary: utf8 display and edit
Product: [Applications] digikam Reporter: Jean-Daniel Dodin <jdd>
Component: Tags-Captions Assignee: Digikam Developers <digikam-bugs-null>
Status: RESOLVED FIXED    
Severity: normal CC: ach, caulier.gilles
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: openSUSE   
OS: Linux   
Latest Commit: Version Fixed In: 0.9.0
Attachments: fixed caption encoding when loading from jpeg exif

Description Jean-Daniel Dodin 2006-01-16 11:52:11 UTC
Version:            (using KDE KDE 3.4.2)
Installed from:    SuSE RPMs
OS:                Linux

I use SUSE Linux OSS 10.0. This distribution uses UTF-8 (LANG=fr_FR.UTF8 in my case).
I usually write comments in the EXIF data.

All goes well: comments are written and Konqueror displays them.

But if I use Konqueror to copy a photo from one folder to another, digiKam no longer displays the UTF-8 characters correctly (it displays Valérie instead of Valérie), even in the comment editing tool.

And worse, when exporting to HTML the same Valérie is exported :-(
<div align="center">Valérie et Virginie</div>
Comment 1 Jean-Daniel Dodin 2006-01-17 11:33:05 UTC
In 0.8.1 (compiled locally from SVN),
inside digiKam:

* right-click copy/paste: the UTF-8 is corrupted
* mouse click and Shift, then copy: the UTF-8 is _not_ corrupted

jdd
Comment 2 Mikolaj Machowski 2006-01-17 13:26:08 UTC
> I use SUSE Linux OSS 10.0. This distribution uses UTF-8 (LANG=fr_FR.UTF8
> in my case). I usually write comments in the EXIF data.


Confirming problem for trunk, Mandriva 2005LE, whole KDE from SVN 3.5
branch (digiKam from trunk - 0.9svn). LANG=pl_PL (encoding iso-8859-2).

When copying images with comments from one album to another, non-Latin-1
letters are broken. It looks like the UTF-8 bytes are displayed directly
as iso-8859-2.
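
The symptom can be reproduced outside digiKam. The following is a minimal C++ sketch, not digiKam code. It assumes the wrong charset is ISO-8859-1 (Latin-1) rather than iso-8859-2, because that is what yields the "é" seen in the original report, and the helper name misreadAsLatin1 is made up for illustration. It treats each byte of a UTF-8 string as one Latin-1 character and re-encodes the result as UTF-8 for display.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: decode `bytes` as if every byte were one Latin-1
// character, then re-encode those characters as UTF-8. This mimics a reader
// that receives UTF-8 data but interprets it with a single-byte charset.
std::string misreadAsLatin1(const std::string &bytes)
{
    std::string out;
    for (std::string::size_type i = 0; i < bytes.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c < 0x80)
        {
            // Plain ASCII is identical in Latin-1 and UTF-8.
            out += static_cast<char>(c);
        }
        else
        {
            // Code points U+0080..U+00FF need two bytes in UTF-8.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes of "Valérie" (Val, 0xC3 0xA9, rie) returns the bytes that display as "Valérie": the two bytes of the single character "é" come back as two separate characters.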
Comment 3 sero4linux 2006-03-28 17:44:56 UTC
I cannot confirm your UTF-8 problems here on Gentoo Linux with digiKam 0.9.0 SVN and LANG=de_DE.UTF-8. However, I found another strange behaviour while trying to reproduce it: after I copied the image to another dir/album, with Konqueror or inside digiKam, the comment was completely lost (although I had checked the "embedding the comments in exif" option).

I will have a look at it in the next few days, because Gilles is currently changing the digiKam internals that deal with EXIF data. Maybe my problem is related.

Can you please provide an example image?
Comment 4 caulier.gilles 2006-03-28 18:29:48 UTC
Yes Sebastian, let me finish removing the libKExif dependency from the digiKam core, and then we will tackle this problem in the trunk branch.

There are still some hours of work left to use Exiv2 instead of libKExif throughout the digiKam core. I think I will complete this task this week.

Remind me next week (:=)))...

Gilles Caulier
Comment 5 caulier.gilles 2006-04-03 16:27:37 UTC
The core metadata class is now updated. Please try again using the trunk svn branch implementation. Thanks in advance.

Gilles Caulier
Comment 6 Mikolaj Machowski 2006-04-04 00:31:48 UTC
Works for me. All old broken comments are now shown properly, and
moving images between albums no longer destroys comments. (.9svn)
Comment 7 caulier.gilles 2006-04-04 13:21:28 UTC
*** Bug 98462 has been marked as a duplicate of this bug. ***
Comment 8 Marcel Wiesweg 2006-05-05 23:17:26 UTC
SVN commit 537807 by mwiesweg:

Unicode support for JFIF and EXIF comments:

- use UTF8 for JFIF comment
- use Unicode (UCS-2) to write JPEG UserComment,
  support charset specification when reading the UserComment
   - add convertCommentValue method to DMetaData

Using UTF8 for JFIF is simple and easy and should work.

The UCS-2 support needs testing (and a decision if we
always want to write Unicode, or a way to find out when we need to
and when we can as well write ASCII)

CCBUG: 120241 114211


 M  +84 -28    dmetadata/dmetadata.cpp  
 M  +3 -0      dmetadata/dmetadata.h  
 M  +11 -3     widgets/metadata/exifwidget.cpp  


--- trunk/extragear/graphics/digikam/libs/dmetadata/dmetadata.cpp #537806:537807
@@ -28,6 +28,7 @@
 // Qt includes.
 
 #include <qfile.h>
+#include <qtextcodec.h>
 #include <qwmatrix.h>
 
 // KDE includes.
@@ -635,43 +636,46 @@
 QString DMetadata::getImageComment() const
 {
     try
-    {    
+    {
+        if (d->filePath.isEmpty())
+            return QString();
+
         // In first we trying to get image comments, outside of Exif and IPTC.
 
-        QString comments(d->imageComments.c_str());
-        
+        QString comments = QString::fromUtf8(d->imageComments.c_str());
+
         if (!comments.isEmpty())
-           return comments;
-           
-        // In second, we trying to get Exif comments   
-                
+            return comments;
+
+        // In second, we trying to get Exif comments
+
         if (!d->exifMetadata.empty())
         {
             Exiv2::ExifKey key("Exif.Photo.UserComment");
             Exiv2::ExifData exifData(d->exifMetadata);
             Exiv2::ExifData::iterator it = exifData.findKey(key);
-            
+
             if (it != exifData.end())
             {
-                QString ExifComment(it->toString().c_str());
-    
-                if (!ExifComment.isEmpty())
-                  return ExifComment;
+                QString exifComment = convertCommentValue(*it);
+
+                if (!exifComment.isEmpty())
+                  return exifComment;
             }
         }
-        
-        // In third, we trying to get IPTC comments   
-                
+
+        // In third, we trying to get IPTC comments
+
         if (!d->iptcMetadata.empty())
         {
             Exiv2::IptcKey key("Iptc.Application2.Caption");
             Exiv2::IptcData iptcData(d->iptcMetadata);
             Exiv2::IptcData::iterator it = iptcData.findKey(key);
-            
+
             if (it != iptcData.end())
             {
-                QString IptcComment(it->toString().c_str());
-    
+                QString IptcComment = QString::fromLatin1(it->toString().c_str());
+
                 if (!IptcComment.isEmpty())
                   return IptcComment;
             }
@@ -683,15 +687,15 @@
         kdDebug() << "Cannot get Image comments using Exiv2 (" 
                   << QString::fromLocal8Bit(e.what().c_str())
                   << ")" << endl;
-    }        
-    
+    }
+
     return QString();
 }
 
 bool DMetadata::setImageComment(const QString& comment)
 {
     try
-    {    
+    {
         if (comment.isEmpty())
             return false;
 
@@ -699,13 +703,21 @@
 
         // In first we trying to set image comments, outside of Exif and IPTC.
 
-        const std::string str(comment.latin1());
+        const std::string str(comment.utf8());
         d->imageComments = str;
 
         // In Second we write comments into Exif.
-                
-        d->exifMetadata["Exif.Photo.UserComment"] = comment.latin1();
-        
+
+        // Be aware that we are dealing with a UCS-2 string.
+        // Null termination means \0\0, strlen does not work,
+        // do not use any const-char*-only methods,
+        // pass a std::string and not a const char * to ExifDatum::operator=().
+        const unsigned short *ucs2 = comment.ucs2();
+        std::string exifComment("charset=\"Unicode\" ");
+        exifComment.append((const char*)ucs2, sizeof(unsigned short) * comment.length());
+        d->exifMetadata["Exif.Photo.UserComment"] = exifComment;
+        //d->exifMetadata["Exif.Photo.UserComment"] = comment.latin1();
+
         // In Third we write comments into Iptc. Note that Caption IPTC tag is limited to 2000 char.
 
         setImageProgramId();
@@ -713,7 +725,7 @@
         QString commentIptc = comment;
         commentIptc.truncate(2000);
         d->iptcMetadata["Iptc.Application2.Caption"] = commentIptc.latin1();
-    
+
         return true;
     }
     catch( Exiv2::Error &e )
@@ -721,11 +733,55 @@
         kdDebug() << "Cannot set Comment into image using Exiv2 (" 
                   << QString::fromLocal8Bit(e.what().c_str())
                   << ")" << endl;
-    }        
-    
+    }
+
     return false;
 }
 
+QString DMetadata::convertCommentValue(const Exiv2::Exifdatum &exifDatum)
+{
+    std::string comment = exifDatum.toString();
+    std::string charset;
+
+    // libexiv2 will prepend "charset=\"SomeCharset\" " if charset is specified
+    // Before conversion to QString, we must know the charset, so we stay with std::string for a while
+    if (comment.length() > 8 && comment.substr(0, 8) == "charset=")
+    {
+        // the prepended charset specification is followed by a blank
+        std::string::size_type pos = comment.find_first_of(' ');
+        if (pos != std::string::npos)
+        {
+            // extract string between the = and the blank
+            charset = comment.substr(8, pos-8);
+            // get the rest of the string after the charset specification
+            comment = comment.substr(pos+1);
+        }
+    }
+
+    if (charset == "\"Unicode\"")
+    {
+        // QString expects a null-terminated UCS-2 string.
+        // Is it already null terminated? In any case, add termination for safety.
+        comment += "\0\0";
+        return QString::fromUcs2((unsigned short *)comment.data());
+    }
+    else if (charset == "\"Jis\"")
+    {
+        QTextCodec *codec = QTextCodec::codecForName("JIS7");
+        return codec->toUnicode(comment.c_str());
+    }
+    else if (charset == "\"Ascii\"")
+    {
+        return QString::fromLatin1(comment.c_str());
+    }
+    else
+    {
+        // or from local8bit ??
+        return QString::fromLatin1(comment.c_str());
+    }
+}
+
+
 /*
 Iptc.Application2.Urgency <==> digiKam Rating links:
 
--- trunk/extragear/graphics/digikam/libs/dmetadata/dmetadata.h #537806:537807
@@ -30,6 +30,7 @@
 // Exiv2 includes.
 
 #include <exiv2/types.hpp>
+#include <exiv2/exif.hpp>
 
 // Local includes.
 
@@ -104,6 +105,8 @@
 
     PhotoInfoContainer getPhotographInformations() const;
 
+    static QString convertCommentValue(const Exiv2::Exifdatum &comment);
+
 private:
 
     DImg::FORMAT fileFormat(const QString& filePath);
--- trunk/extragear/graphics/digikam/libs/widgets/metadata/exifwidget.cpp #537806:537807
@@ -155,9 +155,17 @@
             QString key = QString::fromLocal8Bit(md->key().c_str());
 
             // Decode the tag value with a user friendly output.
-            std::ostringstream os;
-            os << *md;
-            QString tagValue = QString::fromLocal8Bit(os.str().c_str());
+            QString tagValue;
+            if (key == "Exif.Photo.UserComment")
+            {
+                tagValue = DMetadata::convertCommentValue(*md);
+            }
+            else
+            {
+                std::ostringstream os;
+                os << *md;
+                tagValue = QString::fromLocal8Bit(os.str().c_str());
+            }
             tagValue.replace("\n", " ");
 
             // We apply a filter to get only standard Exif tags, not maker notes.
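
The string handling in the commit above (a charset="..." prefix followed by raw UCS-2 bytes) can be sketched in standard C++ without Qt or Exiv2. This is only an illustration under stated assumptions: the function names splitCharsetPrefix, packUcs2 and appendUcs2Terminator are made up here, and the 16-bit units are stored in host byte order, as in the diff. One general std::string detail matters for such buffers: appending the literal "\0\0" through a const char * adds nothing, because the length is measured up to the first null byte, so the sketch appends the two null bytes by count instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: split "charset=\"Name\" payload" into the charset
// token (quotes included) and the payload. Without a prefix, the charset
// is empty and the payload is the whole input.
std::pair<std::string, std::string> splitCharsetPrefix(const std::string &value)
{
    const std::string marker("charset=");
    if (value.compare(0, marker.size(), marker) == 0)
    {
        // The prefix itself contains no blank, so the first blank ends it.
        const std::string::size_type pos = value.find(' ');
        if (pos != std::string::npos)
        {
            return std::make_pair(value.substr(marker.size(), pos - marker.size()),
                                  value.substr(pos + 1));
        }
    }
    return std::make_pair(std::string(), value);
}

// Hypothetical helper: build the prefixed value from `count` 16-bit units.
// The (pointer, length) overload of append() copies embedded null bytes too.
std::string packUcs2(const unsigned short *units, std::size_t count)
{
    std::string out("charset=\"Unicode\" ");
    out.append(reinterpret_cast<const char *>(units),
               sizeof(unsigned short) * count);
    return out;
}

// Hypothetical helper: add a two-byte null terminator to a byte buffer.
// append(count, char) is used because `s += "\0\0"` would append zero bytes.
void appendUcs2Terminator(std::string &s)
{
    s.append(2, '\0');
}
```

Keeping the payload in a std::string with an explicit length, rather than a const char *, is what allows the zero high bytes of UCS-2 text to survive the round trip.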
Comment 9 caulier.gilles 2006-05-05 23:50:38 UTC
Marcel, have you found any documentation about JFIF comment encoding?

Also, regarding the decision whether we always want to write Unicode or ASCII, I propose adding a QCheckBox option to the metadata setup dialog page. I think that Unicode should always be enabled by default.

What is your view?

Gilles
Comment 10 Cyril Sochor 2006-05-09 01:03:42 UTC
Created attachment 15985 [details]
fixed caption encoding when loading from jpeg exif
Comment 11 caulier.gilles 2006-05-09 01:16:08 UTC
SVN commit 538809 by cgilles:

digikam from stable : fix JFIF comments section encoding extraction to respect UTF8

CCMAIL: digikam-devel@kde.org
CCBUGS: 120241



 M  +1 -1      jpegmetadata.cpp  


--- branches/stable/extragear/graphics/digikam/libs/jpegutils/jpegmetadata.cpp #538808:538809
@@ -118,7 +118,7 @@
                 continue;
             }
 
-            comments = QString::fromAscii((const char*)marker->data,
+            comments = QString::fromUtf8((const char*)marker->data,
                                           marker->data_length);
         }
         else if (marker->marker == M_EXIF)
Comment 12 Marcel Wiesweg 2006-05-21 17:32:47 UTC
SVN commit 543272 by mwiesweg:

Add some autodetection magic for charset support

- DMetadata::detectEncodingAndDecode will check if a given string
  is in UTF8. If not, it will leave it to QTextCodec to decide
  if the local charset or latin1 will be used
- use detectEncodingAndDecode when reading the JFIF comment
  and for Exif comments with undefined encoding
- When writing the Exif comment, use UCS-2 only when
  necessary. Check with QTextCodec::canEncode if plain
  latin1 is enough.

I have tested this successfully with some Arabic and Cyrillic characters.
But please test this with some more pictures. UTF-8 should be no problem,
but the local8Bit vs. latin1 decision may be.

CCBUGS: 120241, 114211



 M  +75 -15    dmetadata.cpp  
 M  +3 -0      dmetadata.h  


--- trunk/extragear/graphics/digikam/libs/dmetadata/dmetadata.cpp #543271:543272
@@ -33,7 +33,9 @@
 
 // KDE includes.
 
+#include <kapplication.h>
 #include <kdebug.h>
+#include <kstringhandler.h>
 #include <ktempfile.h>
 
 // Exiv2 includes.
@@ -714,7 +716,7 @@
 
         // In first we trying to get image comments, outside of Exif and IPTC.
 
-        QString comments = QString::fromUtf8(d->imageComments.c_str());
+        QString comments = detectEncodingAndDecode(d->imageComments);
 
         if (!comments.isEmpty())
             return comments;
@@ -780,18 +782,32 @@
 
         // In Second we write comments into Exif.
 
-        // Be aware that we are dealing with a UCS-2 string.
-        // Null termination means \0\0, strlen does not work,
-        // do not use any const-char*-only methods,
-        // pass a std::string and not a const char * to ExifDatum::operator=().
-        const unsigned short *ucs2 = comment.ucs2();
-        std::string exifComment("charset=\"Unicode\" ");
-        exifComment.append((const char*)ucs2, sizeof(unsigned short) * comment.length());
-        d->exifMetadata["Exif.Photo.UserComment"] = exifComment;
-        //d->exifMetadata["Exif.Photo.UserComment"] = comment.latin1();
+        // Write as Unicode only when necessary.
+        QTextCodec *latin1Codec = QTextCodec::codecForName("iso8859-1");
+        if (latin1Codec->canEncode(comment))
+        {
+            // write as ASCII
+            std::string exifComment("charset=\"Ascii\" ");
+            exifComment += comment.latin1();
+            d->exifMetadata["Exif.Photo.UserComment"] = exifComment;
+        }
+        else
+        {
+            // write as Unicode (UCS-2)
 
-        // In Third we write comments into Iptc. Note that Caption IPTC tag is limited to 2000 char.
+            // Be aware that we are dealing with a UCS-2 string.
+            // Null termination means \0\0, strlen does not work,
+            // do not use any const-char*-only methods,
+            // pass a std::string and not a const char * to ExifDatum::operator=().
+            const unsigned short *ucs2 = comment.ucs2();
+            std::string exifComment("charset=\"Unicode\" ");
+            exifComment.append((const char*)ucs2, sizeof(unsigned short) * comment.length());
+            d->exifMetadata["Exif.Photo.UserComment"] = exifComment;
+        }
 
+        // In Third we write comments into Iptc.
+        // Note that Caption IPTC tag is limited to 2000 char and ASCII charset.
+
         QString commentIptc = comment;
         commentIptc.truncate(2000);
         d->iptcMetadata["Iptc.Application2.Caption"] = commentIptc.latin1();
@@ -815,7 +831,7 @@
     {
         std::string comment = exifDatum.toString();
         std::string charset;
-    
+
         // libexiv2 will prepend "charset=\"SomeCharset\" " if charset is specified
         // Before conversion to QString, we must know the charset, so we stay with std::string for a while
         if (comment.length() > 8 && comment.substr(0, 8) == "charset=")
@@ -830,7 +846,7 @@
                 comment = comment.substr(pos+1);
             }
         }
-    
+
         if (charset == "\"Unicode\"")
         {
             // QString expects a null-terminated UCS-2 string.
@@ -849,8 +865,7 @@
         }
         else
         {
-            // or from local8bit ??
-            return QString::fromLatin1(comment.c_str());
+            return detectEncodingAndDecode(comment);
         }
     }
     catch( Exiv2::Error &e )
@@ -863,6 +878,51 @@
     return QString();
 }
 
+QString DMetadata::detectEncodingAndDecode(const std::string &value)
+{
+    // For charset autodetection, we could use sophisticated code
+    // (Mozilla chardet, KHTML's autodetection, QTextCodec::codecForContent),
+    // but that is probably too much.
+    // We check for UTF8, Local encoding and ASCII.
+
+    if (value.empty())
+        return QString();
+
+#if KDE_IS_VERSION(3,2,0)
+    if (KStringHandler::isUtf8(value.c_str()))
+    {
+        return QString::fromUtf8(value.c_str());
+    }
+#else
+    // anyone who is still running KDE 3.0 or 3.1 is missing so many features
+    // that he will have to accept this missing feature.
+    return QString::fromUtf8(value.c_str());
+#endif
+
+    // Utf8 has a pretty unique byte pattern.
+    // Thats not true for ASCII, it is not possible
+    // to reliably autodetect different ISO-8859 charsets.
+    // We try if QTextCodec can decide here, otherwise we use Latin1.
+    // Or use local8Bit as default?
+
+    // load QTextCodecs
+    QTextCodec *latin1Codec = QTextCodec::codecForName("iso8859-1");
+    //QTextCodec *utf8Codec   = QTextCodec::codecForName("utf8");
+    QTextCodec *localCodec  = QTextCodec::codecForLocale();
+
+    // make heuristic match
+    int latin1Score = latin1Codec->heuristicContentMatch(value.c_str(), value.length());
+    int localScore  = localCodec->heuristicContentMatch(value.c_str(), value.length());
+
+    // convert string:
+    // Use whatever has the larger score, local or ASCII
+    if (localScore >= 0 && localScore >= latin1Score)
+        return localCodec->toUnicode(value.c_str(), value.length());
+    else
+        return QString::fromLatin1(value.c_str());
+}
+
+
 /*
 Iptc.Application2.Urgency <==> digiKam Rating links:
 
--- trunk/extragear/graphics/digikam/libs/dmetadata/dmetadata.h #543271:543272
@@ -21,6 +21,8 @@
 #ifndef DMETADATA_H
 #define DMETADATA_H
 
+#include <string>
+
 // QT includes.
 
 #include <qcstring.h>
@@ -108,6 +110,7 @@
     PhotoInfoContainer getPhotographInformations() const;
 
     static QString convertCommentValue(const Exiv2::Exifdatum &comment);
+    static QString detectEncodingAndDecode(const std::string &value);
 
 private:
 
Comment 13 Marcel Wiesweg 2006-05-22 20:42:37 UTC
We now have support for reading, autodetecting and writing comments as UTF8. Closing this bug.
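
The autodetection idea from Comment 12 (check for UTF-8 first, otherwise fall back to a single-byte charset) can be sketched in standard C++. This is a simplified illustration, not the digiKam implementation: the names looksLikeUtf8, latin1ToUtf8 and decodeToUtf8 are made up, the validity check is hand-written rather than KStringHandler::isUtf8, and the fallback is always Latin-1, whereas the commit also scores the local codec with QTextCodec.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` is well-formed UTF-8. Overlong forms,
// surrogates and values above U+10FFFF are rejected.
bool looksLikeUtf8(const std::string &s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;      // number of continuation bytes expected
        unsigned long cp = 0;       // decoded code point
        unsigned long minCp = 0;    // smallest value allowed for this length

        if (c < 0x80)                { ++i; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; minCp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; minCp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; minCp = 0x10000; }
        else return false;          // stray continuation or invalid lead byte

        if (n - i - 1 < extra)
            return false;           // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;       // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: re-encode Latin-1 bytes as UTF-8.
std::string latin1ToUtf8(const std::string &s)
{
    std::string out;
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80)
        {
            out += static_cast<char>(c);
        }
        else
        {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Keep the bytes if they already are UTF-8, otherwise assume Latin-1.
std::string decodeToUtf8(const std::string &raw)
{
    return looksLikeUtf8(raw) ? raw : latin1ToUtf8(raw);
}
```

Checking for UTF-8 first works because multi-byte UTF-8 sequences follow a strict pattern that ordinary Latin-1 text rarely matches by accident. The reverse is not true, which is why the choice among single-byte charsets remains a guess.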