Summary: | image comments encoding unreadable after moving an album | | |
---|---|---|---|
Product: | [Applications] digikam | Reporter: | Nadav Kavalerchik <nadavkav> |
Component: | Metadata-Engine | Assignee: | Digikam Developers <digikam-bugs-null> |
Status: | RESOLVED FIXED | | |
Severity: | normal | CC: | caulier.gilles |
Priority: | NOR | | |
Version: | 0.8.0 | | |
Target Milestone: | --- | | |
Platform: | Fedora RPMs | | |
OS: | Linux | | |
Latest Commit: | | Version Fixed In: | 0.9.0 |
Description
Nadav Kavalerchik
2005-10-11 08:49:07 UTC
Have you perhaps moved the digikam3.db as well, and forgotten to link it back? Is it on NFS now, perhaps? Could you install the sqlite3 command line client and do the following:

    sqlite3 /path/to/images/digikam3.db

Then you get a sqlite> prompt. Enter "select * from settings;" (without the quotes) and paste the result here.

I have not moved the db file. I've just moved a folder to a different physical disk on the same machine.

    sqlite3 ./digikam3.db
    SQLite version 3.2.7
    Enter ".help" for instructions
    sqlite> select * from settings;
    DBVersion|1
    Locale|UTF-8
    UpgradedFromSqlite2|yes
    Scanned|2005-10-12T08:31:10

Can you also paste the output of:

    cat .kde/share/config/digikamrc | grep Local

and paste the output of the command locale?

    Locale=en_US.UTF-8

What does "locale" say when you type that in a konsole?

    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

From your description, I assume you did the move externally and then opened digikam? In this case, does the problem also occur (test with a single image) if you copy/move a file to a different album from inside digikam? Then, does it happen if you right-click -> copy, then right-click -> paste (choose a different filename)?

OK, here is what I did...
I copied (and also tried to move) images with Hebrew comments from within digikam, and all works just fine.
Now, when I open the moved/copied images in konqueror and view the meta-info tab (properties view), I get the Hebrew text correctly.
Moving images (using konqueror) and then checking the text comments inside digikam shows unreadable chars (example: × ×× ×××× ×).
So... what I assume is that digikam is reading the Hebrew text from the exif comment using the wrong encoding (maybe reading it as Hebrew iso-8859-8 and converting it to utf-8, while it is in utf-8 in the first place. It's only an idea!)
kindly, nadav :-)
btw. amazing applications !!!

You are right: when moving inside digikam, the database entries are just copied. When the move is done outside digikam, digikam will find new images and re-read the exif comments. So it is a DMetaData/libexiv2 problem, but possibly the specifications are the actual problem (I think IPTC only knows latin encoding).
Testing this here, I set some Hebrew and Arabic text from the internet as a comment, and it works well inside the db, but I cannot read it from Konqueror or the metadata tabs (no utf8 locale?).
Anyway, in setImageComment, all comments are converted to latin1. I do not know enough about libexiv2 and the specs to go further into this.

Can you open/move the bug report to the libexiv2 project?

I would rather move this bug to Gilles :-)
Maybe this is the fault of our own DMetadata; maybe libexiv2 does not touch the strings at all.

Bug 120241 may be the same problem as this one.

excellent :-) thanks.

Marcel,

To clarify how comments are embedded into JPEG files, I will describe this feature. There are 3 ways to embed a comment in a JPEG:

1/ The first is the JFIF COM section. Important: this section is outside the EXIF/IPTC/XMP sections. Contrary to what the konqueror file properties dialog says, the comments shown there don't come from Exif but from JFIF !!! (Perhaps this problem has been fixed in KDE 3.5.x.) There is no known size limitation, but I don't know if UTF8 is supported here. This way has always been used by digikam since the 0.7.x release to store comments in JPEG files.

2/ The second is the Exif UserComments tag. As the Exiv2 documentation says, this field supports UTF8. In trunk, the DMetadata class implementation needs to be fixed following this example.
//-------------------------------------------------------------
    Exiv2::Image::AutoPtr image = Exiv2::ImageFactory::open(argv[1]);
    assert (image.get() != 0);
    image->readMetadata();
    Exiv2::ExifData &exifData = image->exifData();

    /*
      Exiv2 uses a CommentValue for Exif user comments. The format of the
      comment string includes an optional charset specification at the beginning:

      [charset=["]Ascii|Jis|Unicode|Undefined["] ]comment

      Undefined is used as a default if the comment doesn't start with a charset
      definition.

      Following are a few examples of valid comments. The last one is written to
      the file.
    */
    exifData["Exif.Photo.UserComment"]
        = "charset=\"Unicode\" An Unicode Exif comment added with Exiv2";
    exifData["Exif.Photo.UserComment"]
        = "charset=\"Undefined\" An undefined Exif comment added with Exiv2";
    exifData["Exif.Photo.UserComment"]
        = "Another undefined Exif comment added with Exiv2";
    exifData["Exif.Photo.UserComment"]
        = "charset=Ascii An ASCII Exif comment added with Exiv2";

    std::cout << "Writing user comment '"
              << exifData["Exif.Photo.UserComment"]
              << "' back to the image\n";

    image->writeMetadata();
//-------------------------------------------------------------

3/ The last is the IPTC caption tag: it does not support UTF8 (only ascii latin1) and is limited to 2000 characters.

In fact, to definitively solve the UTF8 problem with metadata in the future, we need to support XMP metadata. Exiv2 will support XMP in the future:

http://dev.robotbattle.com/bugs/view_all_bug_page.php
(select the Exiv2 project at the top-right of the page)

We need to be patient (:=)))...

Some URLs:

http://park2.wakwak.com/~tsuruzoh/Computer/Digicams/exif-e.html
http://www.iptc.org/IIM/4.1/specification/IIMV4.1.pdf
http://www.adobe.com/products/xmp/main.html

Gilles Caulier

So we can do something about 1) and 2), not 3).
Most important is 1), because the JFIF comment is read first, so when digikam reads any comment written by digikam, this one is used.

1) JFIF comment: Konqueror (the Jpeg KFilePlugin from kdegraphics) reads the JFIF comment as utf8. I could not find any spec, and I do not know how other apps do it, but reading and writing utf8 here is the easiest solution.

2) EXIF comment: We need to interpret the charset information from exiv2; I have some code for this on my computer.
One question is what we do with the undefined charset: Ascii or local8Bit? The spec is not specific on this question either ("Although the possibility of unreadable characters exists, display of these characters is left as a matter of reader implementation.")
The second question is whether we write the EXIF comment as Unicode (I think ucs2, not utf8). Other choices are Ascii or Undefined. Said KFilePlugin does not support unicode (it reads ascii, but the value is not used at all, I think). Which other apps could I test this with?

SVN commit 543272 by mwiesweg:

Add some autodetection magic for charset support
- DMetadata::detectEncodingAndDecode will check if a given string is in UTF8.
  If not, it will leave it to QTextCodec to decide if the local charset or latin1 will be used
- use detectEncodingAndDecode when reading the JFIF comment
  and for Exif comments with undefined encoding
- When writing the Exif comment, use UCS-2 only when necessary.
  Check with QTextCodec::canEncode if plain latin1 is enough.

I have tested this successfully with some Arabic and Cyrillic characters.
But please test this with some more pictures. UTF-8 should be no problem,
but the local8Bit vs. latin1 decision may be.

CCBUGS: 120241, 114211

 M +75 -15 dmetadata.cpp
 M +3 -0 dmetadata.h

--- trunk/extragear/graphics/digikam/libs/dmetadata/dmetadata.cpp #543271:543272
@@ -33,7 +33,9 @@
 // KDE includes.
+#include <kapplication.h>
 #include <kdebug.h>
+#include <kstringhandler.h>
 #include <ktempfile.h>
 // Exiv2 includes.
@@ -714,7 +716,7 @@
     // In first we trying to get image comments, outside of Exif and IPTC.
-    QString comments = QString::fromUtf8(d->imageComments.c_str());
+    QString comments = detectEncodingAndDecode(d->imageComments);
     if (!comments.isEmpty())
         return comments;
@@ -780,18 +782,32 @@
     // In Second we write comments into Exif.
-    // Be aware that we are dealing with a UCS-2 string.
-    // Null termination means \0\0, strlen does not work,
-    // do not use any const-char*-only methods,
-    // pass a std::string and not a const char * to ExifDatum::operator=().
-    const unsigned short *ucs2 = comment.ucs2();
-    std::string exifComment("charset=\"Unicode\" ");
-    exifComment.append((const char*)ucs2, sizeof(unsigned short) * comment.length());
-    d->exifMetadata["Exif.Photo.UserComment"] = exifComment;
-    //d->exifMetadata["Exif.Photo.UserComment"] = comment.latin1();
+    // Write as Unicode only when necessary.
+    QTextCodec *latin1Codec = QTextCodec::codecForName("iso8859-1");
+    if (latin1Codec->canEncode(comment))
+    {
+        // write as ASCII
+        std::string exifComment("charset=\"Ascii\" ");
+        exifComment += comment.latin1();
+        d->exifMetadata["Exif.Photo.UserComment"] = exifComment;
+    }
+    else
+    {
+        // write as Unicode (UCS-2)
-    // In Third we write comments into Iptc. Note that Caption IPTC tag is limited to 2000 char.
+        // Be aware that we are dealing with a UCS-2 string.
+        // Null termination means \0\0, strlen does not work,
+        // do not use any const-char*-only methods,
+        // pass a std::string and not a const char * to ExifDatum::operator=().
+        const unsigned short *ucs2 = comment.ucs2();
+        std::string exifComment("charset=\"Unicode\" ");
+        exifComment.append((const char*)ucs2, sizeof(unsigned short) * comment.length());
+        d->exifMetadata["Exif.Photo.UserComment"] = exifComment;
+    }
+    // In Third we write comments into Iptc.
+    // Note that Caption IPTC tag is limited to 2000 char and ASCII charset.
+
     QString commentIptc = comment;
     commentIptc.truncate(2000);
     d->iptcMetadata["Iptc.Application2.Caption"] = commentIptc.latin1();
@@ -815,7 +831,7 @@
     {
         std::string comment = exifDatum.toString();
         std::string charset;
-
+
         // libexiv2 will prepend "charset=\"SomeCharset\" " if charset is specified
         // Before conversion to QString, we must know the charset, so we stay with std::string for a while
         if (comment.length() > 8 && comment.substr(0, 8) == "charset=")
@@ -830,7 +846,7 @@
                 comment = comment.substr(pos+1);
             }
         }
-
+
         if (charset == "\"Unicode\"")
         {
             // QString expects a null-terminated UCS-2 string.
@@ -849,8 +865,7 @@
         }
         else
         {
-            // or from local8bit ??
-            return QString::fromLatin1(comment.c_str());
+            return detectEncodingAndDecode(comment);
         }
     }
     catch( Exiv2::Error &e )
@@ -863,6 +878,51 @@
     return QString();
 }
+QString DMetadata::detectEncodingAndDecode(const std::string &value)
+{
+    // For charset autodetection, we could use sophisticated code
+    // (Mozilla chardet, KHTML's autodetection, QTextCodec::codecForContent),
+    // but that is probably too much.
+    // We check for UTF8, Local encoding and ASCII.
+
+    if (value.empty())
+        return QString();
+
+#if KDE_IS_VERSION(3,2,0)
+    if (KStringHandler::isUtf8(value.c_str()))
+    {
+        return QString::fromUtf8(value.c_str());
+    }
+#else
+    // anyone who is still running KDE 3.0 or 3.1 is missing so many features
+    // that he will have to accept this missing feature.
+    return QString::fromUtf8(value.c_str());
+#endif
+
+    // Utf8 has a pretty unique byte pattern.
+    // Thats not true for ASCII, it is not possible
+    // to reliably autodetect different ISO-8859 charsets.
+    // We try if QTextCodec can decide here, otherwise we use Latin1.
+    // Or use local8Bit as default?
+
+    // load QTextCodecs
+    QTextCodec *latin1Codec = QTextCodec::codecForName("iso8859-1");
+    //QTextCodec *utf8Codec = QTextCodec::codecForName("utf8");
+    QTextCodec *localCodec = QTextCodec::codecForLocale();
+
+    // make heuristic match
+    int latin1Score = latin1Codec->heuristicContentMatch(value.c_str(), value.length());
+    int localScore = localCodec->heuristicContentMatch(value.c_str(), value.length());
+
+    // convert string:
+    // Use whatever has the larger score, local or ASCII
+    if (localScore >= 0 && localScore >= latin1Score)
+        return localCodec->toUnicode(value.c_str(), value.length());
+    else
+        return QString::fromLatin1(value.c_str());
+}
+
+
 /*
 Iptc.Application2.Urgency <==> digiKam Rating links:
--- trunk/extragear/graphics/digikam/libs/dmetadata/dmetadata.h #543271:543272
@@ -21,6 +21,8 @@
 #ifndef DMETADATA_H
 #define DMETADATA_H
+#include <string>
+
 // QT includes.
 #include <qcstring.h>
@@ -108,6 +110,7 @@
     PhotoInfoContainer getPhotographInformations() const;
     static QString convertCommentValue(const Exiv2::Exifdatum &comment);
+    static QString detectEncodingAndDecode(const std::string &value);
 private:

For JFIF, we now have support for reading, autodetecting and writing comments as UTF8.
For Exif, we interpret all charsets that are specified and use the same autodetection for unspecified charsets.
Closing this bug.
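
For readers without a Qt3/KDE3 tree at hand, the "try UTF-8 first, then fall back" idea behind detectEncodingAndDecode can be sketched in plain standard C++. This is only an illustration under assumptions, not the code from the commit: the names isValidUtf8, latin1ToUtf8 and decodeCommentToUtf8 are hypothetical, the result is a UTF-8 std::string instead of a QString, and the QTextCodec heuristic that chooses between the local charset and latin1 is replaced by a plain latin1 fallback.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `bytes` is well-formed UTF-8.
// Rejects truncated sequences, overlong forms, surrogates and values > U+10FFFF.
bool isValidUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { ++i; continue; }          // plain ASCII byte

        std::size_t extra = 0;                       // continuation bytes expected
        unsigned long cp = 0;                        // decoded code point
        unsigned long minCp = 0;                     // smallest value for this length
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minCp = 0x10000; }
        else return false;                           // stray continuation or invalid lead

        if (n - i <= extra) return false;            // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: re-encode ISO-8859-1 bytes as UTF-8.
// Every latin1 byte maps directly to the code point of the same value.
std::string latin1ToUtf8(const std::string &bytes)
{
    std::string out;
    out.reserve(bytes.size() * 2);
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c < 0x80)
        {
            out.push_back(static_cast<char>(c));
        }
        else
        {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Simplified stand-in for the detection step: keep valid UTF-8 untouched,
// otherwise assume latin1 (the commit asks QTextCodec to pick local vs. latin1).
std::string decodeCommentToUtf8(const std::string &raw)
{
    if (raw.empty())
        return std::string();

    const std::string result = isValidUtf8(raw) ? raw : latin1ToUtf8(raw);
    assert(isValidUtf8(result));                     // both paths yield valid UTF-8
    return result;
}
```

Calling decodeCommentToUtf8 on the raw bytes of a JFIF comment would leave a UTF-8 Hebrew comment unchanged, while a latin1 comment such as the bytes "caf\xE9" would be re-encoded instead of being misread. Checking UTF-8 first is reasonable for the reason given in the commit's code comments: valid multi-byte UTF-8 has a distinctive byte pattern, whereas the various ISO-8859 charsets cannot be told apart reliably.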