| Summary: | ID3v2 toCString(true) results in double UTF8 conversion | | |
|---|---|---|---|
| Product: | [Frameworks and Libraries] taglib | Reporter: | Tomas Simonaitis <haden> |
| Component: | general | Assignee: | Scott Wheeler <wheeler> |
| Status: | RESOLVED NOT A BUG | | |
| Severity: | normal | | |
| Priority: | NOR | | |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Platform: | Debian stable | | |
| OS: | Linux | | |
| Attachments: | Test code, Upd test | | |
Description
Tomas Simonaitis
2005-08-21 19:16:31 UTC
I don't quite understand what you're describing. If an ID3v2 tag contains data in UTF-8 format, it is converted to UTF-16 when the tag is read. If you then use toCString(true), it is converted back to UTF-8 on the way out.

Possibly what you're seeing is a tag that contains UTF-8 data but is not appropriately marked as UTF-8 (i.e. it is probably marked as ISO-8859-1), so the conversion functions don't work properly. If you send me one of the files that you're having problems with by email (using the filename 111232.mp3), I'll confirm.

Oh, I also forgot to note that JuK uses the same conversion functions and I regularly have "extended" characters in my tags...

Created attachment 12306
Test code
Running this code (with mp3 file arg.) results in:
input='ž' (2 bytes)
title(true)='ž' (4 bytes) // this, I think, is wrong
title(false)='ž' (2 bytes)
I "stripped" the tested mp3 file with: id3v2 -D <mp3>
However, if the encoding information is somehow preserved, maybe taglib should overwrite it when setting the tag?

Your code is wrong though: you're just using the default (implicit) constructor for a TagLib::String, which assumes that the data is encoded in ISO-8859-1. If you switch the line:
fs.tag()->setTitle(utf8);
to:
fs.tag()->setTitle(TagLib::String(utf8, TagLib::String::UTF8));
then it should work fine.

Also note that if you're setting the information with the "id3v2" command line tool, it doesn't accept UTF-8 input. It uses id3lib, which is limited to ID3v2.3, which in turn is limited to ISO-8859-1 and UTF-16. In a nutshell, you're writing invalid tags, and TagLib just gives them back to you that way. :-)

Created attachment 12307
Upd test
I've changed it, but the results now are:
input='ž' (2 bytes)
taglib str(true)='ž' (2 bytes)
taglib str(false)='~' (1 bytes)
title(true)='~' (1 bytes)
title(false)='~' (1 bytes)
I'm actually only reading tags with taglib,
however amarok fails to set them, and it seems it suffers from the same utf8
problem.
They use:
#define strip( x ) TStringToQString( x ).stripWhiteSpace()
m_title = strip( tag->title() );
to read the tag and:
t->setTitle( QStringToTString( mb.title() ) );
to set it.
Sorry, one additional missing line:
TagLib::ID3v2::FrameFactory::instance()->setDefaultTextEncoding(TagLib::String::UTF8);
I tested with that and it works. In TagLib 2.0 I'll reconsider the default encoding (i.e. making it UTF-8), but at the time that 1.0 was written most consoles still defaulted to local encodings rather than UTF-8.

Thank you. Tested on amarok and it works fine.
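
For readers following the byte counts in the two test runs, here is a minimal standalone C++ sketch of the two effects discussed in this thread. It does not use TagLib. The helper names (toUtf8, fromLatin1, toLatin1Truncating) are hypothetical and exist only for illustration, and the truncating conversion is an assumption about how an 8-bit conversion might narrow a character outside ISO-8859-1. It is consistent with the '~' seen in the second run, but it is not taken from the TagLib sources.

```cpp
#include <string>

// Hypothetical helper: encode Unicode code points as UTF-8.
std::string toUtf8(const std::u32string &text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: treat every input byte as one ISO-8859-1 character.
// This is the (wrong) interpretation applied when UTF-8 bytes are passed
// to a constructor that assumes ISO-8859-1.
std::u32string fromLatin1(const std::string &bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        out += static_cast<char32_t>(b);
    }
    return out;
}

// Hypothetical helper, ASSUMED behaviour: narrow each code point to one
// byte by discarding everything above the low 8 bits.
std::string toLatin1Truncating(const std::u32string &text) {
    std::string out;
    for (char32_t cp : text) {
        out += static_cast<char>(cp & 0xFF);
    }
    return out;
}
```

With these helpers, toUtf8(fromLatin1("\xC5\xBE")) takes the two UTF-8 bytes of 'ž' (U+017E), reads them as two ISO-8859-1 characters, and re-encodes them as the four bytes C3 85 C2 BE, which matches the "4 bytes" reported in the first run. toLatin1Truncating(U"\u017E") yields the single byte 7E, which is '~', matching the 1-byte results in the second run, since 'ž' has no ISO-8859-1 representation.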