Bug 111232 - ID3v2 toCString(true) results in double UTF8 conversion
Status: RESOLVED NOT A BUG
Alias: None
Product: taglib
Classification: Frameworks and Libraries
Component: general
Version: unspecified
Platform: Debian stable Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Scott Wheeler
 
Reported: 2005-08-21 19:16 UTC by Tomas Simonaitis
Modified: 2005-08-21 21:54 UTC (History)

Attachments
Test code (592 bytes, text/x-c++src)
2005-08-21 20:38 UTC, Tomas Simonaitis
Upd test (846 bytes, text/x-c++src)
2005-08-21 21:14 UTC, Tomas Simonaitis

Description Tomas Simonaitis 2005-08-21 19:16:31 UTC
Version:           1.4 (using KDE 3.3.2)
Installed from:    Debian stable Packages
OS:                Linux

If an ID3v2 tag already holds a UTF-8 string, toCString(true) does an unnecessary conversion.
This breaks applications that use toCString(true) or TStringToQString
(and expect a UTF-8 string).

To reproduce:
Set tag to say "ž" (2 bytes), toCString(true) will return "ž" (4 bytes).
(toCString(false) will return the correct UTF-8 string.)

{amarok which uses TStringToQString will show invalid tag}.
Comment 1 Scott Wheeler 2005-08-21 20:02:45 UTC
I don't quite understand what you're describing -- if an ID3v2 tag contains data in UTF-8 format then it is converted to UTF-16 when the tag is read.  If you then use toCString(true) it is then converted back to UTF-8 on the way out.

Possibly what you're seeing is something that has UTF-8 data in the tag, but it's not appropriately marked as being UTF-8 (i.e. probably marked as using ISO-8859-1) and then the conversion functions don't work properly.

If you send me one of the files that you're having problems with by email (using the filename 111232.mp3) I'll confirm.
Comment 2 Scott Wheeler 2005-08-21 20:03:35 UTC
Oh, also forgot to note that JuK uses the same conversion functions and I regularly have "extended" characters in my tags...
Comment 3 Tomas Simonaitis 2005-08-21 20:38:19 UTC
Created attachment 12306
Test code

Running this code (with mp3 file arg.) results in:
input='ž' (2 bytes)
title(true)='ž' (4 bytes) // this, I think, is wrong
title(false)='ž' (2 bytes)
Comment 4 Tomas Simonaitis 2005-08-21 20:51:53 UTC
I've "stripped" the tested mp3 file with:
id3v2 -D <mp3>

However, if the encoding information is somehow preserved, maybe taglib should overwrite it when setting the tag?
Comment 5 Scott Wheeler 2005-08-21 20:54:46 UTC
Your code is wrong though -- you're just using the default (implicit) constructor for a TagLib::String, which assumes that the data is encoded in ISO-8859-1.

If you switch the line:

fs.tag()->setTitle(utf8);

to:

fs.tag()->setTitle(TagLib::String(utf8, TagLib::String::UTF8));

Then it should work fine.

Also note that if you're setting the information with the "id3v2" command line tool then it doesn't accept UTF-8 input.  It uses id3lib, which is limited to ID3v2.3 which in turn is limited to ISO-8859-1 and UTF-16.

In a nutshell you're writing invalid tags -- TagLib just gives them back to you that way.  :-)
Comment 6 Tomas Simonaitis 2005-08-21 21:14:25 UTC
Created attachment 12307
Upd test

I've changed it, but the results now are:
input='ž' (2 bytes)
taglib str(true)='ž' (2 bytes)
taglib str(false)='~' (1 bytes)
title(true)='~' (1 bytes)
title(false)='~' (1 bytes)

I'm actually only reading tags with taglib.
However, amarok fails to set them, and it seems to suffer from the same utf8
problem.

They use:
#define strip( x ) TStringToQString( x ).stripWhiteSpace()
m_title   = strip( tag->title() );
to read tag and:
t->setTitle( QStringToTString( mb.title() ) );
to set it.
Comment 7 Scott Wheeler 2005-08-21 21:26:34 UTC
Sorry, one additional missing line:

TagLib::ID3v2::FrameFactory::instance()->setDefaultTextEncoding(TagLib::String::UTF8);

I tested with that and it works.

In TagLib 2.0 I'll reconsider the default encoding (i.e. making it UTF-8), but at the time that 1.0 was written most consoles still defaulted to local encodings rather than UTF-8.
Comment 8 Tomas Simonaitis 2005-08-21 21:54:59 UTC
Thank You.
Tested on amarok and it works fine.