111232 – ID3v2 toCString(true) results in double UTF8 conversion

Status:	RESOLVED NOT A BUG

Alias:	None

Product:	taglib
Classification:	Frameworks and Libraries
Component:	general
Version:	unspecified
Platform:	Debian stable Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Scott Wheeler

Reported:	2005-08-21 19:16 UTC by Tomas Simonaitis
Modified:	2005-08-21 21:54 UTC
CC List:	0 users

Attachments
Test code (592 bytes, text/x-c++src) 2005-08-21 20:38 UTC, Tomas Simonaitis
Upd test (846 bytes, text/x-c++src) 2005-08-21 21:14 UTC, Tomas Simonaitis

Description Tomas Simonaitis 2005-08-21 19:16:31 UTC

Version:           1.4 (using KDE 3.3.2)
Installed from:    Debian stable Packages
OS:                Linux

If an ID3v2 tag already holds a UTF-8 string, toCString(true) performs an unnecessary second conversion.
This breaks applications that use toCString(true) or TStringToQString
and expect a UTF-8 string.

To reproduce:
Set the tag to, say, "ž" (2 bytes); toCString(true) will return "Å¾" (4 bytes).
(toCString(false) will return the correct UTF-8 string.)

(Amarok, which uses TStringToQString, will show an invalid tag.)
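
The reported bytes are consistent with a UTF-8 string being read as ISO-8859-1 and then encoded to UTF-8 a second time. A minimal sketch of that effect, independent of TagLib (latin1ToUtf8 is a hypothetical helper written for illustration, not a TagLib function):

```cpp
#include <string>

// Hypothetical helper (not part of TagLib): treat every input byte as an
// ISO-8859-1 code point and encode that code point as UTF-8.
std::string latin1ToUtf8(const std::string &in)
{
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII is identical in both encodings.
            out += static_cast<char>(c);
        } else {
            // Code points 0x80..0xFF become two UTF-8 bytes.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// "ž" in UTF-8 is the two bytes C5 BE. Feeding those bytes through
// latin1ToUtf8 yields C3 85 C2 BE, which a UTF-8 terminal shows as "Å¾".
```

Calling latin1ToUtf8("\xC5\xBE") gives a 4-byte string, matching the sizes reported above.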

Comment 1 Scott Wheeler 2005-08-21 20:02:45 UTC

I don't quite understand what you're describing -- if an ID3v2 tag contains data in UTF-8 format then it is converted to UTF-16 when the tag is read.  If you then use toCString(true) it is then converted back to UTF-8 on the way out.

Possibly what you're seeing is something that has UTF-8 data in the tag, but it's not appropriately marked as being UTF-8 (i.e. probably marked as using ISO-8859-1) and then the conversion functions don't work properly.
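
For comparison, here is a rough sketch of the round trip described above when the data really is treated as UTF-8: decoding to UTF-16 and encoding back returns the original bytes. The helpers utf8ToUtf16 and utf16ToUtf8 are hypothetical, handle only 1- to 3-byte sequences (no surrogate pairs), and are not TagLib's implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 (BMP only) into UTF-16 code units.
// Malformed or unsupported input is replaced with U+FFFD.
std::u16string utf8ToUtf16(const std::string &in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char16_t>(c);
            i += 1;
        } else if ((c & 0xE0) == 0xC0 && i + 1 < in.size()) {
            out += static_cast<char16_t>(((c & 0x1F) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        } else if ((c & 0xF0) == 0xE0 && i + 2 < in.size()) {
            out += static_cast<char16_t>(((c & 0x0F) << 12) |
                                         ((in[i + 1] & 0x3F) << 6) |
                                         (in[i + 2] & 0x3F));
            i += 3;
        } else {
            out += u'\uFFFD';
            i += 1;
        }
    }
    return out;
}

// Hypothetical helper: encode UTF-16 code units (BMP only) as UTF-8.
std::string utf16ToUtf8(const std::u16string &in)
{
    std::string out;
    for (char16_t u : in) {
        if (u < 0x80) {
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

With these, utf8ToUtf16("\xC5\xBE") is the single code unit U+017E, and converting it back gives the same two bytes, so a correctly marked UTF-8 tag would not grow to 4 bytes.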

If you send me one of the files that you're having problems with by email (using the filename 111232.mp3) I'll confirm.

Comment 2 Scott Wheeler 2005-08-21 20:03:35 UTC

Oh, also forgot to note that JuK uses the same conversion functions and I regularly have "extended" characters in my tags...

Comment 3 Tomas Simonaitis 2005-08-21 20:38:19 UTC

Created attachment 12306
Test code

Running this code (with an MP3 file as the argument) results in:
input='ž' (2 bytes)
title(true)='Å¾' (4 bytes) // this, I think, is wrong
title(false)='ž' (2 bytes)

Comment 4 Tomas Simonaitis 2005-08-21 20:51:53 UTC

I've "stripped" tested mp3 file with:
id3v2 -D <mp3>

However, if the encoding information is somehow preserved, maybe TagLib should overwrite it when setting the tag?

Comment 5 Scott Wheeler 2005-08-21 20:54:46 UTC

Your code is wrong though -- you're just using the default (implicit) constructor for a TagLib::String, which assumes that the data is encoded in ISO-8859-1.

If you switch the line:

fs.tag()->setTitle(utf8);

to:

fs.tag()->setTitle(TagLib::String(utf8, TagLib::String::UTF8));

Then it should work fine.

Also note that if you're setting the information with the "id3v2" command line tool then it doesn't accept UTF-8 input.  It uses id3lib, which is limited to ID3v2.3 which in turn is limited to ISO-8859-1 and UTF-16.

In a nutshell you're writing invalid tags -- TagLib just gives them back to you that way.  :-)

Comment 6 Tomas Simonaitis 2005-08-21 21:14:25 UTC

Created attachment 12307
Upd test

I've changed it, but the results are now:
input='ž' (2 bytes)
taglib str(true)='ž' (2 bytes)
taglib str(false)='~' (1 bytes)
title(true)='~' (1 bytes)
title(false)='~' (1 bytes)
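
The single-byte '~' is what one would get if the Latin-1 output path keeps only the low 8 bits of each UTF-16 code unit: "ž" is U+017E, and 0x017E & 0xFF is 0x7E, the ASCII tilde. That TagLib 1.4 narrows this way is an assumption inferred from the output above, not something confirmed in this report. A small sketch, with narrowToLatin1 as a hypothetical helper:

```cpp
#include <string>

// Hypothetical helper (an assumption about the Latin-1 path, not TagLib code):
// narrow each UTF-16 code unit to one byte by keeping its low 8 bits.
// Any code point above U+00FF is silently corrupted by this.
std::string narrowToLatin1(const std::u16string &in)
{
    std::string out;
    for (char16_t unit : in)
        out += static_cast<char>(unit & 0xFF);
    return out;
}
```

Under that assumption, narrowToLatin1(u"\u017E") is the 1-byte string "~", which would explain why a frame written with the default ISO-8859-1 encoding loses the character.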

I'm actually only reading tags with TagLib;
however, Amarok fails to set them, and it seems to suffer from the same UTF-8
problem.

They use:
#define strip( x ) TStringToQString( x ).stripWhiteSpace()
m_title   = strip( tag->title() );
to read tag and:
t->setTitle( QStringToTString( mb.title() ) );
to set it.

Comment 7 Scott Wheeler 2005-08-21 21:26:34 UTC

Sorry, one additional missing line:

TagLib::ID3v2::FrameFactory::instance()->setDefaultTextEncoding(TagLib::String::UTF8);

I tested with that and it works.

In TagLib 2.0 I'll reconsider the default encoding (i.e. making it UTF-8), but at the time that 1.0 was written most consoles still defaulted to local encodings rather than UTF-8.

Comment 8 Tomas Simonaitis 2005-08-21 21:54:59 UTC

Thank you.
Tested with Amarok and it works fine.