Bug 311399 - ID3 tags with ISO 8859-1 characters wrongly encoded
Summary: ID3 tags with ISO 8859-1 characters wrongly encoded
Status: RESOLVED DOWNSTREAM
Alias: None
Product: taglib
Classification: Frameworks and Libraries
Component: general
Version: 1.8
Platform: openSUSE Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Scott Wheeler
Reported: 2012-12-09 11:12 UTC by Karl Ove Hufthammer
Modified: 2013-03-03 12:15 UTC

Attachments
MP3 file with no ID3 tags (10.12 KB, audio/mpeg)
2012-12-09 11:12 UTC, Karl Ove Hufthammer
MP3 file with wrongly encoded ID3 tags (11.32 KB, audio/mpeg)
2012-12-09 11:13 UTC, Karl Ove Hufthammer
MP3 file with some wrongly and some correctly encoded ID3 tags (11.32 KB, audio/mpeg)
2012-12-09 11:13 UTC, Karl Ove Hufthammer

Description Karl Ove Hufthammer 2012-12-09 11:12:09 UTC
ID3 (v. 2.4) tags added with Amarok are wrongly encoded iff they contain only characters that can be represented in ISO 8859-1. They are encoded as UTF-8, but the text encoding description byte is set to 00, i.e., ISO 8859-1.

I’ll attach three example files, chartest1.mp3, chartest2.mp3 and chartest3.mp3.

The chartest1.mp3 contains no ID3 tags.

In chartest2.mp3, I have added the text ‘TitleABCÆØÅ’ to selected fields in the original (chartest1.mp3) file using Amarok. Note that this string *can* be represented in ISO 8859-1. Also note that it is displayed ‘correctly’ in Amarok. But when I look in the actual file, all text encoding description bytes are set to 00, indicating ISO 8859-1, while the actual character data is UTF-8 encoded.

In chartest3.mp3, I have changed two of the fields (artist and comment) to ‘ArtistΔABCÆØÅ’, i.e., containing the character ‘Δ’, that can *not* be represented in ISO 8859-1. The frames are all stored as UTF-8, but now the text encoding description bytes for the two frames containing the ‘Δ’ are correctly set to 03. The text encoding description bytes for the other strings are still incorrectly set to 00.

Summary: Amarok *always* seems to store ID3 tags (at least on files that don’t contain any ID3 tags to begin with) as ID3 v2.4 encoded as UTF-8. However, if the character data *can* be represented in ISO 8859-1, it is incorrectly being specified as actually stored in ISO 8859-1. This is clearly a bug.
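
To make the mismatch concrete, here is a minimal sketch in Python that rebuilds the frame bodies as described above (one text encoding byte followed by the character data). It does not read the attached files; the variable names are illustrative only, and the strings are the ones quoted in this report.

```python
title = "TitleABCÆØÅ"     # representable in ISO 8859-1
artist = "ArtistΔABCÆØÅ"  # Δ is not representable in ISO 8859-1

# Body as reported for chartest2.mp3: byte 00 (ISO 8859-1) + UTF-8 data.
reported_title = b"\x00" + title.encode("utf-8")

# Body as reported for the Δ frames in chartest3.mp3: byte 03 (UTF-8) + UTF-8 data.
reported_artist = b"\x03" + artist.encode("utf-8")

# Hex dumps comparable to what a hex editor would show for these bodies.
print(reported_title.hex(" "))
print(reported_artist.hex(" "))

# The same title needs 14 bytes in UTF-8 but only 11 in ISO 8859-1,
# because Æ, Ø and Å take two bytes each in UTF-8.
utf8_len = len(title.encode("utf-8"))
latin1_len = len(title.encode("latin-1"))
print(utf8_len, latin1_len)
```

In the first dump the leading 00 claims ISO 8859-1 while the trailing bytes c3 86 c3 98 c3 85 are the UTF-8 forms of Æ, Ø and Å; that disagreement is the bug.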

Reference: http://id3.org/id3v2.4.0-structure

Reproducible: Always

Steps to Reproduce:
1. Open chartest1.mp3 in Amarok.
2. Edit the tag info in Amarok to contain ‘TestABCÆØÅ’ for a few fields and ‘TestΔABCÆØÅ’ for a few others, and save the file.
3. Open the file in a hex editor and verify that all strings are stored as UTF-8 (the Δ, Æ, Ø and Å characters should take up two bytes each, while the ASCII characters should take up one byte each), while the first strings are incorrectly preceded by a 00 character encoding byte and the other strings are correctly preceded by a 03 character encoding byte.
Actual Results:  
Wrongly encoded strings.

Expected Results:  
Preferably, the character encoding byte should be set to 03 for all the strings. (An alternative would be to leave it at 00 for the strings that can be represented in ISO 8859-1, and actually encode them in that character encoding instead of in UTF-8.)
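
The hex-editor check in step 3 can be approximated with a short script. The following is a rough sketch, not a full ID3 parser: it assumes an ID3v2.4 tag at the start of the data with no extended header, no unsynchronisation and no compressed frames, and it only looks at frames whose ID starts with "T". The function names are made up for this example, and the "mislabelled" test is a heuristic (declared ISO 8859-1, yet the non-ASCII payload happens to be valid UTF-8).

```python
def synchsafe(b: bytes) -> int:
    """Decode a 4-byte synchsafe integer (7 significant bits per byte)."""
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def text_frames(data: bytes):
    """Yield (frame_id, encoding_byte, payload) for each ID3v2.4 text frame."""
    if data[:3] != b"ID3" or data[3] != 4:
        raise ValueError("not an ID3v2.4 tag")
    end = 10 + synchsafe(data[6:10])  # tag header is 10 bytes
    pos = 10
    while pos + 10 <= end:
        frame_id = data[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":  # padding reached
            break
        size = synchsafe(data[pos + 4:pos + 8])
        body = data[pos + 10:pos + 10 + size]  # skip 10-byte frame header
        if frame_id.startswith(b"T") and body:
            yield frame_id.decode("ascii"), body[0], body[1:]
        pos += 10 + size


def looks_mislabelled(encoding: int, payload: bytes) -> bool:
    """Heuristic: declared ISO 8859-1, but non-ASCII payload is valid UTF-8."""
    if encoding != 0 or payload.isascii():
        return False
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Passing the raw bytes of one of the attached files to text_frames() and calling looks_mislabelled() on each result should flag the frames described above, provided the file meets the stated assumptions. Genuine ISO 8859-1 text can occasionally also be valid UTF-8, so a positive result is a hint, not proof.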
Comment 1 Karl Ove Hufthammer 2012-12-09 11:12:51 UTC
Created attachment 75742
MP3 file with no ID3 tags
Comment 2 Karl Ove Hufthammer 2012-12-09 11:13:20 UTC
Created attachment 75743
MP3 file with wrongly encoded ID3 tags
Comment 3 Karl Ove Hufthammer 2012-12-09 11:13:51 UTC
Created attachment 75744
MP3 file with some wrongly and some correctly encoded ID3 tags
Comment 4 Danilo Luvizotto 2013-02-16 04:08:55 UTC
Though the target milestone is 2.7, this problem still exists in Amarok 2.7.0.

I believe the importance of this bug is critical. A collection imported with correctly encoded tags will have them corrupted by Amarok when it adds its own custom tags, like "FMPS_Rating_Amarok_Score0.5", or album art. In other words, Amarok can corrupt the tags of entire collections (which can contain hundreds of files and are hard to fix). The user will not know the tags are corrupted, because they will be correct in the SQL database but not in the file itself. If the collection is "fully rescanned", Amarok will re-read the tags from the files and show them corrupted to the user.

By the way, very good bug description, Karl.
Comment 5 Myriam Schweingruber 2013-02-16 09:23:47 UTC
Guys, why do you insist on using an obsolete encoding system? One can very easily retag all files to Unicode with kid3 or easytag, and probably other mass taggers as well.
This is not in fact an Amarok bug but a wish: we do not support ISO encoding at all, so it is simply not implemented.

And this is not a regression, since it never worked in Amarok 2.x, and not a testcase, as it is not a bug, sorry.
Comment 6 Myriam Schweingruber 2013-02-16 09:27:08 UTC
Gah, I should have read correctly, sorry. It is a bug, but still not a regression, as this never worked otherwise in Amarok 2.x.

Danilo: no, this is not critical at all; please read up on the definition of a critical bug.
Comment 7 Karl Ove Hufthammer 2013-02-16 09:38:49 UTC
Myriam, you have misunderstood. This is not a wishlist; it’s a bug. I hoped my bug report was clear, as I even gave an easy to reproduce test case, but I’ll try to make it even clearer:

Amarok incorrectly writes ID3 tags. Amarok writes all tags as UTF-8 (which is great), but says that they’re encoded as ISO 8859-1 *iff* they potentially could be represented as ISO 8859-1. In other words, the actual encoding and the text encoding description byte differ. This is clearly a bug. The solution is easy: Correctly set the text encoding description byte to 03 when saving the files.

We do not insist on using an obsolete encoding. There is no option for this in Amarok, so I don’t understand your accusation. This is *not* a wish about support for any ISO encoding. I would think it wonderful if Amarok correctly saved ID3 tags only as UTF-8 (or UTF-16). Unfortunately, it sometimes doesn’t. This is a bug.

Also, this is clearly a regression, since this bug didn’t occur in earlier Amarok versions (i.e., Amarok 2.5.x, I believe).
Comment 8 Rex Dieter 2013-02-16 17:39:13 UTC
This could likely be a taglib issue, not an amarok one.

For those experiencing any "regression-like" behavior, did the version of taglib vary between working and non-working test environments?

And, it would help to mention what version of taglib you have installed currently.
Comment 9 Danilo Luvizotto 2013-02-16 18:02:25 UTC
I spent some hours trying to find this bug in the amarok source code but found no problems. I agree it may be a taglib problem. I'll post the version I'm using as soon as I get home.
Comment 10 Karl Ove Hufthammer 2013-02-16 18:36:15 UTC
Rex, the version of taglib may very well have varied. I now use the latest official openSUSE versions of the packages, i.e., Amarok 2.6.0 and Taglib 1.8, and the bug is at least present in these versions.

(The following is not important for the actual bug:)
BTW, the bug manifests itself in a slightly different form than described in my initial comment. When setting the tags to ‘TitleABCÆØÅ’, the wrong encoding description byte is used, but the tags are also *shown* ‘wrongly’ in the playlist (or really correctly, as per the ID3 standard), i.e., as UTF-8 interpreted as ISO 8859-1, with garbled characters. So Amarok is now correctly *reading* the (wrongly encoded) file, making the bug easier to spot. (The reason for the different behaviour might be that the file is not in my collection, so the tags are read from the file instead of from the collection DB?)
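
The garbling described here can be reproduced without Amarok. This is a small illustrative sketch in Python, under the assumption that a reader simply trusts the 00 byte and decodes the stored bytes as ISO 8859-1; the variable names are made up for the example.

```python
title = "TitleABCÆØÅ"

# What ends up in the file: UTF-8 bytes (but labelled as ISO 8859-1).
stored = title.encode("utf-8")

# What a reader that honours the 00 encoding byte will display.
shown = stored.decode("latin-1")
print(repr(shown))
print(len(title), len(shown))  # 11 characters become 14

# The damage is reversible as long as nothing re-saves the garbled text:
recovered = shown.encode("latin-1").decode("utf-8")
print(recovered == title)
```

Each of Æ, Ø and Å turns into ‘Ã’ followed by a control character, which is why the playlist shows garbage only for the non-ASCII part of the string.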
Comment 11 Karl Ove Hufthammer 2013-02-16 19:25:05 UTC
I have also now tested this on a Kubuntu live CD with Amarok 2.4.0 and Taglib 1.6.3, and the bug was present even back then. (This is surprising, as I didn’t notice the problem before Amarok 2.6.0.)

BTW, here the behaviour was as in my initial report, i.e., the characters did *not* appear garbled in the playlist.
Comment 12 Karl Ove Hufthammer 2013-02-16 19:33:10 UTC
I have tried some googling. Could there perhaps be a missing
TagLib::ID3v2::FrameFactory::instance()->setDefaultTextEncoding(TagLib::String::UTF8);
in Amarok?

Source: https://mail.gnome.org/archives/rhythmbox-devel/2006-June/msg00137.html (and others)
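
For illustration only, the invariant a tag writer has to maintain can be sketched in a few lines of Python. This is not TagLib or Amarok code: encode_text and its default parameter are hypothetical names, and the fallback rule (use ISO 8859-1 when it is the default and the text fits, otherwise UTF-8) is an assumption chosen to mirror the 00/03 pattern seen in the test files.

```python
LATIN1 = 0x00  # ID3v2 text encoding byte for ISO 8859-1
UTF8 = 0x03    # ID3v2.4 text encoding byte for UTF-8


def encode_text(text: str, default: int = LATIN1):
    """Return (encoding_byte, payload) so that byte and payload always agree."""
    if default == LATIN1:
        try:
            return LATIN1, text.encode("latin-1")
        except UnicodeEncodeError:
            pass  # not representable in ISO 8859-1: fall through to UTF-8
    return UTF8, text.encode("utf-8")


print(encode_text("TitleABCÆØÅ"))                # byte 00, real ISO 8859-1 payload
print(encode_text("ArtistΔABCÆØÅ"))              # byte 03, UTF-8 payload
print(encode_text("TitleABCÆØÅ", default=UTF8))  # byte 03, UTF-8 payload
```

With a UTF-8 default, every frame gets 03 and a UTF-8 payload, which is the preferred outcome from the original report; with an ISO 8859-1 default, the payload must really be ISO 8859-1 whenever the byte says 00.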
Comment 13 Karl Ove Hufthammer 2013-02-16 19:44:20 UTC
Looks like this bug was actually fixed in 2005
https://bugs.kde.org/show_bug.cgi?id=111246
but the fix seems to have been lost in the meantime (in major code changes).
Comment 14 Danilo Luvizotto 2013-03-03 05:24:42 UTC
Today I spent some hours (again) trying to debug this. I also downloaded the latest taglib source code from http://taglib.github.com/releases/taglib-1.8.tar.gz .

After analyzing the code of both Amarok and taglib I couldn't find anything wrong. So I self-compiled the taglib sources I downloaded and installed it. Now this bug doesn't manifest for me anymore.

After some more research, I found out I was using taglib 1.8 from packman (my system is running opensuse Tumbleweed). So I downloaded the sources packman used (packman.links2linux.org/downloadsource/362876/taglib-1.8-54.2.src.rpm) and found the problem: a patch named "taglib-1.8-ds-rusxmms-r2.patch", which packman applies to the taglib sources.

So this bug is not an Amarok or taglib bug. It's a bug in the modified sources packman uses.

Thank you everyone for your help!
Comment 15 Danilo Luvizotto 2013-03-03 05:31:39 UTC
One more comment: taglib from original opensuse repo has this bug also. Only self-compiled taglib works fine.
Comment 16 Myriam Schweingruber 2013-03-03 11:14:36 UTC
Good to know, did you report this to Opensuse? Then please provide a link to the bug here.
Comment 17 Danilo Luvizotto 2013-03-03 11:24:58 UTC
openSUSE Tumbleweed doesn't have a bug tracker, as it is a rolling release. I don't have a 12.2 installation, but 12.3 will be released 10 days from now, and then my packages will be the same as in that release. When that happens, I'll be able to re-test and report a bug to the 12.3 bug tracker.
Comment 18 Myriam Schweingruber 2013-03-03 11:39:00 UTC
Just setting the product right, waiting for a bug link, then.
Comment 19 Scott Wheeler 2013-03-03 12:06:24 UTC
OpenSUSE was for a while (not sure if they still are) applying the RusXMMS patches to TagLib which cause this:

http://lists.opensuse.org/opensuse-bugs/2012-11/msg00539.html
Comment 20 Danilo Luvizotto 2013-03-03 12:15:14 UTC
Unfortunately, they think the bug is fixed and continue to use that patch: https://bugzilla.novell.com/show_bug.cgi?id=780256