Bug 99149 - TagLib violates I18N by failing to understand 95% of non-Latin1 tags.
Alias: None
Product: taglib
Classification: Unclassified
Component: general
Version: unspecified
Platform: RedHat RPMs Linux
Importance: NOR wishlist with 20 votes
Target Milestone: ---
Assignee: Scott Wheeler
Depends on:
Reported: 2005-02-11 19:54 UTC by Илья Казначеев
Modified: 2008-01-31 21:23 UTC

Test case (566 bytes, application/octet-stream)
2005-11-19 01:17 UTC, shattered

Description Илья Казначеев 2005-02-11 19:54:06 UTC
Version:           1.3.1 (using KDE KDE 3.2.2)
Installed from:    RedHat RPMs
OS:                Linux

TagLib lets the developer select the encoding for ID3v2: Latin1, UTF-8, or UTF-16 (LE/BE).

Developers unaware of i18n (like those who wrote the KDE metadata plugin) tend to select Latin1 (or rather just keep the default). This guarantees that no tag in a non-European language (like Russian) will ever be written properly. Europe gets its Latin letters and umlauts, but everyone else sees only garbage in written tags.

Solution: Somehow detect whether the encoding is 8-bit or UTF-8.

REMARK: 95% of non-Latin1 tags use encodings the standard prohibits, mostly cp-* (thanks to Windows apps, which just don't care; neither does id3lib). TagLib offers a solution for ID3v1 but does nothing for ID3v2. That makes 95% of such tags unreadable, and 100% of them unreadable with the default behavior.

http://webcenter.ru/~ilyak/taglib.mp3 - mp3 tagged by taglib.
http://webcenter.ru/~ilyak/id3v2.mp3 - mp3 tagged with id3lib-using id3v2 program.

http://webcenter.ru/~ilyak/juk.jpg - JuK is one of the few apps that actually has the encoding switched to UTF-8, but that does not help: taglib.mp3 is readable, id3v2.mp3 is garbage.
http://webcenter.ru/~ilyak/winamp.jpg - Winamp5: id3v2.mp3 is readable, taglib.mp3 is garbage. "Thanks, comrade Stalin, for our bright childhood" (c) - utterly not standards compliant.
http://webcenter.ru/~ilyak/wmp.jpg - Yes, Windows Media Player does the trick and displays the tags in both files fine. We have something to learn from M$.
Comment 1 Ilya Konstantinov 2005-02-11 20:45:48 UTC
Windows Media Player handles standard ID3v2 Unicode encodings great. Same goes for iTunes. And for Foobar 2000. Does Windows Media Player properly display stupid tags which were added by Winamp?

BTW, Winamp is no longer under proper development and they don't care about non-US markets anyway.
Comment 2 Илья Казначеев 2005-02-16 21:45:34 UTC
Windows Media Player handles both ID3v2 Unicode and ID3v2 8-bit tags great.

ID3v2 8-bit tags (in the locale encoding or any other given encoding, not Latin1) work in Winamp, WMP, and so on. They are produced by Winamp, id3v2 (id3lib), and probably a lot of other tools.

To fix this problem, one just has to add StringHandler support for ID3v2 as well, like the one for ID3v1, and run it only when the encoding is 8-bit ("latin1").

By the way, the default encoding for writing ID3v2 is Latin1. So basically every app that uses TagLib without changing this default is broken with respect to i18n.

This is a really ugly thing that needs fixing. The amaroK player already prefers ID3v1 over ID3v2 when recoding is on, because people keep complaining that their tags are shown incorrectly.
Comment 3 shattered 2005-11-12 18:19:23 UTC
id3v2.mp3 : TIT2 (latin_1) ['\xcf\xee\xf7\xe5\xec\xf3 \xed\xe5 \xe2 \xed\xe0\xec\xee\xf0\xe4\xed\xe8\xea\xe0\xf5?']
id3v2.mp3 : ID3v1 tag found
id3v2.mp3 : song: оНВЕЛС МЕ Б МЮЛНПДМХЙЮУ?

Here we have non-latin1 (cp1251) data in ID3v2 tag.  Wrong.  The same data in the v1 tag is technically wrong too, but an overwhelming number of existing files are tagged this way.

taglib.mp3 : TIT2 (utf_16) [u'\u041f\u043e\u0447\u0435\u043c\u0443 \u043d\u0435 \u0432 \u043d\u0430\u043c\u043e\u0440\u0434\u043d\u0438\u043a\u0430\u0445?']
taglib.mp3 : ID3v1 tag found
taglib.mp3 : song: ?????? ?? ? ????????????

ID3v2 tag is OK, v1 is garbage.  The application should discard v1 tag completely or convert utf-16-encoded data to single-byte encoding.
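
To make the mismatch concrete, here is a minimal C++ sketch (not TagLib code) that interprets the TIT2 bytes dumped above both ways. The function names decodeCp1251 and decodeLatin1 are made up for this illustration, and the CP1251 table is deliberately partial: it maps only ASCII and the contiguous Cyrillic range 0xC0-0xFF.

```cpp
#include <cassert>
#include <string>

// Decode CP1251 (Windows-1251) bytes into Unicode code points.
// Partial table for illustration: ASCII maps to itself, bytes 0xC0-0xFF map
// to U+0410-U+044F, and every other byte becomes U+FFFD.
std::u32string decodeCp1251(const std::string &bytes)
{
    std::u32string out;
    for (unsigned char b : bytes) {
        if (b < 0x80)
            out += static_cast<char32_t>(b);
        else if (b >= 0xC0)
            out += static_cast<char32_t>(0x0410 + (b - 0xC0));
        else
            out += U'\uFFFD';
    }
    return out;
}

// Decode the same bytes as ISO-8859-1, where each byte is its own code point.
std::u32string decodeLatin1(const std::string &bytes)
{
    std::u32string out;
    for (unsigned char b : bytes)
        out += static_cast<char32_t>(b);
    return out;
}
```

Fed the bytes from id3v2.mp3, decodeCp1251 yields the same code points that taglib.mp3 stores as UTF-16 (U+041F U+043E U+0447 ...), while decodeLatin1 yields accented Latin letters starting with U+00CF.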
Comment 4 Илья Казначеев 2005-11-16 02:21:48 UTC
> Here we have non-latin1 (cp1251) data in ID3v2 tag.  Wrong.
id3v2 does not care.
lame does not care.
winamp does not care.

DAMN, EVERYONE WRITES SUCH TAGS. So TagLib HAS to handle them. Otherwise it's like owning the first fax machine in the world.

> ID3v2 tag is OK, v1 is garbage.
So why does TagLib, in its default setup, write garbage ID3v1 tags?

Please, please, don't tell me about standards. TagLib does not care about standards. It has an ID3v2 'default encoding' option, which overrides the encoding field of ID3v2 tags. This causes two things: 1) Latin-alphabet people see their Latin1 tags even in mistagged files; 2) everyone else has a 0% chance of seeing their tags correctly, because most software is written by Latin-alphabet people who will likely force Latin1 here, which breaks every non-Latin tag.

Possible solution to the latter problem: add support for a ~/.tagrc file, where you can say "ID3v2 encoding = UTF16" to force UTF-16, or "ID3v2 encoding = intag" to force the in-tag encoding field to be used, UNOVERRIDABLE by apps. Without this, TagLib is not i18n-compliant (at all).

Possible solution to this whole topic: add a recoding string handler to TagLib itself (as a regular subclass of StringHandler), and make it able to recode both ID3v1 and ID3v2 [latin1] tags. Add ~/.tagrc file support with "ID3v1_8bit_encoding = <one of libiconv supported>" and "ID3v2_8bit_encoding = <one of libiconv supported>", again unoverridable by the app (but overridable if the app uses a custom StringHandler, as it probably knows what it is doing).

I can implement these patches if you'll commit them, but I have no idea whether you would. What I see for now is more like "Let's close our eyes and not look at the cruel, non-standard-compliant world". In an ideal world with ideal standards that might be fine, but ID3 has always been lousy.
Comment 5 shattered 2005-11-19 01:15:45 UTC
Indeed, without calling setDefaultTextEncoding, non-Latin1 strings are rendered incorrectly into v2 tags (test case -- bug99149.cpp, it's in UTF-8), perhaps that should be fixed.

FYI, amarok dropped support for recoding v1 tags recently.
Comment 6 shattered 2005-11-19 01:17:44 UTC
Created attachment 13546
Test case
Comment 7 shattered 2005-11-19 16:03:30 UTC
all in all, this looks like a dup of bug 90635
Comment 8 Илья Казначеев 2005-11-19 20:18:18 UTC
And WITH setDefaultTextEncoding, there is a 0% chance of correctly parsing an ID3v2 tag if it happens to be in an encoding different from your "default".

No wonder there are 3 dups a day for a no-go issue that gets ignored by the devs.
Comment 9 shattered 2005-11-19 20:57:26 UTC
It's hard to guess what you mean.  Given a UTF-8 string with cyrillic chars, and defaultTextEncoding set to UTF-*, taglib writes perfectly valid v2 tag:

String ru("в чащах юга жил бы цитрус? да, но фальшивый экземпляр.", String::UTF8);

...results in:

joo.mp3 : TIT2 (utf_16) [u'\u0432 \u0447\u0430\u0449\u0430\u0445 \u044e\u0433\u0430 \u0436\u0438\u043b \u0431\u044b \u0446\u0438\u0442\u0440\u0443\u0441? \u0434\u0430, \u043d\u043e \u0444\u0430\u043b\u044c\u0448\u0438\u0432\u044b\u0439 \u044d\u043a\u0437\u0435\u043c\u043f\u043b\u044f\u0440.']
Comment 10 Илья Казначеев 2005-11-19 22:06:02 UTC
Given a FILE with a perfectly valid UTF-16 ID3v2 tag, and defaultTextEncoding set to Latin-1, TagLib READS a perfectly b0rked string. Check for yourself, I don't have your toolset.

Don't ask me why defaultTextEncoding is set to Latin-1: in 95% of apps it IS set to Latin-1. Ask yourself: why, oh god why, does this parameter override the 'encoding' field in the tag?
Comment 11 shattered 2005-11-19 23:51:37 UTC
Checking, using taglib 1.4, in a UTF-8 xterm.  The code is:

cout << "current title is: " << MPEG::File(argv[1]).tag()->title().to8Bit(String::UTF8) << endl;

First, the file that you provided as a test case:

% ./bug99149 taglib.mp3 
current title is: Почему не в намордниках?

Next, a random file from my collection that happens to have UTF-8, but non-Cyrillic title:

% ./bug99149 Antique\ -\ Dinata\ Dinata\ \(Dance\ Mix\).mp3
current title is: Δυνατά Δυνατά (Dance Mix)

Antique - Dinata Dinata (Dance Mix).mp3 : TIT2 (utf_8) [u'\u0394\u03c5\u03bd\u03b1\u03c4\u03ac \u0394\u03c5\u03bd\u03b1\u03c4\u03ac (Dance Mix)']
Comment 12 Илья Казначеев 2005-11-20 08:20:33 UTC
Did you call setDefaultTextEncoding with Latin-1 before trying that?

If yes, then I don't know. If no, I would say that it works fine until you call setDefaultTextEncoding, and 95% of apps seem to always call it at the very beginning.
Comment 13 shattered 2005-11-20 12:47:05 UTC
The result is the same whether I set encoding explicitly to String::Latin1 or not (it's the default).

Problem solved, no?
Comment 14 Илья Казначеев 2005-11-22 17:18:24 UTC
I will check that again and post the results. Please tell me your exact version and any additional patches.

AND there is also the problem of the missing ID3v2 Latin1 recoding facility (because these tags are known to contain, say, cp1251, which needs to be recoded).
Comment 15 shattered 2005-11-22 20:44:06 UTC
taglib 1.4 built from NetBSD pkgsrc on NetBSD-current (3.99.10) by gcc 3.3.3.  See http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/audio/taglib/.
Comment 16 Funda Wang 2005-11-26 07:48:11 UTC
The actual desired result is:

if (is id3v1)
   if (Content is locale8bit valid)
      Recode locale8bitToUtf-8
   else
      Recode Latin-1
else if (is id3v2)
   if (Content is locale8bit valid)
      Recode locale8bitToUtf-8
   else if (Content is defaultTextEncoding valid)
      Recode defaultTextEncodingToUtf-8
   else
      Recode Utf-8
Comment 17 shattered 2005-11-26 08:40:13 UTC
The real question is -- how to decide that "Content is locale8bit valid"?
Comment 18 Funda Wang 2005-11-26 09:05:12 UTC
echo "$CONTENT" | iconv -f "$(locale charmap)" -t utf-8

Or, something like that. LC_CTYPE?
Comment 19 Funda Wang 2005-11-26 10:02:19 UTC
And GStreamer has introduced an environment variable for this: GST_ID3_TAG_ENCODING

If TagLib could honor this environment variable, it would be of great interest.
Comment 20 shattered 2005-11-26 10:14:13 UTC
This just forces the decision onto the user -- she has to decide if broken tag was decoded correctly, in worst case -- for every file.  Taglib should deliver raw data, not try to guess the intentions of software that produced broken tags.
Comment 21 Funda Wang 2005-11-26 12:01:37 UTC
> she has to decide if broken tag was decoded correctly, in worst
> case -- for every file. 
The worst case you mentioned is very common in the non-Latin world, including CJK, Russian, Indian, etc.

> not try to guess the intentions of software that produced broken tags.
You mean that all of kio_mp3, amaroK, noatun, JuK, yammi ... should each implement encoding selection themselves, on top of the same codebase?
Comment 22 shattered 2005-11-26 12:46:08 UTC
The only reason why a software might need to select encoding is when it deals with mis-tagged files; once the tags are converted to Unicode, this reason disappears.  AFAIK, there's no tagging software that can do such conversion en-masse; this may be the reason why a lot of users wish for "recoding" features in players.
Comment 23 Funda Wang 2005-11-26 12:57:41 UTC
Nope. As for ID3v1, it should contain Latin-1 characters only. But in fact, there are a lot of files that use ID3v1 to store locale (ANSI) characters. Besides that, a lot of ID3v2 tags are encoded in ANSI rather than UTF-8.

My suggestion in comment #16 is quite a reasonable solution. Of course, you could give it a different function name from the current one, such as GuessTagsEncoding or something like that.
Comment 24 shattered 2005-11-26 14:17:15 UTC
I'm well aware that a lot of existing files have tags that technically violate ID3v1 standard.  But since ID3v2 supports Unicode, there is no valid reason to violate the standard.
Comment 25 Funda Wang 2005-11-27 05:10:17 UTC
The valid reason is that no Windows application is aware of the ID3 standard, including Windows Media Player. And users are always complaining about the usability of Linux applications.

And, I don't think my suggestion will break id3 standard support in taglib.
Comment 26 shattered 2005-11-27 14:03:15 UTC
What Windows applications don't handle Unicode in ID3v2?  Shouldn't these apps be fixed in the first place?

The submitter of this bug has tested Windows Media Player (see his screenshots), and it handles Unicode just fine.

My point is -- the standard is made so that different applications will interoperate.  As such, it should be enforced, and misbehaving applications should be fixed.
Comment 27 Scott Wheeler 2005-11-27 19:24:17 UTC
I'll just drop in here with a few notes:

- My opinion is mostly similar to Sergey's, which I've stated in other reports and thus far didn't feel particularly motivated to repeat here.  (One thing that you may want to note though -- our two "Ilya"s are not the same.  This confused me at first too as at a quick glance one looks like a transliteration of the other.)

- Funda, yes, most Windows applications support ID3v2; as was said earlier, that includes WMP, Foobar 2000, iTunes, etc.  I was also impressed to see that as of 5.0 iTunes finally updated their implementation to support ID3v2.4 (and as such UTF-8).

- From the way that I see things, it mostly seems to be that Winamp is broken and that's the core of the problem.  All of the other mentioned things are command line Open Source utilities.  If you want them fixed, please talk to the authors.  Honestly, on Winamp, I just don't really care since well, it's dead.

- If there are other problems with interoperability in applications that actually are using unicode, feel free to report those separately.

- Encoding detection is non-trivial.  And often slow.  See, the problem is that validity isn't straightforward.  UTF-8 byte sequences can be legal or illegal, but the other encodings are basically just bitstreams.  Finding out whether they're more likely to be one encoding or another requires some knowledge of the languages often used with each encoding, and even then it's naturally just heuristics, not a 100% reliable thing.  That might be useful for something that is working on one paragraph of text, but in something like TagLib, which is often reading 50,000 strings in a few seconds, it's just not acceptable.
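
The asymmetry described in the last point can be sketched in a few lines of C++. This is an illustrative, standalone check and not part of TagLib; isValidUtf8 is a hypothetical helper name. It shows that UTF-8 well-formedness is mechanically testable, whereas any byte string is a "valid" ISO-8859-1 string, so no comparable test exists for single-byte encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return true if `s` is well-formed UTF-8. Rejects stray continuation bytes,
// truncated sequences, overlong forms, surrogates, and values above U+10FFFF.
bool isValidUtf8(const std::string &s)
{
    static const char32_t minForLen[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                      // not a legal lead byte
        if (i + len > n) return false;          // sequence runs past the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < minForLen[len]) return false;            // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // outside Unicode
        i += len;
    }
    return true;
}
```

The CP1251 title bytes from comment 3 fail this check at the second byte, so they cannot be UTF-8. The converse does not hold: the two bytes 0xC3 0xA9 pass, even though a writer may have meant two Latin-1 characters, which is why a validity test alone can only rule encodings out.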
Comment 28 Funda Wang 2005-11-28 01:32:39 UTC
Maybe you've misunderstood me. Windows Media Player handles both ID3v1 and ID3v2 tags perfectly, whether they are standard compliant or not. That is the main point I want to make. And for those MP3 files which contain only ID3v1 tags, WMP always guesses the encoding correctly.

As I said, an experimental encoding-guessing method should be provided as the basis for a perfect solution.
Comment 29 Thiago Macieira 2005-11-28 01:37:59 UTC
"Perfection stands in the way of good"

also known as the 80%/20% rule
Comment 30 Илья Казначеев 2005-11-28 11:55:04 UTC
Mod #28 and #16 up!

#27: Everything is broken. Winamp, check. Lame, check. ID3v2 (as in utility), check.
Comment 31 LuRan 2005-11-28 14:09:05 UTC
Here is my guess about how WMP works. Since I have just 'fixed' all my
non-standard tagged files thanks to amarok, I found that all those non-unicode
text frames have their encoding field set to ISO-8859-1, so I think WMP just
handles unicode text following the standard, and when the encoding field is
latin1, it uses the locale encoding instead. Since the locale encoding almost
certainly contains latin1 as a subset, the result cannot be worse. And since
Windows doesn't have a locale mess like linux has now, it has a good chance of
rendering the text correctly.

Funda: help me prove my theory. Try some big-5 or sjis encoded tags in your WMP and see if it can display them correctly.
Comment 32 Scott Wheeler 2005-11-28 15:53:46 UTC
Илья, this isn't Slashdot and "mod them up" is hardly persuasive.  I started to type a response, but well, it's just repeating what I said above, so I'm not going to bother.  Similarly, repeating yourself here doesn't really make the case more persuasive.  If you have more information to add, or if you'll respond to the individual points that I raised then we have some basis for a discussion on this topic.  For the reasons that I pointed out above, I'm not yet convinced that there's an easy solution to this problem or that it's as critical as you seem to believe.  And since it's me who will make the decision in the end, actually discussing this is more useful than just babbling.
Comment 33 shattered 2005-11-28 17:33:27 UTC
My 5 eurocents:

There are a number of ways to present readable tags to the user:
-- #1, of course, is to convert them to Unicode (and drop v1 tags altogether).  There are tools to do this, but I have not tested any of them:  id3iconv (java, cli) and Unicode Rewriter (gui for id3iconv) -- http://www.cs.berkeley.edu/~zf/id3iconv/ and http://unicoderewriter.sourceforge.net/; id3mod (macos x) -- http://www.macupdate.com/info.php/id/15953; mp3 unicoder (.net) -- http://adam.theficus.com/development/UnicoderInstaller.msi
-- like #1, but also write v1 tags for legacy apps (portable player firmware, for example) in whatever locale encoding they happen to handle best.  Amarok used to support this (the support was removed in 1.4-svn).
-- and for Winamp aficionados -- provide a playlist in #EXTM3U format (with track names listed in #EXTINF tags).
Comment 34 Илья Казначеев 2005-11-28 22:21:40 UTC
Scott Wheeler: The problem is: I can not see my tags at all. I see garbage instead of them. If THAT does not convince you then I don't know what would.

You can't use a tag library that can't provide readable tags.

And, in fact, Funda Wang provided a really helpful algorithm. I can add only my 5 kopecks: you should recode from whatever-locale-is-specified (I don't know where; maybe in a dot-file?) to UTF-8, not from locale8bit to UTF-8.
Comment 35 LuRan 2005-11-29 05:42:21 UTC
Илья, Funda Wang just provided an idea, not an algorithm. The problem is, as
Scott Wheeler pointed out, that there is no feasible way to decide if a tag is
locale8bit valid.

Sergey, I've tried some of these tools. Some of them just change the content
to UTF-8 but not the encoding field, so taglib will treat it as a latin1
encoded string and produce totally unreadable garbage. And id3lib has some
problems with Unicode: it cannot write certain Unicode chars to the tag, as
mentioned in the source code of kid3. I ended up writing a small tag
conversion tool using taglib myself, and I'm trying to convince the amarok
developers on the mailing list to reintroduce the locale encoding support.
But I agree with Funda here: taglib needs some modification to provide the
best result.

Scott, I suggest treating the latin1 encoded ID3v2 tags like ID3v1, or at
least providing a similar StringHandler abstraction in taglib. It should not
introduce any regression, and will make taglib much more flexible. And if you
don't mind the dirty tricks, I think taglib could provide a default
implementation that guesses the encoding from the locale, or gets the charset
information from some environment variables.
Comment 36 Funda Wang 2005-11-29 08:04:53 UTC
c#31: Sorry, I don't know any of the utilities which can tell me the exact id3 structures of my files.

c#35: There IS a feasible way to decide if a tag is local8bit valid. The iconv() function, as I said in c#16, is a perfect solution here. It will set an error code if the string isn't valid in the specified charset.

Scott, if you are interested in implementing this via an environment variable, GST_ID3_TAG_ENCODING, as I said in #19, would be a good start. As it is locale independent, it will make a lot of users happier.
Comment 37 LuRan 2005-11-29 08:55:40 UTC
Funda: iconv() can tell you if a tag is local8bit valid, but we cannot rely
on it to test if a tag is local8bit encoded. In a utf-8 locale any string
will be valid. For relatively short texts, like title and album, I guess
we can make up some big5 string that is valid in gbk.
Comment 38 Funda Wang 2005-11-29 12:17:48 UTC
Yes, a single title/album/artist tag is too short to guess the encoding from, but how about concatenating them? There is no case where the title, artist, and album of one song use completely different encodings.

Not all strings are valid in a UTF-8 locale. I think you've mixed up bytestreams and UTF-8 strings :) Furthermore, if the implementation is based on a separate environment variable rather than the locale, there is no need to worry about the locale problem. The most important thing is that this is an *experimental* guessing feature, which is allowed to guess wrong.
Comment 39 shattered 2005-11-29 15:46:32 UTC
Funda: I use pytagger (http://www.liquidx.net/pytagger/) to dump ID3v2 structures.

LuRan: I am testing id3iconv now, will also test MP3Unicoder.  So far, so good, more details to follow.
Comment 40 LuRan 2005-11-29 16:55:59 UTC
Funda: I admit I was a little bit slack here, but my point is that guessing is
always too complex for taglib as a simple lib, not to mention that the results
are usually unpredictable. And we don't need to guess: if a field says it is
encoded in unicode, then handle it in the standard way; if a field says it is
latin1, use the locale charset as a substitute. I think this will solve most
of the problem.
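
A minimal sketch of that rule in C++, assuming the ID3v2.4 numbering of the text-encoding byte (0 = ISO-8859-1, 1 = UTF-16 with BOM, 2 = UTF-16BE, 3 = UTF-8). The function sourceCharset and its legacyCharset parameter are hypothetical and not TagLib API; the returned names are meant to be handed to a converter such as iconv.

```cpp
#include <cassert>
#include <string>

// Text-encoding byte at the start of an ID3v2.4 text frame.
enum class FrameEncoding : unsigned char {
    Latin1 = 0, Utf16 = 1, Utf16BE = 2, Utf8 = 3
};

// Pick the charset name to decode a frame with. `legacyCharset` is a
// hypothetical user setting (for example "CP1251"); an empty string means
// "follow the standard and really treat the bytes as ISO-8859-1".
std::string sourceCharset(FrameEncoding declared, const std::string &legacyCharset)
{
    switch (declared) {
    case FrameEncoding::Utf16:   return "UTF-16";    // byte order comes from the BOM
    case FrameEncoding::Utf16BE: return "UTF-16BE";
    case FrameEncoding::Utf8:    return "UTF-8";
    case FrameEncoding::Latin1:  break;
    }
    // Only frames labeled ISO-8859-1 (or carrying an unknown encoding byte)
    // are ever reinterpreted; Unicode frames are trusted as declared.
    return legacyCharset.empty() ? "ISO-8859-1" : legacyCharset;
}
```

With legacyCharset set to "CP1251", a frame labeled ISO-8859-1 would be decoded as CP1251, while frames labeled as Unicode are left alone. Where legacyCharset comes from (locale, environment variable, or an application setting) is left open here.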

Sergey: I never tried id3iconv, because I cannot find the java mp3 library :(.
Comment 41 shattered 2005-11-29 17:20:51 UTC
LuRan: are you building from source?  because the compiled version (id3iconv-0.2.1.jar) runs just fine.
Comment 42 Scott Wheeler 2005-11-29 17:33:35 UTC
LuRan -- just using the locale encoding doesn't really work either.  If I send a file that uses ISO-8859-1 to someone in a non-ISO-8859-1 locale, and that locale's encoding isn't a superset of ISO-8859-1 (most aren't), the characters will appear broken there too.  Many encodings are supersets of 7-bit ASCII, not of 8-bit ISO-8859-1.
Comment 43 LuRan 2005-11-30 03:49:46 UTC
Sergey: I've just tried the jar, works pretty well, handy tool ;). But even
though there is a tool for recoding, there are still some situations in which
you cannot modify the file, listening to some read-only remote files for
example, so a recoding function during tag reading is still necessary.

Scott: Sorry, I wasn't aware of that :(. Then that leaves us two choices: simply
provide an abstraction to let the application handle the encoding mess, or go
one step further, as Funda suggested, and provide a default implementation (or
just provide one without making it the default) that converts the tag
according to some environment variables?
Comment 44 shattered 2005-11-30 08:18:47 UTC
By read-only files, do you mean mp3 streams?  I am not sure that mp3 streaming servers support Unicode in metadata at all...

Funda: have a look at Mozilla Universal Charset Detector (http://www.mozilla.org/projects/intl/chardet.html).
Comment 45 LuRan 2005-11-30 08:38:31 UTC
Sergey: No, Just downloadable mp3 files, like http://somewhere/some.mp3, or
URL in the m3u files. You could download it and use the modified local file,
but some people prefer online music.
Comment 46 shattered 2005-11-30 09:30:29 UTC
These files have to have ID3v2 tags, then, since they're located at the start...  but you've said (in amarok maillist) that even these are broken in China, right?
Comment 47 LuRan 2005-11-30 10:58:42 UTC
I don't know exactly what you mean, but the mp3 files I can find in China
usually have both ID3v1 and ID3v2 tags, both in the locale encoding, with the
encoding field in the ID3v2 tags set to ISO-8859-1. And I think that having to
run a program to convert the tags every time a new mp3 is downloaded is
already unacceptable to a normal user.
Comment 48 Илья Казначеев 2005-11-30 13:52:16 UTC
I already mentioned this:

We read ~/.taglibrc at beginning, where we find: 
ID3v1_encoding = cp1251
ID3v2_encoding = intag
ID3v2_latin1_recode = cp1251

And we actually treat all ID3v1 and all ISO-8859-1 ID3v2 tags as cp1251 and recode them, regardless of whether we can do this correctly or not. Just because the user told us to do this, and he might know. If he doesn't, it's his problem.
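
A rough sketch of reading such a file, in C++. The file name and the three keys are only the proposal from this comment, not an existing TagLib feature, and parseTagRc and parseTagRcText are hypothetical helpers; they accept "key = value" lines and ignore everything else.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Strip leading and trailing spaces, tabs, and carriage returns.
std::string trim(const std::string &s)
{
    const auto first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos)
        return "";
    const auto last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Parse "key = value" lines from a stream. Lines without '=' or with an
// empty key are skipped; a repeated key keeps its last value.
std::map<std::string, std::string> parseTagRc(std::istream &in)
{
    std::map<std::string, std::string> settings;
    std::string line;
    while (std::getline(in, line)) {
        const auto eq = line.find('=');
        if (eq == std::string::npos)
            continue;
        const std::string key = trim(line.substr(0, eq));
        if (!key.empty())
            settings[key] = trim(line.substr(eq + 1));
    }
    return settings;
}

// Convenience wrapper for parsing configuration text held in memory.
std::map<std::string, std::string> parseTagRcText(const std::string &text)
{
    std::istringstream in(text);
    return parseTagRc(in);
}
```

The resulting map could then drive the choice of decoder, for example by looking up "ID3v2_latin1_recode" whenever a frame is labeled ISO-8859-1.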
Comment 49 shattered 2005-12-01 08:37:34 UTC
How will application know that taglib is doing the translation?  How does the user turn it off at runtime?

Also, not all iconv libraries are created equal (I'm talking about accuracy of translation tables).  Thus, the result of translation should not be trusted -- treated as read-only, perhaps.
Comment 50 LuRan 2005-12-01 09:49:34 UTC
How about we just provide two interfaces: one gives the raw latin1 string to
the application, the other returns the 'should-be-correctly-converted' string.
At least we will not kill any possibility here, and we give the application a
choice.

But as a first step, taglib should give the application a way to handle
the non-standard latin1 strings in ID3v2. With the current code we cannot tell,
when handling an ID3v2 tag, whether a string comes from a latin1 encoded field
or a unicode encoded field, and thus cannot do any conversion even if we want to.
Comment 51 Brad Taylor 2006-03-29 07:05:12 UTC
Has anything transpired about this problem?  Many of my users are having trouble using UTF-8 strings in mp3 files as they seem to be misrepresented as latin1.

Is it correct (and supported) behaviour to manually set the encoding of mp3 files using the method described in #9?  If this is the RightWayTM, can this method be exported in the C bindings?
Comment 52 shattered 2006-04-03 21:49:48 UTC
I'd just mass-convert all tags to Unicode from whatever encoding they happen to be at the moment (and make sure that consumers of mp3 files support ID3v2 completely).  id3iconv (written in Java) is good enough for this task.

Or am I misunderstanding your problem?
Comment 53 Brad Taylor 2006-04-03 22:22:07 UTC
I guess I didn't mention this but I use Taglib in my music organizer, Cowbell (http://more-cowbell.org) and many of my users have been having problems writing UTF-8 strings in tags when the mp3's original encoding wasn't UTF-8.  From my testing, it seems that TagLib can't/doesn't detect this problem and change the file's encoding accordingly.  

My query was basically, is this a "feature", a bug, or something that can be rectified by using the method described in comment #9?  And further, is this the recommended way for consumers of TagLib to correctly support Unicode?

By the way, I'm using the C bindings to TagLib, but I can ship C++/C glue if necessary.
Comment 54 Funda Wang 2006-04-04 00:18:52 UTC
> Or am I misunderstanding your problem?
Probably yes :p

The actual problem is that there are a lot of files which use ANSI characters rather than Latin1 characters in ID3v1, or whose actual ID3v2 tag encoding does not match the encoding field set in the tag. And TagLib cannot read those tags correctly. It just complies with the ID3 standard, without considering any invalid situations.
Comment 55 Alexandre Oliveira 2007-02-25 01:11:07 UTC
*** Bug 142162 has been marked as a duplicate of this bug. ***
Comment 56 Илья Казначеев 2007-02-25 09:53:24 UTC
As of today, amaroK writes correct ID3v2.4 UTF-8 tags.
It still writes incorrect ID3v1 tags in something like UTF-16 (which it shouldn't), but it correctly reads both ID3v2 UTF-16 and ID3v2.4 UTF-8.

I guess this bug could be closed after some checking.

Comment 57 Scott Wheeler 2008-01-30 13:47:22 UTC
SVN commit 768597 by wheeler:

Don't try to write non-Latin1 values to ID3v1 tags since ugly things will
happen when some of the characters are null.  This behavior can still be customized
via the StringHandler.


 M  +5 -0      id3v1tag.cpp  

WebSVN link: http://websvn.kde.org/?view=rev&revision=768597
Comment 58 Scott Wheeler 2008-01-31 21:09:16 UTC
I added some code to make TagLib automatically switch to writing unicode frames when the string is unicode (I tried to CC this bug, but pasted the wrong bug number in).  At this point that's all I really want to invest into this, so for now I'm closing it as won't-fix.
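
The idea behind the automatic switch can be sketched as follows. This is not the committed TagLib code, only an illustration of the behavior described above under stated assumptions: chooseWriteEncoding, fitsLatin1, and WriteEncoding are invented names, and the text is modeled as a sequence of Unicode code points.

```cpp
#include <cassert>
#include <string>

// Encodings a writer can choose between in this sketch.
enum class WriteEncoding { Latin1, Unicode };

// True when every code point fits in a single ISO-8859-1 byte.
bool fitsLatin1(const std::u32string &text)
{
    for (char32_t cp : text) {
        if (cp > 0xFF)
            return false;
    }
    return true;
}

// Honor the preferred encoding unless it is Latin1 and the text cannot be
// represented in it; in that case fall back to a Unicode frame.
WriteEncoding chooseWriteEncoding(const std::u32string &text, WriteEncoding preferred)
{
    if (preferred == WriteEncoding::Latin1 && !fitsLatin1(text))
        return WriteEncoding::Unicode;
    return preferred;
}
```

A caller that keeps the Latin1 default would still get a Unicode frame for a Cyrillic title, while plain Western European text stays in Latin1.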
Comment 59 Илья Казначеев 2008-01-31 21:23:52 UTC
Now it works more or less with unicode ID3v2, so please don't touch it :)

Having said that, thanx for unicode autoswitching.