Version: 1.3.1 (using KDE 3.2.2)
Installed from: RedHat RPMs
TagLib lets the developer select the encoding for ID3v2: Latin1, UTF-8, or UTF-16 (LE/BE).
Developers unaware of i18n (like those who wrote the KDE metadata plugin) tend to select Latin1 (or simply keep the default). This guarantees that no tag in a non-European language (like Russian) is ever written properly. Europe gets its umlauts, but everyone else sees only garbage in written tags.
Solution: Somehow detect whether the encoding is 8-bit or UTF-8.
REMARK: 95% of non-Latin1 tags contain encodings prohibited by the standard, mostly cp-* (thanks to Windows apps, which just don't care; neither does id3lib). TagLib offers a solution for ID3v1 but does nothing for ID3v2. So 95% of tags are unreadable, and 100% are unreadable with the default behavior.
http://webcenter.ru/~ilyak/taglib.mp3 - mp3 tagged by taglib.
http://webcenter.ru/~ilyak/id3v2.mp3 - mp3 tagged with id3lib-using id3v2 program.
http://webcenter.ru/~ilyak/juk.jpg - JuK is one of the few apps that actually has the encoding switched to UTF-8; it does not help. taglib.mp3 is readable, id3v2.mp3 is garbage.
http://webcenter.ru/~ilyak/winamp.jpg - Winamp5 - id3v2.mp3 readable, taglib.mp3 garbage. "Thanx comrade Stalin for our bright childhood" (c) - utterly not std compliant.
http://webcenter.ru/~ilyak/wmp.jpg - Yes, Windows Media Player does the trick, displaying the tags in both files fine. There is something to learn from M$.
Windows Media Player handles standard ID3v2 Unicode encodings great. Same goes for iTunes. And for Foobar 2000. Does Windows Media Player properly display stupid tags which were added by Winamp?
BTW, Winamp is no longer under proper development and they don't care about non-US markets anyway.
Windows Media Player handles both ID3v2 Unicode and ID3v2 8-bit tags great.
ID3v2 8-bit tags (in the locale encoding or any given encoding, not Latin1) work in Winamp, WMP, and so on. They are produced by Winamp, id3v2 (id3lib) and probably a lot of other tools.
To fix the problem, one just has to add StringHandler support for ID3v2, like the one for ID3v1, and run it only when the encoding is 8-bit ("latin1").
By the way, the default encoding for writing ID3v2 is Latin1. So basically every app that uses TagLib without changing this default is broken for i18n.
This is a really ugly thing that needs fixing. The amaroK player already prefers ID3v1 over ID3v2 when recoding is on, because people keep complaining that their tags are shown incorrectly.
id3v2.mp3 : TIT2 (latin_1) ['\xcf\xee\xf7\xe5\xec\xf3 \xed\xe5 \xe2 \xed\xe0\xec\xee\xf0\xe4\xed\xe8\xea\xe0\xf5?']
id3v2.mp3 : ID3v1 tag found
id3v2.mp3 : song: оНВЕЛС МЕ Б МЮЛНПДМХЙЮУ?
Here we have non-Latin1 (cp1251) data in an ID3v2 tag. Wrong. The same data in the v1 tag is technically wrong too, but an overwhelming number of existing files are tagged this way.
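The TIT2 bytes in the dump above can be checked by hand. What follows is a minimal sketch, not TagLib code: latin1ToUtf8 and cp1251ToUtf8 are illustrative names, and the cp1251 reader assumes only ASCII and the basic Russian letters (bytes 0xC0..0xFF, which cp1251 maps to U+0410..U+044F) matter. Read as Latin-1, the first six bytes come out as accented Latin letters; read as cp1251, they come out as the same word that taglib.mp3 stores in UTF-16.

```cpp
#include <cstdint>
#include <string>

// Append one Unicode code point below U+0800 to a UTF-8 string.
static void appendUtf8(std::string &out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Read each byte as ISO-8859-1: the byte value is the code point.
std::string latin1ToUtf8(const std::string &bytes) {
    std::string out;
    for (unsigned char b : bytes) appendUtf8(out, b);
    return out;
}

// Read each byte as cp1251, covering only ASCII and the basic Russian
// letters 0xC0..0xFF (U+0410..U+044F). Other bytes become '?'.
std::string cp1251ToUtf8(const std::string &bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80)       appendUtf8(out, b);
        else if (b >= 0xC0) appendUtf8(out, 0x0410 + (b - 0xC0));
        else                out += '?';
    }
    return out;
}
```

Calling cp1251ToUtf8("\xcf\xee\xf7\xe5\xec\xf3") yields the UTF-8 bytes for U+041F U+043E U+0447 U+0435 U+043C U+0443, the code points at the start of the taglib.mp3 title.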
taglib.mp3 : TIT2 (utf_16) [u'\u041f\u043e\u0447\u0435\u043c\u0443 \u043d\u0435 \u0432 \u043d\u0430\u043c\u043e\u0440\u0434\u043d\u0438\u043a\u0430\u0445?']
taglib.mp3 : ID3v1 tag found
taglib.mp3 : song: ?????? ?? ? ????????????
The ID3v2 tag is OK, v1 is garbage. The application should discard the v1 tag completely or convert the UTF-16-encoded data to a single-byte encoding.
> Here we have non-latin1 (cp1251) data in ID3v2 tag. Wrong.
id3v2 does not care.
lame does not care.
winamp does not care.
DAMN, EVERYONE WRITES SUCH TAGS. So TagLib HAS to handle them. Otherwise it's like owning the first fax machine in the world.
> ID3v2 tag is OK, v1 is garbage.
So why does taglib, on default setup, write garbage ID3v1 tags?
Please, please, don't tell me about standards. TagLib does not care about standards. It has an ID3v2 'default encoding' option, which overrides the encoding field of ID3v2 tags. The result: 1) people using the Latin alphabet see their Latin1 tags even in mistagged files; 2) everyone else has a 0% chance of seeing their tags correctly, because most software is written by people using the Latin alphabet, who will likely force Latin1 here, which breaks every non-Latin tag.
Possible solution to the latter problem: add support for a ~/.tagrc file where you can say "ID3v2 encoding = UTF16" to force UTF16, or "ID3v2 encoding = intag" to force use of the in-tag encoding field, NOT OVERRIDABLE by apps. Without this, TagLib is not i18n-compliant (at all).
Possible solution to this whole topic: add a recoding string handler to TagLib itself (as a regular subclass of StringHandler), able to recode both ID3v1 and ID3v2 [latin1] tags. Add ~/.tagrc support with "ID3v1_8bit_encoding = <one of libiconv supported>" and "ID3v2_8bit_encoding = <one of libiconv supported>", again not overridable by the app (but overridable if the app uses a custom StringHandler, as it probably knows what it is doing).
I can implement these patches if you'll commit them, but I have no idea whether you will. What I see for now is more like "Let's close our eyes and not look at the cruel non-standard-compliant world". In an ideal world with ideal standards that might be fine, but ID3 has always been lousy.
Indeed, without calling setDefaultTextEncoding, non-Latin1 strings are rendered incorrectly into v2 tags (test case -- bug99149.cpp, it's in UTF-8), perhaps that should be fixed.
FYI, amarok dropped support for recoding v1 tags recently.
Created attachment 13546
all in all, this looks like a dup of bug 90635
And WITH setDefaultTextEncoding, there is 0% chance of correctly parsing ID3v2 tag if it happened to be in encoding different from your "default".
No wonder there would be 3 dups a day for a no-go issue which gets ignored by devs.
It's hard to guess what you mean. Given a UTF-8 string with cyrillic chars, and defaultTextEncoding set to UTF-*, taglib writes perfectly valid v2 tag:
String ru("в чащах юга жил бы цитрус? да, но фальшивый экземпляр.", String::UTF8);
joo.mp3 : TIT2 (utf_16) [u'\u0432 \u0447\u0430\u0449\u0430\u0445 \u044e\u0433\u0430 \u0436\u0438\u043b \u0431\u044b \u0446\u0438\u0442\u0440\u0443\u0441? \u0434\u0430, \u043d\u043e \u0444\u0430\u043b\u044c\u0448\u0438\u0432\u044b\u0439 \u044d\u043a\u0437\u0435\u043c\u043f\u043b\u044f\u0440.']
Given a FILE with a perfectly valid UTF-16 ID3v2 tag, and defaultTextEncoding set to Latin-1, TagLib READS a perfectly b0rked string. Check for yourself; I don't have your toolset.
Don't ask me why defaultTextEncoding is set to Latin-1 - in 95% of apps it IS set to Latin-1. Ask yourself: why, oh god why, does this parameter override the 'encoding' field in the tag?
Checking, using taglib 1.4, in a UTF-8 xterm. The code is:
cout << "current title is: " << MPEG::File(argv[1]).tag()->title().to8Bit(String::UTF8) << endl;
First, the file that you provided as a test case:
% ./bug99149 taglib.mp3
current title is: Почему не в намордниках?
Next, a random file from my collection that happens to have UTF-8, but non-Cyrillic title:
% ./bug99149 Antique\ -\ Dinata\ Dinata\ \(Dance\ Mix\).mp3
current title is: Δυνατά Δυνατά (Dance Mix)
Antique - Dinata Dinata (Dance Mix).mp3 : TIT2 (utf_8) [u'\u0394\u03c5\u03bd\u03b1\u03c4\u03ac \u0394\u03c5\u03bd\u03b1\u03c4\u03ac (Dance Mix)']
Have you setDefaultTextEncoding to Latin-1 before trying that?
If yes, then I don't know. If no, I would say that it works fine until you call setDefaultTextEncoding, and 95% of apps seem to always call it at the very beginning.
The result is the same whether I set encoding explicitly to String::Latin1 or not (it's the default).
Problem solved, no?
I will check that again and post the results. Please tell me your exact version and additional patches, if any.
AND, there also is a problem with absence of ID3v2 Latin1 recoding facility (because these tags are known to contain, say, cp1251, which needs to be recoded).
taglib 1.4 built from NetBSD pkgsrc on NetBSD-current (3.99.10) by gcc 3.3.3. See http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/audio/taglib/.
The actual wanted result is:
if (is id3v1)
    if (Content is locale8bit valid)
else if (is id3v2)
    if (Content is locale8bit valid)
    else if (Content is defaultTextEncoding valid)
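The order of checks proposed above can be sketched in C++. This is only an illustration under stated assumptions: the pseudocode does not say what each branch does, so the sketch assumes a matching branch converts the content to UTF-8 and that unmatched content is returned untouched. Decoder, TagKind and decodeTagText are hypothetical names, and the validity test and converter are supplied by the caller rather than by TagLib.

```cpp
#include <functional>
#include <string>

enum class TagKind { ID3v1, ID3v2 };

// One candidate interpretation of the raw bytes: a validity test plus a
// converter to UTF-8. Both are hypothetical hooks supplied by the caller.
struct Decoder {
    std::function<bool(const std::string &)> isValid;
    std::function<std::string(const std::string &)> toUtf8;
};

// Try the locale 8-bit decoder first; for ID3v2 only, fall back to the
// default text encoding. If nothing matches, hand back the raw bytes.
std::string decodeTagText(TagKind kind, const std::string &raw,
                          const Decoder &locale8bit,
                          const Decoder &defaultEncoding) {
    if (locale8bit.isValid(raw))
        return locale8bit.toUtf8(raw);
    if (kind == TagKind::ID3v2 && defaultEncoding.isValid(raw))
        return defaultEncoding.toUtf8(raw);
    return raw;
}
```

Everything hard is hidden inside isValid: the function only fixes the order in which interpretations are tried.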
The real question is -- how to decide that "Content is locale8bit valid"?
echo $CONTENT | iconv -f `locale charmap` -t utf-8
Or, something like that. LC_CTYPE?
And GStreamer already provides an environment variable for this: GST_ID3_TAG_ENCODING.
If TagLib could follow this environment variable, it would be of great interest.
This just forces the decision onto the user -- she has to decide if broken tag was decoded correctly, in worst case -- for every file. Taglib should deliver raw data, not try to guess the intentions of software that produced broken tags.
> she has to decide if broken tag was decoded correctly, in worst
> case -- for every file.
The worst case you mentioned is very common in non-latin world, including CJK, Russian, Indian, etc.
> not try to guess the intentions of software that produced broken tags.
You mean, all of kio_mp3, amaroK, noatun, JuK, yammi ... should have the ability of selecting encoding for the same codebase?
The only reason why a software might need to select encoding is when it deals with mis-tagged files; once the tags are converted to Unicode, this reason disappears. AFAIK, there's no tagging software that can do such conversion en-masse; this may be the reason why a lot of users wish for "recoding" features in players.
Nope. ID3v1 should contain Latin-1 characters only, but in fact a lot of files use ID3v1 to store characters in the locale (ANSI) encoding. Besides that, a lot of ID3v2 tags are encoded in ANSI rather than UTF-8.
My suggestion in comment #16 is a quite reasonable solution. Of course, you could give it a different function name from the current one, such as GuessTagsEncoding or something like that.
I'm well aware that a lot of existing files have tags that technically violate ID3v1 standard. But since ID3v2 supports Unicode, there is no valid reason to violate the standard.
The valid reason is that no Windows application is aware of the ID3 standard, including Windows Media Player. And users are always complaining about the usability of Linux applications.
And, I don't think my suggestion will break id3 standard support in taglib.
What Windows applications don't handle Unicode in ID3v2? Shouldn't these apps be fixed in the first place?
The submitter of this bug has tested Windows Media Player (see his screenshots), and it handles Unicode just fine.
My point is -- the standard is made so that different applications will interoperate. As such, it should be enforced, and misbehaving applications should be fixed.
I'll just drop in here with a few notes:
- My opinion is mostly similar to Sergey's, which I've stated in other reports and thusfar didn't feel particularly motivated to repeat here. (One thing that you may want to note though -- our two "Ilya"s are not the same. This confused me at first too as at a quick glance one looks like a transliteration of the other.)
- Funda, yes, most Windows applications support ID3v2; as was said earlier, that includes WMP, Foobar 2000, iTunes, etc. I was also impressed to see that as of 5.0 iTunes finally updated their implementation to support ID3v2.4 (and as such UTF-8).
- From the way that I see things, it mostly seems to be that Winamp is broken and that's the core of the problem. All of the other mentioned things are command line Open Source utilities. If you want them fixed, please talk to the authors. Honestly, on Winamp, I just don't really care since well, it's dead.
- If there are other problems with interoperability in applications that actually are using unicode, feel free to report those separately.
- Encoding detection is non-trivial. And often slow. See, the problem is that validity isn't straight-forward. In encodings other than UTF-8, which can be legal or illegal, the others are basically just bitstreams. Finding out if they're more likely to be one encoding or another requires some knowledge of the languages often used by that encoding and even then, it's naturally just heuristics, not a 100% reliable thing. That might be useful for something that is working on one paragraph of text, but in something like TagLib, which is often reading 50,000 strings in a few seconds, it's just not acceptable.
Maybe you've misunderstood me. Windows Media Player handles both ID3v1 and ID3v2 tags perfectly, whether they are standard-compliant or not. That is the main point I want to make. And for MP3 files which only contain ID3v1 tags, WMP always guesses the encoding correctly.
As I said, an experimental encoding guessing method should be provided as the basis of a perfect solution.
"Perfection stands in the way of good"
also known as the 80%/20% rule
Mod #28 and #16 up!
#27: Everything is broken. Winamp, check. Lame, check. ID3v2 (as in utility), check.
Here is my guess about how WMP works. Since I have just 'fixed' all my
non-standard tagged files thanks to amarok, I found that all those non-unicode
text frames set their encoding field to ISO-8859-1, so I think WMP just handle
unicode text following the standard, and when the encoding field is latin1, it
use the locale encoding instead, since the locale encoding almost certainly
contain latin1 as a subset, the result cannot be worse. And since Windows
don't have a locale mess like linux now. It have a big chance to render the
Funda: help me to prove my theory, try some big-5 or sjis encoded tags in your WMP, see if it can display it correctly.
Илья, this isn't Slashdot and "mod them up" is hardly persuasive. I started to type a response, but well, it's just repeating what I said above, so I'm not going to bother. Similarly, repeating yourself here doesn't really make the case more persuasive. If you have more information to add, or if you'll respond to the individual points that I raised then we have some basis for a discussion on this topic. For the reasons that I pointed out above, I'm not yet convinced that there's an easy solution to this problem or that it's as critical as you seem to believe. And since it's me who will make the decision in the end, actually discussing this is more useful than just babbling.
My 5 eurocents:
There are a number of ways to present readable tags to the user:
-- #1, of course, is to convert them to Unicode (and drop v1 tags altogether). There are tools to do this, but I have not tested any of them: id3iconv (java, cli) and Unicode Rewriter (gui for id3iconv) -- http://www.cs.berkeley.edu/~zf/id3iconv/ and http://unicoderewriter.sourceforge.net/; id3mod (macos x) -- http://www.macupdate.com/info.php/id/15953; mp3 unicoder (.net) -- http://adam.theficus.com/development/UnicoderInstaller.msi
-- like #1, but also write v1 tags for legacy apps (portable player firmware, for example) in whatever locale encoding they happen to handle best. Amarok used to support this (the support was removed in 1.4-svn).
-- and for Winamp aficionados -- provide a playlist in #EXTM3U format (with track names listed in #EXTINF tags).
Scott Wheeler: The problem is: I can not see my tags at all. I see garbage instead of them. If THAT does not convince you then I don't know what would.
You can't use a tag library that can't provide readable tags.
And, in fact, Funda Wang provided a really helpful algorithm. I can add only 5 kopecks: you should convert from whatever locale is specified (I do not know where, maybe in a dot-file?) to UTF-8, not use locale8bittoutf8.
Илья, Funda Wang just provided an idea, not an algorithm; the problem is, as
Scott Wheeler pointed out, there is no feasible way to decide if a tag is
Sergey, I've tried some of these tools, some of them just change the content
to UTF-8, not the encoding field, so taglib will treat them as latin1 encoded
string, and provide totally unreadable garbage, and id3lib have some problem
with Unicode, it cannot write certain Unicode char to the tag, as mentioned in
the source code of kid3. I ended up writing a small tag conversion tool using taglib
myself, and I'm trying to convince the amarok developers to reintroduce
locale encoding support on the mailing list. But I agree with Funda here: taglib
needs some modification to provide the best result.
Scott, I suggest to treat the latin1 encoded ID3v2 tags as ID3v1, at least
provide a similar StringHandler abstraction in taglib. It should not introduce
any regression, and will make taglib much more flexible. And if you don't mind
the dirty tricks, I think taglib could provide a default implementation, guess
the encoding from the locale, or get the charset information from some
c#31: Sorry, I don't know any of the utilities which can tell me the exact id3 structures of my files.
c#35: There IS a feasible way to decide if a tag is local8bit valid. The iconv() function, as I said in c#16, is a perfect solution here. It will set errno if the string isn't valid in the specified charset.
Scott, if you are interested in implementing this via an environment variable, GST_ID3_TAG_ENCODING, as I said in #19, would be a good start. As it is locale-independent, it will make a lot of users happier.
Funda: iconv() can tell you if a tag is local8bit valid, but we cannot rely
on it to test if a tag is local8bit encoded. In a utf-8 locale any string
will be valid. For those relatively short text, like title and album, I guess
we can make up some big5 string that is valid in gbk.
Yes, single title/album/artist tag is too short to guess the encoding, but how about concatenating them together? There is no such case title/artist/album is using completely different encoding in one song.
Not all strings are valid in a UTF-8 locale. I think you've mixed up bytestreams and UTF-8 strings :) Furthermore, if the implementation is based on a separate environment variable rather than the locale, there is no need to worry about the locale problem. The most important thing is that this is an *experimental* guessing feature, which is allowed to guess wrong.
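The point that an arbitrary byte stream is usually not valid UTF-8 can be illustrated with a structural check. This is a rough sketch, not TagLib code and not a full validator: looksLikeUtf8 is a made-up name, and it only verifies that each lead byte is followed by the right number of continuation bytes (overlong forms and surrogates are not rejected).

```cpp
#include <cstddef>
#include <string>

// Structural UTF-8 check: every lead byte must be followed by the right
// number of continuation bytes (10xxxxxx). Overlong forms and surrogates
// are not rejected, so this is looser than a full validator.
bool looksLikeUtf8(const std::string &s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (b < 0x80)                extra = 0;
        else if ((b & 0xE0) == 0xC0) extra = 1;
        else if ((b & 0xF0) == 0xE0) extra = 2;
        else if ((b & 0xF8) == 0xF0) extra = 3;
        else return false;  // stray continuation byte or invalid lead byte
        if (extra > s.size() - i - 1) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

The cp1251 title bytes from id3v2.mp3 fail this check at the second byte, while the bytes D0 9F pass it even though they are also a legal pair of cp1251 letters. A passing check therefore says "could be UTF-8", not "is UTF-8", which is the ambiguity raised earlier in the thread for short strings.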
Funda: I use pytagger (http://www.liquidx.net/pytagger/) to dump ID3v2 structures.
LuRan: I am testing id3iconv now, will also test MP3Unicoder. So far, so good, more details to follow.
Funda: I admit I was a little bit slack here, but my point is guessing is
always too complex for taglib as a simple lib, not to mention the results are
usually unpredictable. And we don't need to guess, if a field says it is
encoded in unicode, then handle it in the standard way, if a field says it is
latin1, use the locale charset as a substitution. I think this will solve most
of the problem.
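The rule suggested here, trusting the encoding byte when it declares Unicode and substituting another 8-bit charset when it declares Latin-1, can be sketched as a small dispatch function. The enum values follow the ID3v2.4 text-encoding byte; DeclaredEncoding, decodeFrameText and both converter hooks are hypothetical names for illustration and are not part of TagLib.

```cpp
#include <functional>
#include <string>

// Text-encoding byte of an ID3v2.4 text frame.
enum class DeclaredEncoding { Latin1 = 0, UTF16 = 1, UTF16BE = 2, UTF8 = 3 };

using StandardConverter =
    std::function<std::string(DeclaredEncoding, const std::string &)>;
using SubstituteConverter = std::function<std::string(const std::string &)>;

// Frames that declare a Unicode encoding go through the standard
// converter. Frames that declare Latin-1 go through a caller-chosen
// 8-bit converter (for example the locale charset) instead.
std::string decodeFrameText(DeclaredEncoding declared, const std::string &raw,
                            const StandardConverter &standard,
                            const SubstituteConverter &latin1Substitute) {
    if (declared == DeclaredEncoding::Latin1)
        return latin1Substitute(raw);
    return standard(declared, raw);
}
```

No guessing is involved: the only input besides the bytes is the declared encoding, so the cost per string stays constant.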
Sergey: I never tried id3iconv, because I cannot find the java mp3 library :(.
LuRan: are you building from source? because the compiled version (id3iconv-0.2.1.jar) runs just fine.
LuRan -- just using the locale encoding doesn't really work either. If I send a file to someone using ISO-8859-1 to someone in a non-ISO-8859-1 locale, assuming that locale isn't a superset of ISO-8859-1 (most aren't) the characters will appear broken there too. Many encodings are a superset of ASCII 7-bit, not the 8-bit ISO-8859-1.
Sergey: I've just tried the jar, works pretty good, handy tool ;). But even
though there is a tool for recoding, There are still some situations you
cannot modify the file, listening to some read-only remote files for example,
so a recoding function during the tag reading is still necessary.
Scott: Sorry, I wasn't aware of that :(. Then that leaves us two choices: simply
provide an abstraction to let the application handle the encoding mess, or take
one step further, as Funda suggested, provide a default, (or just provide one,
not set it to default), implementation to convert the tag according to some
By read-only files, do you mean mp3 streams? I am not sure that mp3 streaming servers support Unicode in metadata at all...
Funda: have a look at Mozilla Universal Charset Detector (http://www.mozilla.org/projects/intl/chardet.html).
Sergey: No, Just downloadable mp3 files, like http://somewhere/some.mp3, or
URL in the m3u files. You could download it and use the modified local file,
but some people prefer online music.
These files have to have ID3v2 tags, then, since they're located at the start... but you've said (in amarok maillist) that even these are broken in China, right?
I don't know exactly what you mean, but the mp3 files I can find in China
usually have ID3v1 and ID3v2, both in locale encoding and the encoding field
in the ID3v2 tags are set to ISO-8859-1. and I think the requirement to run a
program to convert the tags every time a new mp3 is downloaded is already
unacceptable to a normal user.
I already mentioned this:
We read ~/.taglibrc at beginning, where we find:
ID3v1_encoding = cp1251
ID3v2_encoding = intag
ID3v2_latin1_recode = cp1251
And we actually recode all ID3v1 and all ISO-8859-1 ID3v2 tags as cp1251, regardless of whether we can do this correctly or not, just because the user told us to, and he might know. If he doesn't, it's his problem.
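Reading such a file takes only a few lines. A minimal sketch, assuming the "key = value" layout shown above; ~/.taglibrc is a proposal from this thread, not an existing TagLib feature, and parseTagRc is a made-up name. The function takes any input stream, so it works the same on a file or on a string.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Strip leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string &s) {
    const auto first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Parse "key = value" lines; lines without '=' are ignored.
std::map<std::string, std::string> parseTagRc(std::istream &in) {
    std::map<std::string, std::string> settings;
    std::string line;
    while (std::getline(in, line)) {
        const auto eq = line.find('=');
        if (eq == std::string::npos) continue;
        settings[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return settings;
}
```

Feeding the three lines above through a std::istringstream gives a map with three entries, for example "ID3v2_encoding" mapped to "intag". What the library then does with those values is the contested part of the proposal.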
How will application know that taglib is doing the translation? How does the user turn it off at runtime?
Also, not all iconv libraries are created equal (I'm talking about accuracy of translation tables). Thus, the result of translation should not be trusted -- treated as read-only, perhaps.
How about we just provide two interfaces: one gives the raw latin1 string to
the application, another returns the 'should-be-correctly-converted' string. At
least we will not kill any possibility here and will give the application a choice.
But as a first step, taglib should give the application a way to handle
the non-standard latin1 strings in ID3v2. With the current code we cannot tell,
when handling an ID3v2 tag, whether a string comes from a latin1-encoded field or a
unicode-encoded field, and thus cannot do any conversion even if we want to.
Has anything transpired about this problem? Many of my users are having trouble using UTF-8 strings in mp3 files as they seem to be misrepresented as latin1.
Is it correct (and supported) behaviour to manually set the encoding of mp3 files using the method described in #9? If this is the RightWayTM, can this method be exported in the C bindings?
I'd just mass-convert all tags to Unicode from whatever encoding they happen to be at the moment (and make sure that consumers of mp3 files support ID3v2 completely). id3iconv (written in Java) is good enough for this task.
Or am I misunderstanding your problem?
I guess I didn't mention this but I use Taglib in my music organizer, Cowbell (http://more-cowbell.org) and many of my users have been having problems writing UTF-8 strings in tags when the mp3's original encoding wasn't UTF-8. From my testing, it seems that TagLib can't/doesn't detect this problem and change the file's encoding accordingly.
My query was basically, is this a "feature", a bug, or something that can be rectified by using the method described in comment #9? And further, is this the recommended way for consumers of TagLib to correctly support Unicode?
By the way, I'm using the C bindings to TagLib, but I can ship C++/C glue if necessary.
> Or am I misunderstanding your problem?
Probably yes :p
The actual problem is that there are a lot of files which use non-Latin1 (ANSI) characters in ID3v1, or whose actual ID3v2 encoding does not match the encoding byte set in the ID3v2 tag. TagLib cannot read those tags correctly. It just complies with the ID3 standard, without considering any invalid situations.
*** Bug 142162 has been marked as a duplicate of this bug. ***
As of today, amaroK writes correct ID3v2.4 UTF-8 tags.
It still writes incorrect ID3v1 tags in something like UTF-16 (which it shouldn't), but correctly reads both ID3v2 UTF-16 and ID3v2.4 UTF-8.
I guess this bug could be closed after some checking.
SVN commit 768597 by wheeler:
Don't try to write non-Latin1 values to ID3v1 tags since ugly things will
happen when some of the characters are null. This behavior can still be customized
via the StringHandler.
M +5 -0 id3v1tag.cpp
WebSVN link: http://websvn.kde.org/?view=rev&revision=768597
I added some code to make TagLib automatically switch to writing unicode frames when the string is unicode (I tried to CC this bug, but pasted the wrong bug number in). At this point that's all I really want to invest into this, so for now I'm closing it as won't-fix.
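The idea of switching automatically can be sketched as follows. This is not the committed TagLib code, only an illustration of the rule as described in this comment: keep the configured default unless it is Latin-1 and the text contains a code point that Latin-1 cannot hold. FrameEncoding, isLatin1 and chooseFrameEncoding are names invented for the sketch.

```cpp
#include <string>

enum class FrameEncoding { Latin1, Unicode };

// True when every code point fits in ISO-8859-1 (U+0000..U+00FF).
bool isLatin1(const std::u32string &text) {
    for (char32_t cp : text)
        if (cp > 0xFF) return false;
    return true;
}

// Keep the configured default, except that a Latin-1 default is upgraded
// to a Unicode encoding when the text would not survive Latin-1.
FrameEncoding chooseFrameEncoding(const std::u32string &text,
                                  FrameEncoding configuredDefault) {
    if (configuredDefault == FrameEncoding::Latin1 && !isLatin1(text))
        return FrameEncoding::Unicode;
    return configuredDefault;
}
```

With this rule an application that never changes the default still writes readable Cyrillic titles, while plain Latin-1 titles are written exactly as before.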
Now it works more or less with unicode ID3v2, so please don't touch it :)
Having said that, thanx for unicode autoswitching.