200596 – [Patch] id3v1 japanese characters encoding

Bug 200596 - [Patch] id3v1 japanese characters encoding

Summary: [Patch] id3v1 japanese characters encoding

Status:	RESOLVED FIXED

Alias:	None

Product:	amarok
Classification:	Applications
Component:	Collection Browser (show other bugs)
Version:	2.3-GIT
Platform:	unspecified Linux

Importance:	NOR normal
Target Milestone:	2.2.2
Assignee:	Amarok Developers

URL:
Keywords:

Depends on:
Blocks:

Reported:	2009-07-17 22:06 UTC by Ignacio Serantes
Modified:	2010-01-15 23:36 UTC (History)
CC List:	8 users (show)

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Attachments
Fix the encoding problem in amarokcollectionscanner (802 bytes, patch) 2009-09-22 03:16 UTC, Cesar Garcia	Details
View All Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Ignacio Serantes 2009-07-17 22:06:29 UTC

Version:           2.1.1 (using 4.2.96 (KDE 4.2.96 (KDE 4.3 RC2)) "release 142", KDE:KDE4:Factory:Desktop / openSUSE_11.1)
Compiler:          gcc
OS:                Linux (x86_64) release 2.6.27.23-0.1-default

In last versions japanese and korean (and probably other UTF-8 characters) characters are not recognized from id3v1 and Amarok displays ???.

This don't happen with previous versions.

If you change names using Amarok id3v1 tags are deleted when japanese characters was available.

A few encoding test in Amarok with non english characters would be a good idea.

Comment 1 Myriam Schweingruber 2009-07-18 12:55:43 UTC

FWIW, you should use id3v2.4, v1 is very old already.

This is fixed in 2.2-SVN

Comment 2 Ignacio Serantes 2009-07-19 01:58:09 UTC

Well, the problem is that my mp3 has both versions so, if version 1 is very old, why Amarok look first in version 1 if version 2 is fine?

Comment 3 Myriam Schweingruber 2009-07-19 10:28:38 UTC

(In reply to comment #2)
> Well, the problem is that my mp3 has both versions so, if version 1 is very
> old, why Amarok look first in version 1 if version 2 is fine?

You should convert to one single format to avoid problems IMHO. Amarok scans for id3 tags in general, but for tag consistency make sure you have only one version.

Applications like kid3 or easytag allow you to convert to a newer version easily.

Comment 4 Ignacio Serantes 2009-07-19 12:25:13 UTC

I'm using kid3 for tagging and both tags have the same values. I can't only use one version because my car mp3 player software can't read version 2 tags.

I still can't see the problem of the next lines of code

if can't read_id3v2
  read_id3v1  
end

and the problem is solved.

Comment 5 Myriam Schweingruber 2009-07-19 12:30:10 UTC

Did you read what I said in comment #2, the part about: "that is solved in 2.2-SVN"?

Comment 6 Ignacio Serantes 2009-07-19 12:46:38 UTC

Yes, I'm reply comment #3 "you should convert to one single format"

Comment 7 Ignacio Serantes 2009-08-19 11:48:08 UTC

Bug is not fixed, because reading tags of some files fails.

I'm currently using Amarok 2.2-git and I have problems reading information with Amarok. Amarok 1.4, Exaile, mplayer and Kid3 have no problem with this files but Amarok can't read correctly.

I do the next test:
1) Generate tags with amarok: fails
2) Use Kid3 with UTF-8: fails
3) Convert to version 2.4: fails
4) Convert to version 2.3: fails
5) Erase 1.1 tag: fails

After all of these test I do the next process:
1) Erase Amarok database.
2) Run again Amarok.

This is a regresion because with Amarok 2.1 I have no problems with this files.

Comment 8 Ignacio Serantes 2009-08-21 11:27:22 UTC

I don't know it this can help but I'm using official openSUSE 11.1 taglib packages:

taglib 1.5-18.89
taglib-devel 1.5-18.89
taglib-extras 0.1.6-7.1
taglib-extras-devel 0.1.6-7.1

Comment 9 Ignacio Serantes 2009-08-23 16:06:20 UTC

When Amarok scans for music sometimes the next message appears on console:

TagLib: String::prepare() - Unicode conversion error.

Comment 10 Myriam Schweingruber 2009-09-01 10:30:57 UTC

Jeff, could you have a look at that eventually? Could be a taglib problem I guess.

Comment 11 Ignacio Serantes 2009-09-01 11:04:38 UTC

I don't know if the next can be useful but when I build Amarok the next warnings are displayed several times:

/usr/include/taglib/id3v1tag.h:61: warning: ‘class TagLib::ID3v1::StringHandler’ has virtual functions and accessible non-virtual destructor   

/usr/include/taglib/fileref.h:93: warning: ‘class TagLib::FileRef::FileTypeResolver’ has virtual functions and accessible non-virtual destructor

Comment 12 Ignacio Serantes 2009-09-01 17:31:09 UTC

Looking for taglib I found the next page

http://developer.kde.org/~wheeler/taglib.html

and I try adding bad files to JuK, a test that I don't done before, and JuK reads tags without problems.

Comment 13 Jeff Mitchell 2009-09-02 19:57:09 UTC

This is probably not related to my stuff but rather Peter's work with the charset detector.

I would strongly suggest providing one of the affected files for us to look at, since without that I can't even begin to hazard a guess what the issue is.

I can tell you that:
a) AFAIK Amarok does prefer ID3v2 tags, and
b) UTF-8 is never valid in ID3v1

Comment 14 Ignacio Serantes 2009-09-03 19:17:02 UTC

(In reply to comment #13)
> This is probably not related to my stuff but rather Peter's work with the
> charset detector.
> 
> I would strongly suggest providing one of the affected files for us to look at,
> since without that I can't even begin to hazard a guess what the issue is.
> 
> I can tell you that:
> a) AFAIK Amarok does prefer ID3v2 tags, and
> b) UTF-8 is never valid in ID3v1

I have no problem in send one of this files.

I can tell you that:
a) This must be the correct behaviour, I think.
b) kid3, id3v2 and id3info supports UTF8 in ID3v1 and, based in a) these wrong? encoding must be ignored by Amarok because it prefers ID3v2.

On the other side, I don't detect any prohibition about UTF-8 usage here http://mpgedit.org/mpgedit/mpeg_format/mpeghdr.htm but seems to be assumed that ID3v1 use ISO-8859-1 but not for all programs.

Comment 15 Jeff Mitchell 2009-09-03 20:42:19 UTC

Note the caveat in the document: "I do not claim information presented in this document is accurate. At first I just gathered it from different sources. It was not an easy task but I needed it. Later, I received lots of comments as feedback when I published this document. I think this last release is highly accurate due to comments and corrections I received. "

The comment entirely backs up the point I was going to make, which is: there is no official ID3v1 spec. There never was. There's just a mishmash of specs various people have written up and what people end up using and hope that it works with other players. Sometimes it does, sometimes it doesn't. Following on, UTF-8 is definitely not at all standardized in ID3v1.

The reason so may (especially cheap) hardware players use ID3v1 is because it's a fixed number of bytes at the end of the file, which makes hardware implementation rather trivial. But that doesn't change that it's a hardware implementation of a non-spec.

Since you can't provide files exhibiting problems that Peter can look at with the charset detector, marking as INVALID...if you can drum up some files, re-open.

Comment 16 Ignacio Serantes 2009-09-03 22:07:39 UTC

My english must be deteriorating because I have bad files and I can send it. I don't know why do you think that "I can't provide files'. 

Tell me where I must send it.

Comment 17 Peter ZHOU 2009-09-04 03:03:56 UTC

send a copy to me please: peterzhoulei AT gmail
Thanks.

Comment 18 Ignacio Serantes 2009-09-04 15:45:09 UTC

(In reply to comment #17)
> send a copy to me please: peterzhoulei AT gmail
> Thanks.

I send you the mp3. My database is in a Mysql server so, if you wish, I can add an account for you.

Comment 19 Ignacio Serantes 2009-09-04 21:34:01 UTC

I'm learning a little more about Amarok and I found amarokcollectionscanner :). This is the output for the file I send you and for the second track of the single. I don't understand why scanner fails with the first and works with the second.

<tags title="????" compilation="checkforvarious" bitrate="320" uniqueid="amarok-sqltrackuid://e18c9f1aa9ee0fe3c35d9f4d78f2d3f1" filetype="0" track="1" artist="????" path="/storage/storage/music/mp3/jpop/高田梢枝/Eureka 7 ED Single - 秘密基地/01 - Himitsu Kichi.mp3" filesize="11850591" length="296" samplerate="44100" album="????" comment="" genre="Jpop" year="2005" audioproperties="true" />

<tags title="インコ" compilation="checkforvarious" bitrate="320" uniqueid="amarok-sqltrackuid://f1359d5f9b0992a7feaf2b3069d890b4" filetype="0" track="2" artist="高田梢枝" path="/storage/storage/music/mp3/jpop/高田梢枝/Eureka 7 ED Single - 秘密基地/02 - Inko.mp3" filesize="15724544" length="393" samplerate="44100" album="秘密基地" comment="" genre="Jpop" year="2005" audioproperties="true" />

Comment 20 Cesar Garcia 2009-09-22 03:15:04 UTC

Thanks at #19 for the tip. Looks like the charset-detector routine in CollectionScanner.cpp is broken, but only because already UTF(8-16) strings are passed to the chardet_get_charset function and because those strings are missing the unicode BOM the function fails at identify the string as UTF-16 (on my particular case) and gets a incorrect encoding (gb18030) then encodes the string incorrectly as utf8 and obtains garbage. I made a patch that skip the charset-detector part if a non Latin1 string is detected and now my collection is displayed correctly, i am not sure if this works for all the encoding, but for my collection of japanese UTF-16, iso8859-1 and some Shift_JIS tags it works perfectly.

Comment 21 Cesar Garcia 2009-09-22 03:16:50 UTC

Created attachment 37107 [details]
Fix the encoding problem in amarokcollectionscanner

Comment 22 Myriam Schweingruber 2009-09-22 10:41:15 UTC

Jeff, could you please have a look at this patch?

Comment 23 Ignacio Serantes 2009-09-22 10:42:33 UTC

Cesar I try your patch and works. Great work! :D.

Comment 24 Jeff Mitchell 2009-09-22 13:18:18 UTC

I'm not the right person to look at this. Peter?

Comment 25 Myriam Schweingruber 2009-10-11 11:00:32 UTC

*** Bug 210101 has been marked as a duplicate of this bug. ***

Comment 26 Rick W. Chen 2009-10-13 11:06:53 UTC

I can confirm Cesar's patch works for me. All the ??? in 
the collection are now shown correctly after a full rescan.

Comment 27 Ignacio Serantes 2009-10-13 12:09:59 UTC

Two confirmations tha patch works, three if we count Cesar, but Peter seems to be so busy to test the patch.

Is there any way that I can help to confirm that the patch is correct?

Comment 28 Ignacio Serantes 2009-11-13 13:46:03 UTC

Any news about this bug. Was reported five months ago and even there is a patch that solves the problem.

In my home I compile Amarok so I can patch it but I can't do that in my office.

Comment 29 Jeff Mitchell 2009-11-13 14:34:24 UTC

Hi,

We haven't heard back from Peter about this and I'm hesitant to use the patch because latin1 doesn't cover all characters that may be seen (such as Cyrillic characters). But I also admit I'm not entirely sure why the code is doing what it does -- for instance, Amarok supports UTF-8 strings, so why does it convert the strings to unicode, then back to latin1? Overall, I'm not really comfortable with forcing this on users, even though it does help some.

Yesterday I independently added the ability to skip the charset detector during scan; for people with properly tagged files this should prevent issues like this from happening. It's off by default (meaning don't run the charset detector) although for people with existing Amarok installs it defaults to on (to keep the current behavior). You can turn it off by going to Settings->Collection.

I couldn't do this earlier because we were in string freeze for 2.2.1 and I couldn't introduce the new string. But this will be in 2.2.2 and is in current Git.

I'm still open to applying this patch, but I need someone in the know to convince me why it's right, other than that it works for your particular files.  :-)  Aren't there encodings for which they could be handled but with charsets outside of latin1? And, does anyone know why it converts to utf8 and then back to latin1, and if this is being a detriment or if simply removing those toLatin1 calls would possibly fix this?

Comment 30 Ignacio Serantes 2009-11-13 15:25:38 UTC

Jeff, thank you for your effort. I will try your changes this weekend.

Comment 31 Cesar Garcia 2009-11-15 18:05:11 UTC

I dont know really if the problem is more of taglib or amarok. Seeing the code of the ::prepare func in taglib/toolkit/tstring.cpp, it strips the unicode BOM if it finds one, making the charset-detector useless (because it needs the BOM to identify the string as UTF) then returning a random encoding. The solution IMO is:

A) making taglib to return the untouched version of the string tag (with the BOM if it have one) to make the charset-detector more reliable (probably not the best solution as this changes the expected behaviour in other apps than uses taglib)

B) that taglib provides a function to check if the original string was an UTF one (8 or 16) so amarok can skip the charset-detector when isnt needed. I am already doing this with the patch that i provided: the isLatin1 func of taglib merely tests if the string isnt unicode (all chars < 256), so amarok can choose to skip the charset-detector.

P.D.: there is a almost duplicate code of the collectionbrowser in src/meta/file/File_p.h (around line 227) that i dont know if it can trigger this bug in another part of amarok.

Sorry for my english.

Comment 32 Myriam Schweingruber 2009-11-16 15:59:59 UTC

Changing target

Comment 33 Ignacio Serantes 2009-11-20 13:57:19 UTC

Sorry Jeff, I forgot to write a report last week :(

But the good news is that your trick works fine so this confirm that bug is located in encoding detection system that, fails with with certain characters.

If you, or Cesar García, need my help, please, no doubt in contact to me. Thank you to both.

Comment 34 Al Bogner 2009-12-04 14:14:30 UTC

There might be a connection to this bug: https://bugs.kde.org/show_bug.cgi?id=217237

But a lot things are different here. First I have to say, that I have a few songs with Japanese, Arabic and Russian signs in UTF-8 and it is displayed with "????"

For my test scenario I ripped a cd, which is _partially_ ok with Greek signs. There are no old versions of applications. For versions see other bug. AFAIK there is only ID3v2.4 which supports UTF-8, older versions do _not_ support UTF-8 officially, but UTF-16. Compare http://en.wikipedia.org/wiki/ID3

For my test, I didn't use picard (http://en.wikipedia.org/wiki/MusicBrainz_Picard), which I use normally and I did _not_ include it in the collection, but played the files with drag and drop from the desktop. So the question is, if it is related to amarokcollectionscanner.

Also I have to say, I have a lot of files with Greek signs and everything is ok.

To be sure, that ID3v2.4 is used, when I write the tags with python-eyeD3-0.6.17-0.pm.3.1 I use:

eyeD3 --to-v2.4 "$MP3FILE"

eyeD3 --set-encoding=utf8 -a "$CDDBARTISTTAG" -A "$CDDBALBUMTAG" -t "$CDDBTITLETAG" -n "$TRACKNO" -c ::"$COMMENT" "$MP3FILE"

eyeD3 --remove-v1 "$MP3FILE"

eyeD3 is the only application I know, which can write embedded cover images, which are not displayed in Amarok 2 anymore, while it works fine with Amarok 1.4

When I created the mp3s taglib-1.6.1-8.1 and taglib-sharp-2.0.3.2-0.1.1 was installed on the openSUSE 11.1 64bit-system.

Let me know, if you have any further questions.

Comment 35 Jeff Mitchell 2009-12-04 14:20:33 UTC

What exactly is it you're trying to say in the above post? I don't have any idea what you're getting at.

Comment 36 Al Bogner 2009-12-04 16:13:38 UTC

I want to say, that my files, which don't work, are different:

Comment #1 From  Myriam Schweingruber   2009-07-18 12:55:43  (-) [reply]
FWIW, you should use id3v2.4, v1 is very old already.

I use id3v2.4

Comment #19 From  Ignacio Serantes   2009-09-04 21:34:01  (-) [reply]
... I found amarokcollectionscanner :).
... I don't understand why scanner fails with the first and works with the
second

The difference is, that I did _not_ import the files into the collection and I have UTF-8 files too, which are ok, while others are not. And this includes files from the same cd. Compare the screenshots in http://forum.kde.org/viewtopic.php?f=116&t=83916&start=10#p139675


Comment #33 From  Ignacio Serantes   2009-11-20 13:57:19  (-) [reply] -------  
so this confirm that bug is located in encoding detection system that, fails with with certain characters.

The same character is recognized sometimes correctly and sometimes not. Again see screenshots.

HTH, but maybe the bug is already solved. I cannot install amarok-git and to wait, that it is released in KDE-Factory.

Comment 37 Jeff Mitchell 2009-12-04 16:21:25 UTC

Again, I still don't understand the problem. Other people have UTF-8 as well. And if you're not importing into the collection, where are you running into this problem?

Comment 38 Al Bogner 2009-12-04 17:05:51 UTC

I have only UTF-8 coded files and a lot of UTF-8-files with non Latin characters work fine with Amarok 2, but about 2% don't. So I don't say, that UTF-8 doesn't work generally.

All I want to do, is to help you, to solve a bug. Maybe my comments help, maybe not. Maybe you get new ideas, what could be the problem. I cannot install amarok-git, so I cannot say, if the bug is solved in the meantime.

The problem exists when I import the file into the collection, but the problem exists too, when I do not import the file into the collection, but play it directly. So this could be a big difference, when I follow this thread.

I mean I take a folder with mp3-files and and drag it over the amarok icon. This means, that these files from the folder are not under "local music" but in the playlist. And this playlist shows questionmarks. Compare http://web.utanet.at/winter18/amarok2_charset_problem.png You see too, that there are some files with UTF8-tags which are perfect.

While the same files from a nfs-volume are always ok with Amarok 1.4. See http://web.utanet.at/winter18/amarok1.4_charset_ok.png

On the right side, you see the playlist with files, which are not in the collection. On the left side you see the collection, which contains questionmarks too, but the files I mentioned, are not in the collection.

If you like some testfiles, let me know where to send it.

Comment 39 Jeff Mitchell 2009-12-04 17:31:03 UTC

(In reply to comment #38)
> The problem exists when I import the file into the collection, but the problem
> exists too, when I do not import the file into the collection, but play it
> directly. So this could be a big difference, when I follow this thread.

No, it's the exact same thing, because the charset detector is used in multiple places within Amarok. Maybe someone will change the other locations to respect the new toggle before 2.2.

Comment 40 Peter ZHOU 2009-12-07 04:47:26 UTC

(In reply to comment #39)

> No, it's the exact same thing, because the charset detector is used in multiple
> places within Amarok. Maybe someone will change the other locations to respect
> the new toggle before 2.2.

It exists in the scanner and in amarok itself. Just two places.

Comment 41 Jeff Mitchell 2009-12-20 18:00:45 UTC

Executive decision made: default is now for the charset detector to be off. You have to explicitly enable it. That should fix this problem...

Comment 42 bugs-kde 2010-01-15 23:23:26 UTC

What about the old good manual recoding?

Suppose that in the "Edit Track Details" dialog, for each tag value text box, in its right-click menu, there's a "recode from charset:" submenu.
The submenu lists all kinds of encodings (like in Firefox's "View"->"Character Encoding" submenu).

Choosing an encoding performs a recode on the current value. Now the user can close the dialog using "Save & Close" or "Cancel" depending on whether he is satisfied with the result.

This way, you get the basic functionality (which is currently non existent).

Note that you can gradually add more intelligence based on that:

1) By reusing the charset detector code (I suppose), you can supply a list of the most probable encodings based on the original bytes string and promote these encodings to the immediately highest submenu level - while all the other encodings would be buried deeper in a Firefox-style "More Encodings" sub-submenu.
2) I suppose you can plug in some neural network or a simple dynamic rules engine that would learn the user's previous encoding choices and promote the most probable encodings in the menu structure during following invocations.

Not being a QT/KDE developer, I cannot assess how hard would that be but it sure sounds doable to me.

Comment 43 bugs-kde 2010-01-15 23:27:46 UTC

Concerning UI design, using a right-click menu is not a very good idea. It's not obviously exposed to the user and will probably go unnoticed.

But using a menu button, placed next to each editable text box, would cut the mustard.

Comment 44 bugs-kde 2010-01-15 23:36:43 UTC

Since this bug is closed and my proposal being a completely different approach to solving the problem, I've opened a new wish (bug 222912) for that.