Version: 2.1.1 (using 4.2.96 (KDE 4.2.96 (KDE 4.3 RC2)) "release 142", KDE:KDE4:Factory:Desktop / openSUSE_11.1) Compiler: gcc OS: Linux (x86_64) release 2.6.27.23-0.1-default In last versions japanese and korean (and probably other UTF-8 characters) characters are not recognized from id3v1 and Amarok displays ???. This don't happen with previous versions. If you change names using Amarok id3v1 tags are deleted when japanese characters was available. A few encoding test in Amarok with non english characters would be a good idea.
FWIW, you should use id3v2.4, v1 is very old already. This is fixed in 2.2-SVN
Well, the problem is that my mp3 has both versions so, if version 1 is very old, why Amarok look first in version 1 if version 2 is fine?
(In reply to comment #2) > Well, the problem is that my mp3 has both versions so, if version 1 is very > old, why Amarok look first in version 1 if version 2 is fine? You should convert to one single format to avoid problems IMHO. Amarok scans for id3 tags in general, but for tag consistency make sure you have only one version. Applications like kid3 or easytag allow you to convert to a newer version easily.
I'm using kid3 for tagging and both tags have the same values. I can't only use one version because my car mp3 player software can't read version 2 tags. I still can't see the problem of the next lines of code if can't read_id3v2 read_id3v1 end and the problem is solved.
Did you read what I said in comment #2, the part about: "that is solved in 2.2-SVN"?
Yes, I'm reply comment #3 "you should convert to one single format"
Bug is not fixed, because reading tags of some files fails. I'm currently using Amarok 2.2-git and I have problems reading information with Amarok. Amarok 1.4, Exaile, mplayer and Kid3 have no problem with this files but Amarok can't read correctly. I do the next test: 1) Generate tags with amarok: fails 2) Use Kid3 with UTF-8: fails 3) Convert to version 2.4: fails 4) Convert to version 2.3: fails 5) Erase 1.1 tag: fails After all of these test I do the next process: 1) Erase Amarok database. 2) Run again Amarok. This is a regresion because with Amarok 2.1 I have no problems with this files.
I don't know it this can help but I'm using official openSUSE 11.1 taglib packages: taglib 1.5-18.89 taglib-devel 1.5-18.89 taglib-extras 0.1.6-7.1 taglib-extras-devel 0.1.6-7.1
When Amarok scans for music sometimes the next message appears on console: TagLib: String::prepare() - Unicode conversion error.
Jeff, could you have a look at that eventually? Could be a taglib problem I guess.
I don't know if the next can be useful but when I build Amarok the next warnings are displayed several times: /usr/include/taglib/id3v1tag.h:61: warning: ‘class TagLib::ID3v1::StringHandler’ has virtual functions and accessible non-virtual destructor /usr/include/taglib/fileref.h:93: warning: ‘class TagLib::FileRef::FileTypeResolver’ has virtual functions and accessible non-virtual destructor
Looking for taglib I found the next page http://developer.kde.org/~wheeler/taglib.html and I try adding bad files to JuK, a test that I don't done before, and JuK reads tags without problems.
This is probably not related to my stuff but rather Peter's work with the charset detector. I would strongly suggest providing one of the affected files for us to look at, since without that I can't even begin to hazard a guess what the issue is. I can tell you that: a) AFAIK Amarok does prefer ID3v2 tags, and b) UTF-8 is never valid in ID3v1
(In reply to comment #13) > This is probably not related to my stuff but rather Peter's work with the > charset detector. > > I would strongly suggest providing one of the affected files for us to look at, > since without that I can't even begin to hazard a guess what the issue is. > > I can tell you that: > a) AFAIK Amarok does prefer ID3v2 tags, and > b) UTF-8 is never valid in ID3v1 I have no problem in send one of this files. I can tell you that: a) This must be the correct behaviour, I think. b) kid3, id3v2 and id3info supports UTF8 in ID3v1 and, based in a) these wrong? encoding must be ignored by Amarok because it prefers ID3v2. On the other side, I don't detect any prohibition about UTF-8 usage here http://mpgedit.org/mpgedit/mpeg_format/mpeghdr.htm but seems to be assumed that ID3v1 use ISO-8859-1 but not for all programs.
Note the caveat in the document: "I do not claim information presented in this document is accurate. At first I just gathered it from different sources. It was not an easy task but I needed it. Later, I received lots of comments as feedback when I published this document. I think this last release is highly accurate due to comments and corrections I received. " The comment entirely backs up the point I was going to make, which is: there is no official ID3v1 spec. There never was. There's just a mishmash of specs various people have written up and what people end up using and hope that it works with other players. Sometimes it does, sometimes it doesn't. Following on, UTF-8 is definitely not at all standardized in ID3v1. The reason so may (especially cheap) hardware players use ID3v1 is because it's a fixed number of bytes at the end of the file, which makes hardware implementation rather trivial. But that doesn't change that it's a hardware implementation of a non-spec. Since you can't provide files exhibiting problems that Peter can look at with the charset detector, marking as INVALID...if you can drum up some files, re-open.
My english must be deteriorating because I have bad files and I can send it. I don't know why do you think that "I can't provide files'. Tell me where I must send it.
send a copy to me please: peterzhoulei AT gmail Thanks.
(In reply to comment #17) > send a copy to me please: peterzhoulei AT gmail > Thanks. I send you the mp3. My database is in a Mysql server so, if you wish, I can add an account for you.
I'm learning a little more about Amarok and I found amarokcollectionscanner :). This is the output for the file I send you and for the second track of the single. I don't understand why scanner fails with the first and works with the second. <tags title="????" compilation="checkforvarious" bitrate="320" uniqueid="amarok-sqltrackuid://e18c9f1aa9ee0fe3c35d9f4d78f2d3f1" filetype="0" track="1" artist="????" path="/storage/storage/music/mp3/jpop/高田梢枝/Eureka 7 ED Single - 秘密基地/01 - Himitsu Kichi.mp3" filesize="11850591" length="296" samplerate="44100" album="????" comment="" genre="Jpop" year="2005" audioproperties="true" /> <tags title="インコ" compilation="checkforvarious" bitrate="320" uniqueid="amarok-sqltrackuid://f1359d5f9b0992a7feaf2b3069d890b4" filetype="0" track="2" artist="高田梢枝" path="/storage/storage/music/mp3/jpop/高田梢枝/Eureka 7 ED Single - 秘密基地/02 - Inko.mp3" filesize="15724544" length="393" samplerate="44100" album="秘密基地" comment="" genre="Jpop" year="2005" audioproperties="true" />
Thanks at #19 for the tip. Looks like the charset-detector routine in CollectionScanner.cpp is broken, but only because already UTF(8-16) strings are passed to the chardet_get_charset function and because those strings are missing the unicode BOM the function fails at identify the string as UTF-16 (on my particular case) and gets a incorrect encoding (gb18030) then encodes the string incorrectly as utf8 and obtains garbage. I made a patch that skip the charset-detector part if a non Latin1 string is detected and now my collection is displayed correctly, i am not sure if this works for all the encoding, but for my collection of japanese UTF-16, iso8859-1 and some Shift_JIS tags it works perfectly.
Created attachment 37107 [details] Fix the encoding problem in amarokcollectionscanner
Jeff, could you please have a look at this patch?
Cesar I try your patch and works. Great work! :D.
I'm not the right person to look at this. Peter?
*** Bug 210101 has been marked as a duplicate of this bug. ***
I can confirm Cesar's patch works for me. All the ??? in the collection are now shown correctly after a full rescan.
Two confirmations tha patch works, three if we count Cesar, but Peter seems to be so busy to test the patch. Is there any way that I can help to confirm that the patch is correct?
Any news about this bug. Was reported five months ago and even there is a patch that solves the problem. In my home I compile Amarok so I can patch it but I can't do that in my office.
Hi, We haven't heard back from Peter about this and I'm hesitant to use the patch because latin1 doesn't cover all characters that may be seen (such as Cyrillic characters). But I also admit I'm not entirely sure why the code is doing what it does -- for instance, Amarok supports UTF-8 strings, so why does it convert the strings to unicode, then back to latin1? Overall, I'm not really comfortable with forcing this on users, even though it does help some. Yesterday I independently added the ability to skip the charset detector during scan; for people with properly tagged files this should prevent issues like this from happening. It's off by default (meaning don't run the charset detector) although for people with existing Amarok installs it defaults to on (to keep the current behavior). You can turn it off by going to Settings->Collection. I couldn't do this earlier because we were in string freeze for 2.2.1 and I couldn't introduce the new string. But this will be in 2.2.2 and is in current Git. I'm still open to applying this patch, but I need someone in the know to convince me why it's right, other than that it works for your particular files. :-) Aren't there encodings for which they could be handled but with charsets outside of latin1? And, does anyone know why it converts to utf8 and then back to latin1, and if this is being a detriment or if simply removing those toLatin1 calls would possibly fix this?
Jeff, thank you for your effort. I will try your changes this weekend.
I dont know really if the problem is more of taglib or amarok. Seeing the code of the ::prepare func in taglib/toolkit/tstring.cpp, it strips the unicode BOM if it finds one, making the charset-detector useless (because it needs the BOM to identify the string as UTF) then returning a random encoding. The solution IMO is: A) making taglib to return the untouched version of the string tag (with the BOM if it have one) to make the charset-detector more reliable (probably not the best solution as this changes the expected behaviour in other apps than uses taglib) B) that taglib provides a function to check if the original string was an UTF one (8 or 16) so amarok can skip the charset-detector when isnt needed. I am already doing this with the patch that i provided: the isLatin1 func of taglib merely tests if the string isnt unicode (all chars < 256), so amarok can choose to skip the charset-detector. P.D.: there is a almost duplicate code of the collectionbrowser in src/meta/file/File_p.h (around line 227) that i dont know if it can trigger this bug in another part of amarok. Sorry for my english.
Changing target
Sorry Jeff, I forgot to write a report last week :( But the good news is that your trick works fine so this confirm that bug is located in encoding detection system that, fails with with certain characters. If you, or Cesar García, need my help, please, no doubt in contact to me. Thank you to both.
There might be a connection to this bug: https://bugs.kde.org/show_bug.cgi?id=217237 But a lot things are different here. First I have to say, that I have a few songs with Japanese, Arabic and Russian signs in UTF-8 and it is displayed with "????" For my test scenario I ripped a cd, which is _partially_ ok with Greek signs. There are no old versions of applications. For versions see other bug. AFAIK there is only ID3v2.4 which supports UTF-8, older versions do _not_ support UTF-8 officially, but UTF-16. Compare http://en.wikipedia.org/wiki/ID3 For my test, I didn't use picard (http://en.wikipedia.org/wiki/MusicBrainz_Picard), which I use normally and I did _not_ include it in the collection, but played the files with drag and drop from the desktop. So the question is, if it is related to amarokcollectionscanner. Also I have to say, I have a lot of files with Greek signs and everything is ok. To be sure, that ID3v2.4 is used, when I write the tags with python-eyeD3-0.6.17-0.pm.3.1 I use: eyeD3 --to-v2.4 "$MP3FILE" eyeD3 --set-encoding=utf8 -a "$CDDBARTISTTAG" -A "$CDDBALBUMTAG" -t "$CDDBTITLETAG" -n "$TRACKNO" -c ::"$COMMENT" "$MP3FILE" eyeD3 --remove-v1 "$MP3FILE" eyeD3 is the only application I know, which can write embedded cover images, which are not displayed in Amarok 2 anymore, while it works fine with Amarok 1.4 When I created the mp3s taglib-1.6.1-8.1 and taglib-sharp-2.0.3.2-0.1.1 was installed on the openSUSE 11.1 64bit-system. Let me know, if you have any further questions.
What exactly is it you're trying to say in the above post? I don't have any idea what you're getting at.
I want to say, that my files, which don't work, are different: Comment #1 From Myriam Schweingruber 2009-07-18 12:55:43 (-) [reply] FWIW, you should use id3v2.4, v1 is very old already. I use id3v2.4 Comment #19 From Ignacio Serantes 2009-09-04 21:34:01 (-) [reply] ... I found amarokcollectionscanner :). ... I don't understand why scanner fails with the first and works with the second The difference is, that I did _not_ import the files into the collection and I have UTF-8 files too, which are ok, while others are not. And this includes files from the same cd. Compare the screenshots in http://forum.kde.org/viewtopic.php?f=116&t=83916&start=10#p139675 Comment #33 From Ignacio Serantes 2009-11-20 13:57:19 (-) [reply] ------- so this confirm that bug is located in encoding detection system that, fails with with certain characters. The same character is recognized sometimes correctly and sometimes not. Again see screenshots. HTH, but maybe the bug is already solved. I cannot install amarok-git and to wait, that it is released in KDE-Factory.
Again, I still don't understand the problem. Other people have UTF-8 as well. And if you're not importing into the collection, where are you running into this problem?
I have only UTF-8 coded files and a lot of UTF-8-files with non Latin characters work fine with Amarok 2, but about 2% don't. So I don't say, that UTF-8 doesn't work generally. All I want to do, is to help you, to solve a bug. Maybe my comments help, maybe not. Maybe you get new ideas, what could be the problem. I cannot install amarok-git, so I cannot say, if the bug is solved in the meantime. The problem exists when I import the file into the collection, but the problem exists too, when I do not import the file into the collection, but play it directly. So this could be a big difference, when I follow this thread. I mean I take a folder with mp3-files and and drag it over the amarok icon. This means, that these files from the folder are not under "local music" but in the playlist. And this playlist shows questionmarks. Compare http://web.utanet.at/winter18/amarok2_charset_problem.png You see too, that there are some files with UTF8-tags which are perfect. While the same files from a nfs-volume are always ok with Amarok 1.4. See http://web.utanet.at/winter18/amarok1.4_charset_ok.png On the right side, you see the playlist with files, which are not in the collection. On the left side you see the collection, which contains questionmarks too, but the files I mentioned, are not in the collection. If you like some testfiles, let me know where to send it.
(In reply to comment #38) > The problem exists when I import the file into the collection, but the problem > exists too, when I do not import the file into the collection, but play it > directly. So this could be a big difference, when I follow this thread. No, it's the exact same thing, because the charset detector is used in multiple places within Amarok. Maybe someone will change the other locations to respect the new toggle before 2.2.
(In reply to comment #39) > No, it's the exact same thing, because the charset detector is used in multiple > places within Amarok. Maybe someone will change the other locations to respect > the new toggle before 2.2. It exists in the scanner and in amarok itself. Just two places.
Executive decision made: default is now for the charset detector to be off. You have to explicitly enable it. That should fix this problem...
What about the old good manual recoding? Suppose that in the "Edit Track Details" dialog, for each tag value text box, in its right-click menu, there's a "recode from charset:" submenu. The submenu lists all kinds of encodings (like in Firefox's "View"->"Character Encoding" submenu). Choosing an encoding performs a recode on the current value. Now the user can close the dialog using "Save & Close" or "Cancel" depending on whether he is satisfied with the result. This way, you get the basic functionality (which is currently non existent). Note that you can gradually add more intelligence based on that: 1) By reusing the charset detector code (I suppose), you can supply a list of the most probable encodings based on the original bytes string and promote these encodings to the immediately highest submenu level - while all the other encodings would be buried deeper in a Firefox-style "More Encodings" sub-submenu. 2) I suppose you can plug in some neural network or a simple dynamic rules engine that would learn the user's previous encoding choices and promote the most probable encodings in the menu structure during following invocations. Not being a QT/KDE developer, I cannot assess how hard would that be but it sure sounds doable to me.
Concerning UI design, using a right-click menu is not a very good idea. It's not obviously exposed to the user and will probably go unnoticed. But using a menu button, placed next to each editable text box, would cut the mustard.
Since this bug is closed and my proposal being a completely different approach to solving the problem, I've opened a new wish (bug 222912) for that.