Bug 116458

Summary: Non-ASCII strings in Media Library: lowercase/uppercase conversion not working
Product: [Applications] amarok Reporter: Thomas Herzog <thomas.herzog>
Component: generalAssignee: Amarok Developers <amarok-bugs-dist>
Status: RESOLVED INTENTIONAL    
Severity: normal CC: jiha.kde.bugzilla, stanley_87
Priority: NOR    
Version: 1.3.6   
Target Milestone: ---   
Platform: unspecified   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:

Description Thomas Herzog 2005-11-16 00:24:24 UTC
Version:           1.3.6 (using KDE KDE 3.5.0)
Installed from:    Unlisted Binary Package

Hi,

I am currently using amaroK 1.3.6 "Guru's binaries" on SuSE 10 with KDE 3.5 RC1.

Having amaroK configured with mySQL, it will NOT filter Non-Ascii characters correctly.

For example:

If I enter "die ärzte" in Media Library searchbar, it will NOT show the group "Die Ärzte".

I can reduce this to the fact that seemingly amaroK does NOT do correct uppercase/lowercase conversion for Non-ASCII characters.

This can be regarded as a severe bug, as this issues affects users all around the world.
Comment 1 Roland Wolters 2005-11-17 16:42:10 UTC
Can confirm this here: Amarok should realize that these two letters are just uppercase and lowercase, but it treats them as if they were different letters.

System:
Suse Linux 10.0, KDE 3.5rc1
Comment 2 Thiago Macieira 2005-11-17 16:57:13 UTC
In other words, amaroK should use QString::localeAwareCompare.
Comment 3 Thomas Herzog 2005-11-17 17:29:20 UTC
Na, not so sure, if this string goes directly in a DB query, it could also be the DB's fault...which could maybe circumvented "somehow"..
Comment 4 Thiago Macieira 2005-11-17 17:41:02 UTC
Right, you have to tell the DB which encoding to use.

MySQL has a lot of combinations of encoding and sorting possibilities. What is your DB configured to?

I don't remember if the setting can be done per database or table.
Comment 5 Thomas Herzog 2005-11-17 22:05:17 UTC
OK, seems like I found it:

A collation can be defined to be used in SELECT statements:

http://dev.mysql.com/doc/refman/4.1/en/case-sensitivity.html

It seems, that amaroK should make the collation configurable which should be used when querying MySQL.

Thus, one should be able to set a collation in the options, which will then be used by amaroK via the COLLATE option in SQL.

Available collations are:
mysql> SHOW COLLATION;
+----------------------+----------+-----+---------+----------+---------+
| Collation            | Charset  | Id  | Default | Compiled | Sortlen |
+----------------------+----------+-----+---------+----------+---------+
| big5_chinese_ci      | big5     |   1 | Yes     | Yes      |       1 |
| big5_bin             | big5     |  84 |         | Yes      |       1 |
| dec8_swedish_ci      | dec8     |   3 | Yes     |          |       0 |
| dec8_bin             | dec8     |  69 |         |          |       0 |
| cp850_general_ci     | cp850    |   4 | Yes     |          |       0 |
| cp850_bin            | cp850    |  80 |         |          |       0 |
| hp8_english_ci       | hp8      |   6 | Yes     |          |       0 |
| hp8_bin              | hp8      |  72 |         |          |       0 |
| koi8r_general_ci     | koi8r    |   7 | Yes     |          |       0 |
| koi8r_bin            | koi8r    |  74 |         |          |       0 |
| latin1_german1_ci    | latin1   |   5 |         |          |       0 |
| latin1_swedish_ci    | latin1   |   8 | Yes     | Yes      |       1 |
| latin1_danish_ci     | latin1   |  15 |         |          |       0 |
| latin1_german2_ci    | latin1   |  31 |         | Yes      |       2 |
| latin1_bin           | latin1   |  47 |         | Yes      |       1 |
| latin1_general_ci    | latin1   |  48 |         |          |       0 |
| latin1_general_cs    | latin1   |  49 |         |          |       0 |
| latin1_spanish_ci    | latin1   |  94 |         |          |       0 |
| latin2_czech_cs      | latin2   |   2 |         | Yes      |       4 |
| latin2_general_ci    | latin2   |   9 | Yes     |          |       0 |
| latin2_hungarian_ci  | latin2   |  21 |         |          |       0 |
| latin2_croatian_ci   | latin2   |  27 |         |          |       0 |
| latin2_bin           | latin2   |  77 |         |          |       0 |
| swe7_swedish_ci      | swe7     |  10 | Yes     |          |       0 |
| swe7_bin             | swe7     |  82 |         |          |       0 |
| ascii_general_ci     | ascii    |  11 | Yes     |          |       0 |
| ascii_bin            | ascii    |  65 |         |          |       0 |
| ujis_japanese_ci     | ujis     |  12 | Yes     | Yes      |       1 |
| ujis_bin             | ujis     |  91 |         | Yes      |       1 |
| sjis_japanese_ci     | sjis     |  13 | Yes     | Yes      |       1 |
| sjis_bin             | sjis     |  88 |         | Yes      |       1 |
| hebrew_general_ci    | hebrew   |  16 | Yes     |          |       0 |
| hebrew_bin           | hebrew   |  71 |         |          |       0 |
| tis620_thai_ci       | tis620   |  18 | Yes     | Yes      |       4 |
| tis620_bin           | tis620   |  89 |         | Yes      |       1 |
| euckr_korean_ci      | euckr    |  19 | Yes     | Yes      |       1 |
| euckr_bin            | euckr    |  85 |         | Yes      |       1 |
| koi8u_general_ci     | koi8u    |  22 | Yes     |          |       0 |
| koi8u_bin            | koi8u    |  75 |         |          |       0 |
| gb2312_chinese_ci    | gb2312   |  24 | Yes     | Yes      |       1 |
| gb2312_bin           | gb2312   |  86 |         | Yes      |       1 |
| greek_general_ci     | greek    |  25 | Yes     |          |       0 |
| greek_bin            | greek    |  70 |         |          |       0 |
| cp1250_general_ci    | cp1250   |  26 | Yes     |          |       0 |
| cp1250_czech_cs      | cp1250   |  34 |         | Yes      |       2 |
| cp1250_croatian_ci   | cp1250   |  44 |         |          |       0 |
| cp1250_bin           | cp1250   |  66 |         |          |       0 |
| gbk_chinese_ci       | gbk      |  28 | Yes     | Yes      |       1 |
| gbk_bin              | gbk      |  87 |         | Yes      |       1 |
| latin5_turkish_ci    | latin5   |  30 | Yes     |          |       0 |
| latin5_bin           | latin5   |  78 |         |          |       0 |
| armscii8_general_ci  | armscii8 |  32 | Yes     |          |       0 |
| armscii8_bin         | armscii8 |  64 |         |          |       0 |
| utf8_general_ci      | utf8     |  33 | Yes     | Yes      |       1 |
| utf8_bin             | utf8     |  83 |         | Yes      |       1 |
| utf8_unicode_ci      | utf8     | 192 |         | Yes      |       8 |
| utf8_icelandic_ci    | utf8     | 193 |         | Yes      |       8 |
| utf8_latvian_ci      | utf8     | 194 |         | Yes      |       8 |
| utf8_romanian_ci     | utf8     | 195 |         | Yes      |       8 |
| utf8_slovenian_ci    | utf8     | 196 |         | Yes      |       8 |
| utf8_polish_ci       | utf8     | 197 |         | Yes      |       8 |
| utf8_estonian_ci     | utf8     | 198 |         | Yes      |       8 |
| utf8_spanish_ci      | utf8     | 199 |         | Yes      |       8 |
| utf8_swedish_ci      | utf8     | 200 |         | Yes      |       8 |
| utf8_turkish_ci      | utf8     | 201 |         | Yes      |       8 |
| utf8_czech_ci        | utf8     | 202 |         | Yes      |       8 |
| utf8_danish_ci       | utf8     | 203 |         | Yes      |       8 |
| utf8_lithuanian_ci   | utf8     | 204 |         | Yes      |       8 |
| utf8_slovak_ci       | utf8     | 205 |         | Yes      |       8 |
| utf8_spanish2_ci     | utf8     | 206 |         | Yes      |       8 |
| utf8_roman_ci        | utf8     | 207 |         | Yes      |       8 |
| utf8_persian_ci      | utf8     | 208 |         | Yes      |       8 |
| ucs2_general_ci      | ucs2     |  35 | Yes     | Yes      |       1 |
| ucs2_bin             | ucs2     |  90 |         | Yes      |       1 |
| ucs2_unicode_ci      | ucs2     | 128 |         | Yes      |       8 |
| ucs2_icelandic_ci    | ucs2     | 129 |         | Yes      |       8 |
| ucs2_latvian_ci      | ucs2     | 130 |         | Yes      |       8 |
| ucs2_romanian_ci     | ucs2     | 131 |         | Yes      |       8 |
| ucs2_slovenian_ci    | ucs2     | 132 |         | Yes      |       8 |
| ucs2_polish_ci       | ucs2     | 133 |         | Yes      |       8 |
| ucs2_estonian_ci     | ucs2     | 134 |         | Yes      |       8 |
| ucs2_spanish_ci      | ucs2     | 135 |         | Yes      |       8 |
| ucs2_swedish_ci      | ucs2     | 136 |         | Yes      |       8 |
| ucs2_turkish_ci      | ucs2     | 137 |         | Yes      |       8 |
| ucs2_czech_ci        | ucs2     | 138 |         | Yes      |       8 |
| ucs2_danish_ci       | ucs2     | 139 |         | Yes      |       8 |
| ucs2_lithuanian_ci   | ucs2     | 140 |         | Yes      |       8 |
| ucs2_slovak_ci       | ucs2     | 141 |         | Yes      |       8 |
| ucs2_spanish2_ci     | ucs2     | 142 |         | Yes      |       8 |
| ucs2_roman_ci        | ucs2     | 143 |         | Yes      |       8 |
| ucs2_persian_ci      | ucs2     | 144 |         | Yes      |       8 |
| cp866_general_ci     | cp866    |  36 | Yes     |          |       0 |
| cp866_bin            | cp866    |  68 |         |          |       0 |
| keybcs2_general_ci   | keybcs2  |  37 | Yes     |          |       0 |
| keybcs2_bin          | keybcs2  |  73 |         |          |       0 |
| macce_general_ci     | macce    |  38 | Yes     |          |       0 |
| macce_bin            | macce    |  43 |         |          |       0 |
| macroman_general_ci  | macroman |  39 | Yes     |          |       0 |
| macroman_bin         | macroman |  53 |         |          |       0 |
| cp852_general_ci     | cp852    |  40 | Yes     |          |       0 |
| cp852_bin            | cp852    |  81 |         |          |       0 |
| latin7_estonian_cs   | latin7   |  20 |         |          |       0 |
| latin7_general_ci    | latin7   |  41 | Yes     |          |       0 |
| latin7_general_cs    | latin7   |  42 |         |          |       0 |
| latin7_bin           | latin7   |  79 |         |          |       0 |
| cp1251_bulgarian_ci  | cp1251   |  14 |         |          |       0 |
| cp1251_ukrainian_ci  | cp1251   |  23 |         |          |       0 |
| cp1251_bin           | cp1251   |  50 |         |          |       0 |
| cp1251_general_ci    | cp1251   |  51 | Yes     |          |       0 |
| cp1251_general_cs    | cp1251   |  52 |         |          |       0 |
| cp1256_general_ci    | cp1256   |  57 | Yes     |          |       0 |
| cp1256_bin           | cp1256   |  67 |         |          |       0 |
| cp1257_lithuanian_ci | cp1257   |  29 |         |          |       0 |
| cp1257_bin           | cp1257   |  58 |         |          |       0 |
| cp1257_general_ci    | cp1257   |  59 | Yes     |          |       0 |
| binary               | binary   |  63 | Yes     | Yes      |       1 |
| geostd8_general_ci   | geostd8  |  92 | Yes     |          |       0 |
| geostd8_bin          | geostd8  |  93 |         |          |       0 |
| cp932_japanese_ci    | cp932    |  95 | Yes     | Yes      |       1 |
| cp932_bin            | cp932    |  96 |         | Yes      |       1 |
+----------------------+----------+-----+---------+----------+---------+

Note the collations "_ci", which are meant to be case-insensitive.

So I think, if amaroK saves as UTF8, the collations utf8_XY_ci and the collations latin1_XY_ci should be at least selectable.

This defines sorting values for all characters of the respective language, thus ü and Ü and ä and Ä and ö and Ö should have the same value -> will be both returned by the LIKE operator...

Would be great, if the devs could include such a configuration fpr MySQL.

It should also apply to PostgreSQL as well!
Comment 6 Thomas Herzog 2005-11-17 22:13:19 UTC
So, detailed analysis follows:

##################################

mysql> select * from artist;
+----+---------------------+
| id | name                |
+----+---------------------+
|  1 | James Newton Howard |
|  2 | Depeche Mode        |
|  3 | Die Ärzte          |
|  4 | H.I.M               |
|  5 | Iron Maiden         |
|  6 | Klaus Badelt        |
|  7 | Nickelback          |
|  8 | Rammstein           |
|  9 | Seeed               |
| 10 | The Bloodhound Gang |
| 11 | The Rolling Stones  |
+----+---------------------+
11 rows in set (0.03 sec)

mysql> select * from artist where name LIKE '%ä%';
Empty set (0.00 sec)

mysql> select * from artist where name LIKE '%Ä%';
+----+------------+
| id | name       |
+----+------------+
|  3 | Die Ärzte |
+----+------------+
1 row in set (0.00 sec)


###############


But reselecting and defining a COLLATION simply does not work correctly. Maybe some friggin MySQL bug *shame on them*

But there is maybe a simple solution to this:

- save a correctly lowercased expression in the database for the string fields?

- do a SELECT with both uppercase and lowercase expressions (will fail, if there are MIXED Cases for different characters in the string)

Or has someone been able to get its MySQL doing the above query with "ä" and "Ä" correctly?
Comment 7 Thomas Herzog 2005-11-17 22:38:54 UTC
Jesus, amaroK could encode strings e.g. in HTML encoding?

So encode the "Umlaute" with &Auml; or &auml;

Any DB should be able to use these values in strings without Problems.

Don't know, if this imposes a restriction in available chars for asian languages?
Comment 8 Thiago Macieira 2005-11-18 03:46:15 UTC
HTML entities won't solve here.
Comment 9 Thiago Macieira 2005-11-21 03:40:42 UTC
*** Bug 115684 has been marked as a duplicate of this bug. ***
Comment 10 Thiago Macieira 2005-11-21 03:41:50 UTC
According to bug 104610, the problem with the MySQL backend is solved.

For other backends, I think you should file wishlists with the backend authors to have the set up proper case insensitive lookups.
Comment 11 Thomas Herzog 2005-11-21 15:57:27 UTC
Hm, the "bug solved" on Bug 104610  is from 09/23/2005

I can reproduce the Bug on amaroK 1.3.6 release 1, dating 10/08/2005...

This hardly qualifies as "fixed" ?!?!
Comment 12 Paul Betts 2005-11-21 16:35:48 UTC
This bug still exists using the Sqlite engine, is this a bug with the way AmaroK handles Sqlite, or is Sqlite just broken with regards to this issue?
Comment 13 Jeff Mitchell 2005-12-05 18:06:18 UTC
Unfortunately this is a SQLite issue.  From sqlite.org, and note the section in parentheses marked "a bug":

"The LIKE operator is not case sensitive and will match upper case characters on one side against lower case characters on the other. (A bug: SQLite only understands upper/lower case for 7-bit Latin characters. Hence the LIKE operator is case sensitive for 8-bit iso8859 characters or UTF-8 characters. For example, the expression 'a' LIKE 'A' is TRUE but 'æ' LIKE 'Æ' is FALSE.)."

This means that unless items are stored (incorrectly) as all upper or lowercase in the database, SQLite cannot handle the type of behavior desired and reported in this bug.

If this is fixed in SQLite it can be fixed in amaroK; until then you'll need to use something else (i.e. MySQL) if you want the correct behavior.  Also, you can pester the SQLite guys to fix it.