116458 – Non-ASCII strings in Media Library: lowercase/uppercase conversion not working

Bug 116458 - Non-ASCII strings in Media Library: lowercase/uppercase conversion not working

Summary: Non-ASCII strings in Media Library: lowercase/uppercase conversion not working

Status:	RESOLVED INTENTIONAL

Alias:	None

Product:	amarok
Classification:	Applications
Component:	general (show other bugs)
Version:	1.3.6
Platform:	unspecified Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Amarok Developers

URL:
Keywords:

Duplicates (1):	115684 (view as bug list)
Depends on:
Blocks:

Reported:	2005-11-16 00:24 UTC by Thomas Herzog
Modified:	2006-11-12 04:01 UTC (History)
CC List:	2 users (show)

See Also:
Latest Commit:
Version Fixed In:

Attachments
Add an attachment

Note You need to log in before you can comment on or make changes to this bug.

Description Thomas Herzog 2005-11-16 00:24:24 UTC

Version:           1.3.6 (using KDE KDE 3.5.0)
Installed from:    Unlisted Binary Package

Hi,

I am currently using amaroK 1.3.6 "Guru's binaries" on SuSE 10 with KDE 3.5 RC1.

Having amaroK configured with mySQL, it will NOT filter Non-Ascii characters correctly.

For example:

If I enter "die ärzte" in Media Library searchbar, it will NOT show the group "Die Ärzte".

I can reduce this to the fact that seemingly amaroK does NOT do correct uppercase/lowercase conversion for Non-ASCII characters.

This can be regarded as a severe bug, as this issues affects users all around the world.

Comment 1 Roland Wolters 2005-11-17 16:42:10 UTC

Can confirm this here: Amarok should realize that these two letters are just uppercase and lowercase, but it treats them as if they were different letters.

System:
Suse Linux 10.0, KDE 3.5rc1

Comment 2 Thiago Macieira 2005-11-17 16:57:13 UTC

In other words, amaroK should use QString::localeAwareCompare.

Comment 3 Thomas Herzog 2005-11-17 17:29:20 UTC

Na, not so sure, if this string goes directly in a DB query, it could also be the DB's fault...which could maybe circumvented "somehow"..

Comment 4 Thiago Macieira 2005-11-17 17:41:02 UTC

Right, you have to tell the DB which encoding to use.

MySQL has a lot of combinations of encoding and sorting possibilities. What is your DB configured to?

I don't remember if the setting can be done per database or table.

Comment 5 Thomas Herzog 2005-11-17 22:05:17 UTC

OK, seems like I found it:

A collation can be defined to be used in SELECT statements:

http://dev.mysql.com/doc/refman/4.1/en/case-sensitivity.html

It seems, that amaroK should make the collation configurable which should be used when querying MySQL.

Thus, one should be able to set a collation in the options, which will then be used by amaroK via the COLLATE option in SQL.

Available collations are:
mysql> SHOW COLLATION;
+----------------------+----------+-----+---------+----------+---------+
| Collation            | Charset  | Id  | Default | Compiled | Sortlen |
+----------------------+----------+-----+---------+----------+---------+
| big5_chinese_ci      | big5     |   1 | Yes     | Yes      |       1 |
| big5_bin             | big5     |  84 |         | Yes      |       1 |
| dec8_swedish_ci      | dec8     |   3 | Yes     |          |       0 |
| dec8_bin             | dec8     |  69 |         |          |       0 |
| cp850_general_ci     | cp850    |   4 | Yes     |          |       0 |
| cp850_bin            | cp850    |  80 |         |          |       0 |
| hp8_english_ci       | hp8      |   6 | Yes     |          |       0 |
| hp8_bin              | hp8      |  72 |         |          |       0 |
| koi8r_general_ci     | koi8r    |   7 | Yes     |          |       0 |
| koi8r_bin            | koi8r    |  74 |         |          |       0 |
| latin1_german1_ci    | latin1   |   5 |         |          |       0 |
| latin1_swedish_ci    | latin1   |   8 | Yes     | Yes      |       1 |
| latin1_danish_ci     | latin1   |  15 |         |          |       0 |
| latin1_german2_ci    | latin1   |  31 |         | Yes      |       2 |
| latin1_bin           | latin1   |  47 |         | Yes      |       1 |
| latin1_general_ci    | latin1   |  48 |         |          |       0 |
| latin1_general_cs    | latin1   |  49 |         |          |       0 |
| latin1_spanish_ci    | latin1   |  94 |         |          |       0 |
| latin2_czech_cs      | latin2   |   2 |         | Yes      |       4 |
| latin2_general_ci    | latin2   |   9 | Yes     |          |       0 |
| latin2_hungarian_ci  | latin2   |  21 |         |          |       0 |
| latin2_croatian_ci   | latin2   |  27 |         |          |       0 |
| latin2_bin           | latin2   |  77 |         |          |       0 |
| swe7_swedish_ci      | swe7     |  10 | Yes     |          |       0 |
| swe7_bin             | swe7     |  82 |         |          |       0 |
| ascii_general_ci     | ascii    |  11 | Yes     |          |       0 |
| ascii_bin            | ascii    |  65 |         |          |       0 |
| ujis_japanese_ci     | ujis     |  12 | Yes     | Yes      |       1 |
| ujis_bin             | ujis     |  91 |         | Yes      |       1 |
| sjis_japanese_ci     | sjis     |  13 | Yes     | Yes      |       1 |
| sjis_bin             | sjis     |  88 |         | Yes      |       1 |
| hebrew_general_ci    | hebrew   |  16 | Yes     |          |       0 |
| hebrew_bin           | hebrew   |  71 |         |          |       0 |
| tis620_thai_ci       | tis620   |  18 | Yes     | Yes      |       4 |
| tis620_bin           | tis620   |  89 |         | Yes      |       1 |
| euckr_korean_ci      | euckr    |  19 | Yes     | Yes      |       1 |
| euckr_bin            | euckr    |  85 |         | Yes      |       1 |
| koi8u_general_ci     | koi8u    |  22 | Yes     |          |       0 |
| koi8u_bin            | koi8u    |  75 |         |          |       0 |
| gb2312_chinese_ci    | gb2312   |  24 | Yes     | Yes      |       1 |
| gb2312_bin           | gb2312   |  86 |         | Yes      |       1 |
| greek_general_ci     | greek    |  25 | Yes     |          |       0 |
| greek_bin            | greek    |  70 |         |          |       0 |
| cp1250_general_ci    | cp1250   |  26 | Yes     |          |       0 |
| cp1250_czech_cs      | cp1250   |  34 |         | Yes      |       2 |
| cp1250_croatian_ci   | cp1250   |  44 |         |          |       0 |
| cp1250_bin           | cp1250   |  66 |         |          |       0 |
| gbk_chinese_ci       | gbk      |  28 | Yes     | Yes      |       1 |
| gbk_bin              | gbk      |  87 |         | Yes      |       1 |
| latin5_turkish_ci    | latin5   |  30 | Yes     |          |       0 |
| latin5_bin           | latin5   |  78 |         |          |       0 |
| armscii8_general_ci  | armscii8 |  32 | Yes     |          |       0 |
| armscii8_bin         | armscii8 |  64 |         |          |       0 |
| utf8_general_ci      | utf8     |  33 | Yes     | Yes      |       1 |
| utf8_bin             | utf8     |  83 |         | Yes      |       1 |
| utf8_unicode_ci      | utf8     | 192 |         | Yes      |       8 |
| utf8_icelandic_ci    | utf8     | 193 |         | Yes      |       8 |
| utf8_latvian_ci      | utf8     | 194 |         | Yes      |       8 |
| utf8_romanian_ci     | utf8     | 195 |         | Yes      |       8 |
| utf8_slovenian_ci    | utf8     | 196 |         | Yes      |       8 |
| utf8_polish_ci       | utf8     | 197 |         | Yes      |       8 |
| utf8_estonian_ci     | utf8     | 198 |         | Yes      |       8 |
| utf8_spanish_ci      | utf8     | 199 |         | Yes      |       8 |
| utf8_swedish_ci      | utf8     | 200 |         | Yes      |       8 |
| utf8_turkish_ci      | utf8     | 201 |         | Yes      |       8 |
| utf8_czech_ci        | utf8     | 202 |         | Yes      |       8 |
| utf8_danish_ci       | utf8     | 203 |         | Yes      |       8 |
| utf8_lithuanian_ci   | utf8     | 204 |         | Yes      |       8 |
| utf8_slovak_ci       | utf8     | 205 |         | Yes      |       8 |
| utf8_spanish2_ci     | utf8     | 206 |         | Yes      |       8 |
| utf8_roman_ci        | utf8     | 207 |         | Yes      |       8 |
| utf8_persian_ci      | utf8     | 208 |         | Yes      |       8 |
| ucs2_general_ci      | ucs2     |  35 | Yes     | Yes      |       1 |
| ucs2_bin             | ucs2     |  90 |         | Yes      |       1 |
| ucs2_unicode_ci      | ucs2     | 128 |         | Yes      |       8 |
| ucs2_icelandic_ci    | ucs2     | 129 |         | Yes      |       8 |
| ucs2_latvian_ci      | ucs2     | 130 |         | Yes      |       8 |
| ucs2_romanian_ci     | ucs2     | 131 |         | Yes      |       8 |
| ucs2_slovenian_ci    | ucs2     | 132 |         | Yes      |       8 |
| ucs2_polish_ci       | ucs2     | 133 |         | Yes      |       8 |
| ucs2_estonian_ci     | ucs2     | 134 |         | Yes      |       8 |
| ucs2_spanish_ci      | ucs2     | 135 |         | Yes      |       8 |
| ucs2_swedish_ci      | ucs2     | 136 |         | Yes      |       8 |
| ucs2_turkish_ci      | ucs2     | 137 |         | Yes      |       8 |
| ucs2_czech_ci        | ucs2     | 138 |         | Yes      |       8 |
| ucs2_danish_ci       | ucs2     | 139 |         | Yes      |       8 |
| ucs2_lithuanian_ci   | ucs2     | 140 |         | Yes      |       8 |
| ucs2_slovak_ci       | ucs2     | 141 |         | Yes      |       8 |
| ucs2_spanish2_ci     | ucs2     | 142 |         | Yes      |       8 |
| ucs2_roman_ci        | ucs2     | 143 |         | Yes      |       8 |
| ucs2_persian_ci      | ucs2     | 144 |         | Yes      |       8 |
| cp866_general_ci     | cp866    |  36 | Yes     |          |       0 |
| cp866_bin            | cp866    |  68 |         |          |       0 |
| keybcs2_general_ci   | keybcs2  |  37 | Yes     |          |       0 |
| keybcs2_bin          | keybcs2  |  73 |         |          |       0 |
| macce_general_ci     | macce    |  38 | Yes     |          |       0 |
| macce_bin            | macce    |  43 |         |          |       0 |
| macroman_general_ci  | macroman |  39 | Yes     |          |       0 |
| macroman_bin         | macroman |  53 |         |          |       0 |
| cp852_general_ci     | cp852    |  40 | Yes     |          |       0 |
| cp852_bin            | cp852    |  81 |         |          |       0 |
| latin7_estonian_cs   | latin7   |  20 |         |          |       0 |
| latin7_general_ci    | latin7   |  41 | Yes     |          |       0 |
| latin7_general_cs    | latin7   |  42 |         |          |       0 |
| latin7_bin           | latin7   |  79 |         |          |       0 |
| cp1251_bulgarian_ci  | cp1251   |  14 |         |          |       0 |
| cp1251_ukrainian_ci  | cp1251   |  23 |         |          |       0 |
| cp1251_bin           | cp1251   |  50 |         |          |       0 |
| cp1251_general_ci    | cp1251   |  51 | Yes     |          |       0 |
| cp1251_general_cs    | cp1251   |  52 |         |          |       0 |
| cp1256_general_ci    | cp1256   |  57 | Yes     |          |       0 |
| cp1256_bin           | cp1256   |  67 |         |          |       0 |
| cp1257_lithuanian_ci | cp1257   |  29 |         |          |       0 |
| cp1257_bin           | cp1257   |  58 |         |          |       0 |
| cp1257_general_ci    | cp1257   |  59 | Yes     |          |       0 |
| binary               | binary   |  63 | Yes     | Yes      |       1 |
| geostd8_general_ci   | geostd8  |  92 | Yes     |          |       0 |
| geostd8_bin          | geostd8  |  93 |         |          |       0 |
| cp932_japanese_ci    | cp932    |  95 | Yes     | Yes      |       1 |
| cp932_bin            | cp932    |  96 |         | Yes      |       1 |
+----------------------+----------+-----+---------+----------+---------+

Note the collations "_ci", which are meant to be case-insensitive.

So I think, if amaroK saves as UTF8, the collations utf8_XY_ci and the collations latin1_XY_ci should be at least selectable.

This defines sorting values for all characters of the respective language, thus ü and Ü and ä and Ä and ö and Ö should have the same value -> will be both returned by the LIKE operator...

Would be great, if the devs could include such a configuration fpr MySQL.

It should also apply to PostgreSQL as well!

Comment 6 Thomas Herzog 2005-11-17 22:13:19 UTC

So, detailed analysis follows:

##################################

mysql> select * from artist;
+----+---------------------+
| id | name                |
+----+---------------------+
|  1 | James Newton Howard |
|  2 | Depeche Mode        |
|  3 | Die Ärzte          |
|  4 | H.I.M               |
|  5 | Iron Maiden         |
|  6 | Klaus Badelt        |
|  7 | Nickelback          |
|  8 | Rammstein           |
|  9 | Seeed               |
| 10 | The Bloodhound Gang |
| 11 | The Rolling Stones  |
+----+---------------------+
11 rows in set (0.03 sec)

mysql> select * from artist where name LIKE '%ä%';
Empty set (0.00 sec)

mysql> select * from artist where name LIKE '%Ä%';
+----+------------+
| id | name       |
+----+------------+
|  3 | Die Ärzte |
+----+------------+
1 row in set (0.00 sec)


###############


But reselecting and defining a COLLATION simply does not work correctly. Maybe some friggin MySQL bug *shame on them*

But there is maybe a simple solution to this:

- save a correctly lowercased expression in the database for the string fields?

- do a SELECT with both uppercase and lowercase expressions (will fail, if there are MIXED Cases for different characters in the string)

Or has someone been able to get its MySQL doing the above query with "ä" and "Ä" correctly?

Comment 7 Thomas Herzog 2005-11-17 22:38:54 UTC

Jesus, amaroK could encode strings e.g. in HTML encoding?

So encode the "Umlaute" with &Auml; or &auml;

Any DB should be able to use these values in strings without Problems.

Don't know, if this imposes a restriction in available chars for asian languages?

Comment 8 Thiago Macieira 2005-11-18 03:46:15 UTC

HTML entities won't solve here.

Comment 9 Thiago Macieira 2005-11-21 03:40:42 UTC

*** Bug 115684 has been marked as a duplicate of this bug. ***

Comment 10 Thiago Macieira 2005-11-21 03:41:50 UTC

According to bug 104610, the problem with the MySQL backend is solved.

For other backends, I think you should file wishlists with the backend authors to have the set up proper case insensitive lookups.

Comment 11 Thomas Herzog 2005-11-21 15:57:27 UTC

Hm, the "bug solved" on Bug 104610  is from 09/23/2005

I can reproduce the Bug on amaroK 1.3.6 release 1, dating 10/08/2005...

This hardly qualifies as "fixed" ?!?!

Comment 12 Paul Betts 2005-11-21 16:35:48 UTC

This bug still exists using the Sqlite engine, is this a bug with the way AmaroK handles Sqlite, or is Sqlite just broken with regards to this issue?

Comment 13 Jeff Mitchell 2005-12-05 18:06:18 UTC

Unfortunately this is a SQLite issue.  From sqlite.org, and note the section in parentheses marked "a bug":

"The LIKE operator is not case sensitive and will match upper case characters on one side against lower case characters on the other. (A bug: SQLite only understands upper/lower case for 7-bit Latin characters. Hence the LIKE operator is case sensitive for 8-bit iso8859 characters or UTF-8 characters. For example, the expression 'a' LIKE 'A' is TRUE but 'æ' LIKE 'Æ' is FALSE.)."

This means that unless items are stored (incorrectly) as all upper or lowercase in the database, SQLite cannot handle the type of behavior desired and reported in this bug.

If this is fixed in SQLite it can be fixed in amaroK; until then you'll need to use something else (i.e. MySQL) if you want the correct behavior.  Also, you can pester the SQLite guys to fix it.