Bug 188187 - Use UTF-8 charset in MySQL database instead of Latin_1
Status: RESOLVED INTENTIONAL
Alias: None
Product: Akonadi
Classification: Frameworks and Libraries
Component: server
Version: unspecified
Platform: Gentoo Packages Linux
Importance: NOR wishlist
Target Milestone: ---
Assignee: Volker Krause
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-03-26 19:48 UTC by rene
Modified: 2009-03-30 22:10 UTC

See Also:
Latest Commit:
Version Fixed In:


Description rene 2009-03-26 19:48:22 UTC
Version:           akonadi-server-1.1.1/akonadi-4.2.1 (using KDE 4.2.1)
Compiler:          gcc version 4.1.2 (Gentoo 4.1.2 p1.3) 
OS:                Linux
Installed from:    Gentoo Packages

The (KDE local) Akonadi MySQL tables use the Latin_1 charset.

This limits the use of KDE-PIM/Akonadi to Western European languages.

Please use UTF-8 with utf8_general_ci collation instead.


KDE is great!

Renne
Comment 1 Volker Krause 2009-03-28 16:03:53 UTC
Where do you see actual bugs caused by this and how do you trigger them?

We should already have unit tests that cover non-latin1 strings for all fields that can contain them, and those tests pass here.
Comment 2 rene 2009-03-30 19:22:11 UTC
I have not found an actual bug, but I noticed this while trying to figure out the structure of the database. It is not a bug but a design error.

Adding unit tests for Latin_1 will not solve the problem. If you want to store, e.g., Asian characters in a Latin_1 database, you'll get garbled characters anyway.

The character set of Latin_1 is much smaller than that of Unicode, so you will lose information whenever you convert Unicode text to Latin_1. Using Latin_1 in a backend makes KDE unusable for all non-Western-European users.
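
To make the information-loss argument concrete, here is a minimal sketch in plain Python. It only illustrates the encodings themselves, not MySQL or Akonadi: what the server actually does with such text depends on the connection charset and SQL mode. The helper name `to_latin1` and the sample strings are made up for illustration.

```python
def to_latin1(text: str) -> bytes:
    """Encode text as Latin-1, substituting '?' for anything it cannot represent."""
    return text.encode("latin-1", errors="replace")


western = "Gr\u00fc\u00dfe"       # "Grüße": every character exists in Latin-1
asian = "\u65e5\u672c\u8a9e"      # "日本語": none of these exist in Latin-1

# Western European text survives the round trip unchanged.
print(to_latin1(western).decode("latin-1") == western)

# The Japanese text is reduced to placeholder characters; the original is gone.
print(to_latin1(asian))
```

Once the text has been reduced to placeholders, there is no way to recover the original characters from the stored bytes.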

Unicode was created to solve charset problems. Windows and Java use UTF-16, which needs two 16-bit units (a surrogate pair) for characters outside the Basic Multilingual Plane, including some Asian characters. UTF-32 covers nearly all known characters on this planet (even hieroglyphs, mathematical and musical notation) with one unit each, but needs four bytes for each character. Because of that, UNIX systems are switching to UTF-8, which is a variable-length encoding of the same Unicode code points. The IETF requires all new protocols to support UTF-8 (RFC 2277).
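
The size trade-off between the Unicode encodings can be checked with a short Python sketch. The function name `encoded_sizes` and the sample characters are chosen only for illustration; the little-endian codec variants are used so that no byte-order mark is counted.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character occupies in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le variant: no BOM added
        "utf-32": len(ch.encode("utf-32-le")),
    }


samples = {
    "LATIN SMALL LETTER A": "a",
    "LATIN SMALL LETTER U WITH DIAERESIS": "\u00fc",
    "CJK IDEOGRAPH U+8A9E": "\u8a9e",
    "EGYPTIAN HIEROGLYPH A001": "\U00013000",  # outside the Basic Multilingual Plane
}

for name, ch in samples.items():
    print(name, encoded_sizes(ch))
```

UTF-32 is always four bytes per character, while UTF-8 ranges from one byte for ASCII to four bytes for characters outside the Basic Multilingual Plane.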

Because of that, KDE should require the global use of UTF-8 in its programming guidelines.

See http://en.wikipedia.org/wiki/UTF-8 for further information.
Comment 3 Volker Krause 2009-03-30 22:10:25 UTC
I know what Unicode and UTF-8 are. KDE actually mandates their use for user-visible strings. And if you check the database schema closely, you will see that columns containing such data (such as CollectionTable.name) do in fact use UTF-8 encoding.

The remaining columns, however, contain internal data which cannot contain non-ASCII characters (e.g. MIME types). Using the (slightly slower) UTF-8 encoding is thus not needed there; Latin1 does the job just fine.
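
The point about ASCII-only columns can be illustrated with a small Python sketch: for pure ASCII text, Latin-1 and UTF-8 produce byte-for-byte identical output, so nothing is lost by storing it as Latin-1. The helper `same_bytes_in_both` and the sample values are hypothetical and are not taken from the Akonadi schema.

```python
def same_bytes_in_both(text: str) -> bool:
    """True if text encodes to identical bytes in Latin-1 and UTF-8."""
    try:
        return text.encode("latin-1") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains characters Latin-1 cannot represent at all.
        return False


print(same_bytes_in_both("message/rfc822"))        # ASCII-only MIME type
print(same_bytes_in_both("Gr\u00fc\u00dfe"))       # Latin-1 text, but multi-byte in UTF-8
print(same_bytes_in_both("\u65e5\u672c\u8a9e"))    # not representable in Latin-1
```

Only the first case, the ASCII-only one, is encoding-neutral, which matches the split described above between internal columns and user-visible ones.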

So, unless there are real bugs, I would rather not change anything there and risk doing more harm than good.