KDE Bug Tracking System
Bug 133170: Table names miss international characters
Product: kexi
Component: General
Status: RESOLVED
Resolution: FIXED
Target: ---
Version: 1.0.2 (KOffice 1.5.2)
Priority: NOR
Severity: normal
Votes: 0
Description:
Opened: 2006-08-29 09:58
Last Changed: 2007-01-03 12:00:05
Version: 1.0.2 (KOffice 1.5.2) (using KDE 3.5.1 Level "a", SUSE 10.1)
Compiler: Target: i586-suse-linux
OS: Linux (i686) release 2.6.16.21-0.13-default

If I import an MS Access DB into Kexi, all tables and data get imported correctly (very good job!). However, the i18n characters in the table names are replaced with '_', for instance "Úvěry" -> "_v_ry", which makes the names hard to read.
Comment #1: Jaroslaw Staniek, 2006-08-29 11:03:36
Kexi handles names in a special way. Every object (tables, queries, ...) has two properties:
- name (only Latin letters, starting with a letter or the _ sign)
- caption (any combination of international letters)
On import, your international names are imported as captions, but Kexi displays names in the Project navigator. Names are created automatically in the following way: Ú becomes u, and so on.
Could you please attach a file with all national characters of your language(s), and tell me which language(s) they are, so I can use this information in Kexi? The file should be encoded in UTF-8, e.g. using Kate. Please include both small and capital letters.
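The name rule from comment #1 can be sketched as a small check. This is only an illustration, not Kexi code: the function name isValidObjectName is hypothetical, "Latin letters" is assumed to mean ASCII a-z and A-Z, and digits are assumed to be allowed after the first character (the patch later in this report prefixes a leading digit with '_', which suggests as much).

```cpp
#include <string>

// Hypothetical helper, not part of Kexi: checks the "name" rule described
// in comment #1. Assumptions: letters are ASCII only, and digits are
// accepted anywhere except in the first position.
bool isValidObjectName(const std::string& name)
{
    auto isLetter = [](char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    };
    auto isDigit = [](char ch) { return ch >= '0' && ch <= '9'; };

    if (name.empty())
        return false;
    // The first character must be a letter or an underscore.
    if (!isLetter(name[0]) && name[0] != '_')
        return false;
    // Every character must be a letter, a digit or an underscore.
    for (char ch : name) {
        if (!isLetter(ch) && !isDigit(ch) && ch != '_')
            return false;
    }
    return true;
}
```

Under these assumptions "_v_ry" is a valid name while "Úvěry" is not, which is why the original table name can only be kept as a caption.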
Comment #2: Jaroslaw Staniek, 2006-08-29 11:07:03
The code containing the character mappings:
http://websvn.kde.org/branches/koffice/1.6/koffice/kexi/kexiutils/identifier.cpp?rev=522032
If you send the mappings in the form "ś", "s", and so on, it will be easier for me to apply them to the code.
Comment #3: Michal Svec, 2006-08-29 13:59:31
OK, the language is Czech (cs_CZ) and here should be a complete list (at least according to http://cs.wikipedia.org/wiki/Unicode):
"Á", "A", "Č", "C", "Ď", "D", "É", "E", "Ě", "E", "Í", "I", "Ň", "N", "Ó", "O",
"Ř", "R", "Š", "S", "Ť", "T", "Ú", "U", "Ů", "U", "Ý", "Y", "Ž", "Z",
"á", "a", "č", "c", "ď", "d", "é", "e", "ě", "e", "í", "i", "ň", "n", "ó", "o",
"ř", "r", "š", "s", "ť", "t", "ú", "u", "ů", "u", "ý", "y", "ž", "z",
Comment #4: Michal Svec, 2006-08-29 14:01:29
However, a more general approach might be usable, like generating the alternatives automatically for some significant subset of UTF-8. I'm not sure how, though :)
Comment #5: Jaroslaw Staniek, 2006-08-29 14:27:50
Re "generating the alternatives automatically for some significant subset of utf-8": I tried to find such a solution, with no luck. I plan to add one improvement though: defining the mappings in a config file, so the user can alter them without needing to upgrade the software. This will be available in 1.1.
Comment #6: Michal Svec, 2006-08-29 14:55:45
Created an attachment (id=17548): utf8 with ascii mappings
I tried to create such a mapping with a combination of bash, printf and recode. Its format is 'hexcode; utf8 char; ascii equivalent'. It looks OK for some chars, but it would be good to select only some reasonable subset.
Comment #7: Jaroslaw Staniek, 2006-08-29 15:15:08
Oh, that's nice. What exactly is the command you used to generate that? We may need further filtering, as there are mappings like 0108;Ĉ;^C where it should be Ĉ -> C (^ is not allowed).
Comment #8: Michal Svec, 2006-08-29 16:24:43
Basically I did:
  /usr/bin/printf "\uXXXX\n" on all 0000-FFFF
  recode -f utf-8..flat
I haven't checked all FFFF lines though :)
Comment #9: Jaroslaw Staniek, 2006-08-29 16:35:25
It would be interesting to dig into recode's source code to see how the mapping is implemented...
Comment #10: Michal Svec, 2006-08-30 13:21:56
At first glance, recode is not very easy to read ;-) Maybe the maintainers could give some hints on how to approach this in general. Anyway, I hope the attachment from comment #6 could be useful, possibly with some hand-editing, which I'm not sure how to approach (besides filtering out empty fields).
Comment #11: Jaroslaw Staniek, 2006-08-30 13:38:37
That's true, recode's source code looks obfuscated :) The good thing is that it's LGPL.
My proposals: I like the idea of automatic cleaning of the output by:
- removing empty fields
- removing characters that are not latin1 letters, numbers and '_'
There are special cases that may need care: German umlauts are written with two letters, e.g. A-umlaut is Ae.
If all this can be put in a script, I'd commit it to SVN for completeness. The script should generate an .h file that can be directly included in identifier.cpp.
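The two cleaning steps proposed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script that was eventually committed: the names Mapping and cleanLine are hypothetical, input lines are assumed to have the 'hexcode;utf8 char;ascii' shape quoted in comment #7, and "letters" is narrowed to ASCII letters. Special cases such as the German two-letter umlauts are not handled. It needs C++17 for std::optional.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical record for one cleaned table entry.
struct Mapping {
    unsigned codePoint;  // value of the hex code in the first field
    std::string ascii;   // replacement text after filtering
};

// Parses one "hexcode;char;ascii" line. Returns std::nullopt when the line
// is malformed or when nothing usable remains in the ascii field.
std::optional<Mapping> cleanLine(const std::string& line)
{
    const std::size_t first = line.find(';');
    const std::size_t last = line.rfind(';');
    if (first == std::string::npos || last == first)
        return std::nullopt;  // fewer than two separators

    unsigned codePoint = 0;
    try {
        codePoint = static_cast<unsigned>(
            std::stoul(line.substr(0, first), nullptr, 16));
    } catch (const std::logic_error&) {
        return std::nullopt;  // first field is not a hex number
    }

    // Keep only ASCII letters, digits and '_' (assumed allowed set).
    std::string kept;
    for (char ch : line.substr(last + 1)) {
        const bool ok = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z')
                     || (ch >= '0' && ch <= '9') || ch == '_';
        if (ok)
            kept += ch;
    }
    if (kept.empty())
        return std::nullopt;  // "empty field" case: drop the entry
    return Mapping{codePoint, kept};
}
```

With this filter the line 0108;Ĉ;^C from comment #7 yields the mapping U+0108 -> "C", and a line whose ascii field is blank is dropped.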
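Comment #12: Michal Svec, 2006-08-31 13:22:41
This is roughly what I did (stupid, ugly, but did the trick quickly):

  # 1. List all code points 0000-FFFF as four-digit hex numbers.
  for i in `seq 0 65535`; do printf "%04X\n" $i; done > utf8_1

  # 2. Append the UTF-8 character for each code point ("hex;char"),
  #    or a space when printf cannot produce it.
  while read f; do
    echo -n "$f;"
    { /usr/bin/printf "\u${f}\n" 2>&- || echo " "; }
  done < utf8_1 > utf8_2

  # 3. Append recode's flat (ASCII) rendering of the character ("hex;char;ascii").
  while read f; do
    echo -n "$f;"
    echo "$(echo "$f"|cut -d\; -f2|recode -f utf-8..flat)"
  done < utf8_2 > utf8_3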
Comment #13: Jaroslaw Staniek, 2006-08-31 20:14:42
Thanks a lot, Michal. I think I'll include this in version 1.1. Then I'll close the bug.
Comment #14: Michal Svec, 2006-08-31 23:29:21
You're welcome, and thanks for Kexi!
Comment #15: Jaroslaw Staniek, 2007-01-03 12:00:02
SVN commit 619415 by staniek:

Kexi
- use transliteration table generated by a shell script to generate
  identifiers out of unicode characters; plus some adjustments made by hand

BUG: 133170
2.0: merged
CCMAIL: michael@drueing.de
CCMAIL: caslav.ilic@gmx.net
CCMAIL: rebel@atrey.karlin.mff.cuni.cz

 AM            generate_transliteration_table.sh
 M  +25 -169   identifier.cpp
 AM            transliteration_table.h.bz2
 A             transliteration_table.h.patch
 A             transliteration_table.readme
 AM            update_transliteration_table_patch.sh

** branches/koffice/1.6/koffice/kexi/kexiutils/generate_transliteration_table.sh #property svn:executable
   + *

--- branches/koffice/1.6/koffice/kexi/kexiutils/identifier.cpp #619414:619415
@@ -19,8 +19,8 @@
 */
 
 #include "identifier.h"
-#include <kstaticdeleter.h>
-#include <qdict.h>
+#include "transliteration_table.h"
+#include <kdebug.h>
 
 using namespace KexiUtils;
 
@@ -44,189 +44,45 @@
     return fn;
 }
 
-// These are in pairs - first the non-latin character in UTF-8,
-// the second is the latin character(s) to appear in identifiers.
-static const char* string2Identifier_table[] = {
-/* 1. Polish characters */
-"Ą", "A", "Ć", "C", "Ę", "E",
-"Ł", "L", "Ń", "N", "Ó", "O",
-"Ś", "S", "Ź", "Z", "Ż", "Z",
-"ą", "a", "ć", "c", "ę", "e",
-"ł", "l", "ń", "n", "ó", "o",
-"ś", "s", "ź", "z", "ż", "z",
-
-/* 2. The mappings of the german "umlauts" to their 2-letter equivalents:
-      (Michael Drüing <michael at drueing.de>)
-
-   Note that ß->ss is AFAIK not always correct transliteration, for example
-   "Maße" and "Masse" is different, the first meaning "measurements" (as
-   plural of "Maß" meaning "measurement"), the second meaning "(physical)
-   mass". They're also pronounced dirrefently, the first one is longer, the
-   second one short. */
-/** @todo the above three only appear at the beginning of a word. if the word is in
-   all caps - like in a caption - then the 2-letter equivalents should also be
-   in all caps */
-"Ä", "Ae",
-"Ö", "Oe",
-"Ü", "Ue",
-"ä", "ae",
-"ö", "oe",
-"ü", "ue",
-"ß", "ss",
-
-/* 3. The part of Serbian Cyrillic which is shared with other Cyrillics but
-      that doesn't mean I am sure that eg. Russians or Bulgarians would do the
-      same.
-      (Chusslove Illich <caslav.ilic at gmx.net>) */
-"а", "a",
-"б", "b",
-"в", "v",
-"г", "g",
-"д", "d",
-"е", "e",
-"ж", "z",
-"з", "z",
-"и", "i",
-"к", "k",
-"л", "l",
-"м", "m",
-"н", "n",
-"о", "o",
-"п", "p",
-"р", "r",
-"с", "s",
-"т", "t",
-"у", "u",
-"ф", "f",
-"х", "h",
-"ц", "c",
-"ч", "c",
-"ш", "s",
-"А", "A",
-"Б", "B",
-"В", "V",
-"Г", "G",
-"Д", "D",
-"Е", "E",
-"Ж", "Z",
-"З", "Z",
-"И", "I",
-"К", "K",
-"Л", "L",
-"М", "M",
-"Н", "N",
-"О", "O",
-"П", "P",
-"Р", "R",
-"С", "S",
-"Т", "T",
-"У", "U",
-"Ф", "F",
-"Х", "H",
-"Ц", "C",
-"Ч", "C",
-"Ш", "S",
-// 3.1. The Serbian-specific Cyrillic characters:
-"ђ", "dj",
-"ј", "j",
-"љ", "lj",
-"њ", "nj",
-"ћ", "c",
-"џ", "dz",
-"Ђ", "Dj",
-"Ј", "J",
-"Љ", "Lj",
-"Њ", "Nj",
-"Ћ", "C",
-"Џ", "Dz",
-// 3.2. The non-ASCII Serbian Latin characters:
-"đ", "dj",
-"ž", "z",
-"ć", "c",
-"č", "c",
-"š", "s",
-"Đ", "Dj",
-"Ž", "Z",
-"Ć", "C",
-"Č", "C",
-"Š", "S",
-// 4. Czech characters (cs_CZ, Michal Svec)
- "Á", "A",
- "Č", "C",
- "Ď", "D",
- "É", "E",
- "Ě", "E",
- "Í", "I",
- "Ň", "N",
- "Ó", "O",
- "Ř", "R",
- "Š", "S",
- "Ť", "T",
- "Ú", "U",
- "Ů", "U",
- "Ý", "Y",
- "Ž", "Z",
- "á", "a",
- "č", "c",
- "ď", "d",
- "é", "e",
- "ě", "e",
- "í", "i",
- "ň", "n",
- "ó", "o",
- "ř", "r",
- "š", "s",
- "ť", "t",
- "ú", "u",
- "ů", "u",
- "ý", "y",
- "ž", "z",
-// END.
-0
-};
-
-//! used for O(1) character transformations in char2Identifier()
-static KStaticDeleter< QDict<QCString> > string2Identifier_deleter;
-static QDict<QCString>* string2Identifier_dict = 0;
-
 inline QString char2Identifier(const QChar& c)
 {
-    if ((c>='a' && c<='z') || (c>='A' && c<='Z') || (c>='0' && c<='9') || c=='_')
-        return QString(c);
-    else {
-        if (!string2Identifier_dict) {
-            //build dictionary for later use
-            string2Identifier_deleter.setObject( string2Identifier_dict, new QDict<QCString>(1009) );
-            string2Identifier_dict->setAutoDelete(true);
-            for (const char **p = string2Identifier_table; *p; p+=2) {
-                string2Identifier_dict->replace( /* replace, not insert because there may be duplicates */
-                    QString::fromUtf8(*p), new QCString(*(p+1)) );
-            }
-        }
-        const QCString *fixedChar = string2Identifier_dict->find(c);
-        if (fixedChar)
-            return *fixedChar;
-    }
-    return QString(QChar('_'));
+    kdDebug() << c << ": " << QString("0x%1").arg(c.unicode(),0,16) << endl;
+
+    if (c.unicode() >= TRANSLITERATION_TABLE_SIZE)
+        return QString(QChar('_'));
+    const char *const s = transliteration_table[c.unicode()];
+    return s ? QString::fromLatin1(s) : QString(QChar('_'));
 }
 
 QString KexiUtils::string2Identifier(const QString &s)
 {
+    if (s.isEmpty())
+        return QString::null;
     QString r, id = s.simplifyWhiteSpace();
     if (id.isEmpty())
-        return id;
+        return QString::null;
     r.reserve(id.length());
-//    return "_";
     id.replace(' ',"_");
     QChar c = id[0];
+    QString add;
+    bool wasUnderscore = false;
 
     if (c>='0' && c<='9') {
         r+='_';
         r+=c;
-    } else
-        r+=char2Identifier(c);
+    } else {
+        add = char2Identifier(c);
+        r+=add;
+        wasUnderscore = add == "_";
+    }
 
-    for (uint i=1; i<id.length(); i++)
-        r+=char2Identifier(id.at(i));
+    for (uint i=1; i<id.length(); i++) {
+        add = char2Identifier(id.at(i));
+        if (wasUnderscore && add == "_")
+            continue;
+        wasUnderscore = add == "_";
+        r+=add;
+    }
 
     return r;
 }

** branches/koffice/1.6/koffice/kexi/kexiutils/transliteration_table.h.bz2 #property svn:mime-type
   + application/octet-stream

** branches/koffice/1.6/koffice/kexi/kexiutils/update_transliteration_table_patch.sh #property svn:executable
   + *
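
The behaviour of the patched string2Identifier() in the commit above can be tried out with a Qt-free sketch. It is an approximation, not the committed code: the names kTransliteration, charToIdentifier and toIdentifier are hypothetical, the small map stands in for the generated transliteration_table.h (whose contents are not shown in this report), ASCII letters, digits and '_' are assumed to map to themselves, and only space, tab and newline are treated as whitespace.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the generated table; only a few entries,
// taken from the mappings listed earlier in this report.
static const std::unordered_map<char32_t, std::string> kTransliteration = {
    {U'\u00DA', "U"},   // Ú (Czech)
    {U'\u011B', "e"},   // ě (Czech)
    {U'\u00FC', "ue"},  // ü (German two-letter form)
    {U'\u00DF', "ss"},  // ß
};

// Maps one code point to its identifier text; unknown characters become "_".
std::string charToIdentifier(char32_t c)
{
    if ((c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z')
        || (c >= U'0' && c <= U'9') || c == U'_')
        return std::string(1, static_cast<char>(c));
    const auto it = kTransliteration.find(c);
    return it != kTransliteration.end() ? it->second : std::string("_");
}

std::string toIdentifier(const std::u32string& input)
{
    // Trim the input and turn each inner whitespace run into a single '_'.
    std::u32string id;
    bool pendingSpace = false;
    for (char32_t c : input) {
        if (c == U' ' || c == U'\t' || c == U'\n') {
            pendingSpace = !id.empty();
        } else {
            if (pendingSpace)
                id += U'_';
            pendingSpace = false;
            id += c;
        }
    }
    if (id.empty())
        return std::string();

    std::string r;
    bool wasUnderscore = false;
    for (std::size_t i = 0; i < id.size(); ++i) {
        // A leading digit is prefixed with '_' so the name starts legally.
        if (i == 0 && id[0] >= U'0' && id[0] <= U'9') {
            r += '_';
            r += static_cast<char>(id[0]);
            continue;
        }
        const std::string add = charToIdentifier(id[i]);
        // Skip an underscore that directly follows another one.
        if (wasUnderscore && add == "_")
            continue;
        wasUnderscore = (add == "_");
        r += add;
    }
    return r;
}
```

With the stand-in table, the table name from the original report, "Úvěry", becomes "Uvery" instead of "_v_ry", and a run of unmapped characters collapses into a single underscore.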
Platform: unspecified
OS: Linux
Keywords:
People
Reporter: Michal Svec
Assigned To: Jaroslaw Staniek
Attachments
utf8 with ascii mappings (632.01 KB, text/plain), 2006-08-29 14:55, Michal Svec