Version:   1.0.2 (KOffice 1.5.2) (using KDE 3.5.1 Level "a" , SUSE 10.1)
Compiler:  Target: i586-suse-linux
OS:        Linux (i686) release 2.6.16.21-0.13-default

If I import an MS Access DB into Kexi, all tables and data get imported correctly (very good job!), but the i18n chars in the table names are replaced with '_', for instance "Úvěry" -> "_v_ry", which makes the names hard to read.
Kexi handles names in a special way. Every object (including tables, queries, ...) has two properties:
- name (only Latin letters, starting with a letter or the _ sign)
- caption (any combination of international letters)

On import, your international names are imported as captions, but Kexi displays names within the Project navigator. Names are created automatically in the following way: Ú becomes u, and so on.

Could you please attach here a file with all national characters of your language(s) (and tell me what your languages are), so I can use this information in Kexi? The file should be encoded in UTF-8, e.g. using Kate. Please include both small and capital letters.
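To make the "name" rule above concrete, here is a minimal sketch of a validity check. The function isValidObjectName() is a hypothetical helper, not part of Kexi, and it assumes that digits are also accepted after the first character, which the string2Identifier() code quoted later in this report suggests but the comment above does not state.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not Kexi API): returns true if `name` satisfies the
// rule described above. Assumption: digits are allowed, but not as the
// first character.
bool isValidObjectName(const std::string& name)
{
    if (name.empty())
        return false;
    auto isLatinLetter = [](char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    };
    auto isDigit = [](char c) { return c >= '0' && c <= '9'; };

    // The first character must be a Latin letter or '_'.
    const char first = name[0];
    if (!isLatinLetter(first) && first != '_')
        return false;

    // Every character must be a Latin letter, a digit or '_'.
    for (char c : name) {
        if (!isLatinLetter(c) && !isDigit(c) && c != '_')
            return false;
    }
    return true;
}
```

Under these assumptions "_v_ry" is accepted as a name, while the UTF-8 encoded "Úvěry" is rejected and can only be used as a caption.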
The code containing the character mappings:
http://websvn.kde.org/branches/koffice/1.6/koffice/kexi/kexiutils/identifier.cpp?rev=522032

If you send the mappings in the form "ś", "s", and so on, it will be easier for me to apply them to the code.
OK, the language is Czech (cs_CZ) and here should be a complete list (at least according to http://cs.wikipedia.org/wiki/Unicode):

"Á", "A", "Č", "C", "Ď", "D", "É", "E", "Ě", "E", "Í", "I", "Ň", "N", "Ó", "O", "Ř", "R", "Š", "S", "Ť", "T", "Ú", "U", "Ů", "U", "Ý", "Y", "Ž", "Z",
"á", "a", "č", "c", "ď", "d", "é", "e", "ě", "e", "í", "i", "ň", "n", "ó", "o", "ř", "r", "š", "s", "ť", "t", "ú", "u", "ů", "u", "ý", "y", "ž", "z",
However, a more general approach might be usable, like generating the alternatives automatically for some significant subset of utf-8. I'm not sure how, though :)
Re "generating the alternatives automatically for some significant subset of utf-8" - I tried to find such a solution, with no luck. I plan to add one improvement though: defining the mappings in a config file, so users can alter them without needing to upgrade the software. This will be available in 1.1.
Created an attachment (id=17548)
utf8 with ascii mappings

I tried to create such a mapping with a combination of bash, printf and recode. Its format is 'hexcode; utf8 char; ascii equivalent'. It looks OK for some chars, but it would be good to select only some reasonable subset.
Oh, that's nice. What exactly is the command you used to generate that? We may need further filtering, as there are mappings like 0108;Ĉ;^C where it should be Ĉ -> C (^ is not allowed).
Basically I did:
  /usr/bin/printf "\uXXXX\n" on all 0000-FFFF
  recode -f utf-8..flat

I haven't checked all FFFF lines though :)
It would be interesting to dig into recode's source code to see how the mapping is implemented...
At first glance recode is not very easy to read ;-) Maybe the maintainers could give some hints on how to approach this in general. Anyway, I hope the attachment from comment #6 could be useful, possibly with some hand-editing, which I'm not sure how to approach (besides filtering out empty fields).
That's true, recode's source code looks obfuscated :) The good thing is that it's LGPL.

My proposals: I like the idea of automatically cleaning the output by:
- removing empty fields
- removing characters that are not latin1 letters, numbers and '_'

There are special cases that may need care: German umlauts are written with two letters: A-umlaut is Ae, etc.

If all this can be put in a script I'd commit it to the SVN for completeness. The script should generate an .h file that can be directly included in identifier.cpp.
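The two cleaning steps proposed above could look roughly like the sketch below. It works on one line of the attachment's 'hexcode;utf8 char;ascii equivalent' format. The function names stripDisallowed() and cleanedReplacement() are hypothetical, "latin1 letters" is interpreted here as plain ASCII letters (an assumption), and the German two-letter special cases are not handled.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: keeps only ASCII letters, digits and '_'.
std::string stripDisallowed(const std::string& s)
{
    std::string out;
    for (char c : s) {
        const bool keep = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_';
        if (keep)
            out += c;
    }
    return out;
}

// Hypothetical helper: takes one "hexcode;utf8 char;ascii equivalent" line
// and returns the cleaned replacement. An empty result means the entry
// should be dropped (empty or unusable field, or a malformed line).
std::string cleanedReplacement(const std::string& line)
{
    const std::string::size_type first = line.find(';');
    const std::string::size_type last = line.rfind(';');
    if (first == std::string::npos || first == last)
        return std::string(); // fewer than three fields
    // The replacement is whatever follows the last separator.
    return stripDisallowed(line.substr(last + 1));
}
```

With this filter the problematic entry mentioned earlier, 0108;Ĉ;^C, yields "C", and a line whose third field is empty yields an empty string and would be skipped when writing the .h file.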
This is roughly what I did (stupid, ugly, but did the trick quickly):

  for i in `seq 0 65535`; do printf "%04X\n" $i; done > utf8_1

  while read f; do
    echo -n "$f;"
    { /usr/bin/printf "\u${f}\n" 2>&- || echo " "; }
  done < utf8_1 > utf8_2

  while read f; do
    echo -n "$f;"
    echo "$(echo "$f"|cut -d\; -f2|recode -f utf-8..flat)"
  done < utf8_2 > utf8_3
Thanks a lot, Michal. I think I'll include this in version 1.1. Then I'll close the bug.
You're welcome, and thanks for Kexi!
SVN commit 619415 by staniek:

Kexi - use transliteration table generated by a shell script to generate
identifiers out of unicode characters; plus some adjustments made by hand

BUG: 133170
2.0: merged
CCMAIL:michael@drueing.de
CCMAIL:caslav.ilic@gmx.net
CCMAIL:rebel@atrey.karlin.mff.cuni.cz

 AM            generate_transliteration_table.sh
 M  +25 -169   identifier.cpp
 AM            transliteration_table.h.bz2
 A             transliteration_table.h.patch
 A             transliteration_table.readme
 AM            update_transliteration_table_patch.sh

** branches/koffice/1.6/koffice/kexi/kexiutils/generate_transliteration_table.sh #property svn:executable
   + *

--- branches/koffice/1.6/koffice/kexi/kexiutils/identifier.cpp #619414:619415
@@ -19,8 +19,8 @@
 */
 #include "identifier.h"
-#include <kstaticdeleter.h>
-#include <qdict.h>
+#include "transliteration_table.h"
+#include <kdebug.h>
 using namespace KexiUtils;
@@ -44,189 +44,45 @@
     return fn;
 }
-// These are in pairs - first the non-latin character in UTF-8,
-// the second is the latin character(s) to appear in identifiers.
-static const char* string2Identifier_table[] = {
-/* 1. Polish characters */
-"Ą", "A", "Ć", "C", "Ę", "E",
-"Ł", "L", "Ń", "N", "Ó", "O",
-"Ś", "S", "Ź", "Z", "Ż", "Z",
-"ą", "a", "ć", "c", "ę", "e",
-"ł", "l", "ń", "n", "ó", "o",
-"ś", "s", "ź", "z", "ż", "z",
-
-/* 2. The mappings of the german "umlauts" to their 2-letter equivalents:
-   (Michael Drüing <michael at drueing.de>)
-
-   Note that ß->ss is AFAIK not always correct transliteration, for example
-   "Maße" and "Masse" is different, the first meaning "measurements" (as
-   plural of "Maß" meaning "measurement"), the second meaning "(physical)
-   mass". They're also pronounced dirrefently, the first one is longer, the
-   second one short. */
-/** @todo the above three only appear at the beginning of a word. if the word is in
-   all caps - like in a caption - then the 2-letter equivalents should also be
-   in all caps */
-"Ä", "Ae",
-"Ö", "Oe",
-"Ü", "Ue",
-"ä", "ae",
-"ö", "oe",
-"ü", "ue",
-"ß", "ss",
-
-/* 3. The part of Serbian Cyrillic which is shared with other Cyrillics but
-   that doesn't mean I am sure that eg. Russians or Bulgarians would do the
-   same. (Chusslove Illich <caslav.ilic at gmx.net>) */
-"а", "a",
-"б", "b",
-"в", "v",
-"г", "g",
-"д", "d",
-"е", "e",
-"ж", "z",
-"з", "z",
-"и", "i",
-"к", "k",
-"л", "l",
-"м", "m",
-"н", "n",
-"о", "o",
-"п", "p",
-"р", "r",
-"с", "s",
-"т", "t",
-"у", "u",
-"ф", "f",
-"х", "h",
-"ц", "c",
-"ч", "c",
-"ш", "s",
-"А", "A",
-"Б", "B",
-"В", "V",
-"Г", "G",
-"Д", "D",
-"Е", "E",
-"Ж", "Z",
-"З", "Z",
-"И", "I",
-"К", "K",
-"Л", "L",
-"М", "M",
-"Н", "N",
-"О", "O",
-"П", "P",
-"Р", "R",
-"С", "S",
-"Т", "T",
-"У", "U",
-"Ф", "F",
-"Х", "H",
-"Ц", "C",
-"Ч", "C",
-"Ш", "S",
-// 3.1. The Serbian-specific Cyrillic characters:
-"ђ", "dj",
-"ј", "j",
-"љ", "lj",
-"њ", "nj",
-"ћ", "c",
-"џ", "dz",
-"Ђ", "Dj",
-"Ј", "J",
-"Љ", "Lj",
-"Њ", "Nj",
-"Ћ", "C",
-"Џ", "Dz",
-// 3.2. The non-ASCII Serbian Latin characters:
-"đ", "dj",
-"ž", "z",
-"ć", "c",
-"č", "c",
-"š", "s",
-"Đ", "Dj",
-"Ž", "Z",
-"Ć", "C",
-"Č", "C",
-"Š", "S",
-// 4. Czech characters (cs_CZ, Michal Svec)
- "Á", "A",
- "Č", "C",
- "Ď", "D",
- "É", "E",
- "Ě", "E",
- "Í", "I",
- "Ň", "N",
- "Ó", "O",
- "Ř", "R",
- "Š", "S",
- "Ť", "T",
- "Ú", "U",
- "Ů", "U",
- "Ý", "Y",
- "Ž", "Z",
- "á", "a",
- "č", "c",
- "ď", "d",
- "é", "e",
- "ě", "e",
- "í", "i",
- "ň", "n",
- "ó", "o",
- "ř", "r",
- "š", "s",
- "ť", "t",
- "ú", "u",
- "ů", "u",
- "ý", "y",
- "ž", "z",
-// END.
-0
-};
-
-//! used for O(1) character transformations in char2Identifier()
-static KStaticDeleter< QDict<QCString> > string2Identifier_deleter;
-static QDict<QCString>* string2Identifier_dict = 0;
-
 inline QString char2Identifier(const QChar& c)
 {
-    if ((c>='a' && c<='z') || (c>='A' && c<='Z') || (c>='0' && c<='9') || c=='_')
-        return QString(c);
-    else {
-        if (!string2Identifier_dict) {
-            //build dictionary for later use
-            string2Identifier_deleter.setObject( string2Identifier_dict, new QDict<QCString>(1009) );
-            string2Identifier_dict->setAutoDelete(true);
-            for (const char **p = string2Identifier_table; *p; p+=2) {
-                string2Identifier_dict->replace( /* replace, not insert because there may be duplicates */
-                    QString::fromUtf8(*p), new QCString(*(p+1)) );
-            }
-        }
-        const QCString *fixedChar = string2Identifier_dict->find(c);
-        if (fixedChar)
-            return *fixedChar;
-    }
-    return QString(QChar('_'));
+    kdDebug() << c << ": " << QString("0x%1").arg(c.unicode(),0,16) << endl;
+
+    if (c.unicode() >= TRANSLITERATION_TABLE_SIZE)
+        return QString(QChar('_'));
+    const char *const s = transliteration_table[c.unicode()];
+    return s ? QString::fromLatin1(s) : QString(QChar('_'));
 }
 QString KexiUtils::string2Identifier(const QString &s)
 {
+    if (s.isEmpty())
+        return QString::null;
     QString r, id = s.simplifyWhiteSpace();
     if (id.isEmpty())
-        return id;
+        return QString::null;
     r.reserve(id.length());
-//    return "_";
     id.replace(' ',"_");
     QChar c = id[0];
+    QString add;
+    bool wasUnderscore = false;
     if (c>='0' && c<='9') {
         r+='_';
         r+=c;
-    } else
-        r+=char2Identifier(c);
+    } else {
+        add = char2Identifier(c);
+        r+=add;
+        wasUnderscore = add == "_";
+    }
-    for (uint i=1; i<id.length(); i++)
-        r+=char2Identifier(id.at(i));
+    for (uint i=1; i<id.length(); i++) {
+        add = char2Identifier(id.at(i));
+        if (wasUnderscore && add == "_")
+            continue;
+        wasUnderscore = add == "_";
+        r+=add;
+    }
     return r;
 }

** branches/koffice/1.6/koffice/kexi/kexiutils/transliteration_table.h.bz2 #property svn:mime-type
   + application/octet-stream

** branches/koffice/1.6/koffice/kexi/kexiutils/update_transliteration_table_patch.sh #property svn:executable
   + *
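For readers without a Qt 3 environment, the behaviour of the committed string2Identifier() can be approximated with the standard library alone. The sketch below makes several assumptions. The function transliterate() is a hypothetical stand-in for the generated transliteration_table.h and knows only four characters. ASCII letters, digits and '_' are passed through directly instead of being looked up in the table. The input is assumed to be already whitespace-simplified, so a remaining space simply falls through to '_'.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the generated table: returns the ASCII
// replacement for a code point, or nullptr if there is none.
const char* transliterate(char32_t c)
{
    switch (c) {
    case 0x00DA: return "U";  // Ú
    case 0x00FA: return "u";  // ú
    case 0x011B: return "e";  // ě
    case 0x00E4: return "ae"; // ä
    default:     return nullptr;
    }
}

// Maps one character to its identifier form; unknown characters become "_".
std::string charToIdentifier(char32_t c)
{
    const bool asciiOk = (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z')
        || (c >= U'0' && c <= U'9') || c == U'_';
    if (asciiOk)
        return std::string(1, static_cast<char>(c));
    const char* s = transliterate(c);
    return s ? std::string(s) : std::string("_");
}

// Builds an identifier: prefixes '_' if the text starts with a digit and
// collapses consecutive '_' replacements into one, as the commit does.
std::string toIdentifier(const std::u32string& id)
{
    if (id.empty())
        return std::string();
    std::string r;
    bool wasUnderscore = false;
    if (id[0] >= U'0' && id[0] <= U'9')
        r += '_'; // an identifier must not start with a digit
    for (char32_t c : id) {
        const std::string add = charToIdentifier(c);
        if (wasUnderscore && add == "_")
            continue; // skip repeated underscores
        wasUnderscore = (add == "_");
        r += add;
    }
    return r;
}
```

With this stand-in table, the table name from the original report, "Úvěry", becomes "Uvery" instead of "_v_ry", and a run of unmapped characters produces a single '_'.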