Version: (using Devel) OS: Linux Installed from: Compiled sources Sonnet fails to recognize word boundaries in a string and passes fragments of words to the spellchecker, which in turn marks almost all words as misspelled. The problem is in the sentence-to-word splitting used in the filter, kde4/kdelibs/kdecore/sonnet/filter.cpp. The isLetter() function fails for Indic languages because of this bug in glibc: https://bugzilla.redhat.com/show_bug.cgi?id=466912 There is nothing to be done in the Sonnet code, I guess; once glibc is fixed, this should be fixed too. But we may try workarounds as well. This bug leaves KDE spellchecking for Indic languages in an unusable state. How to reproduce: The following code snippet can be used to reproduce the bug: QChar letter = 'ी'; fprintf(stdout,"%d\n", letter.isLetter());
The character you use in this example is not a letter. Get its Unicode code point and see that it is http://www.fileformat.info/info/unicode/char/a580/index.htm
The Unicode code point you referred to is wrong. Just copy that character and use it in this code snippet (Python): >>> print repr(u'ी') u'\u0940' It is U+0940 DEVANAGARI VOWEL SIGN II. I am referring to Hindi (hi_IN). This is the vowel sign of U+0908 DEVANAGARI LETTER II. If Sonnet (or any application) considers this vowel mark a word delimiter and splits a word there, it is buggy. The bug is just like splitting the English sentence "This is cat" into "Th" "is" "is" "c" "at".
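To illustrate the point above, here is a minimal Python 3 sketch (not Sonnet's actual code) of what happens when a word splitter accepts only characters in the letter categories. The helper names `split_words` and `is_word_char` are hypothetical, and the sample word is just an ordinary Hindi word containing two vowel signs, including U+0940.

```python
import unicodedata

VOWEL_SIGN_II = "\u0940"             # DEVANAGARI VOWEL SIGN II
SAMPLE_WORD = "\u092a\u093e\u0928\u0940"  # Hindi word: PA, VOWEL SIGN AA, NA, VOWEL SIGN II


def is_word_char(ch):
    """Hypothetical predicate: letters (L*) and combining marks (M*) belong to a word."""
    return unicodedata.category(ch)[0] in ("L", "M")


def split_words(text, keep):
    """Split text into runs of characters for which keep(ch) is true."""
    words, current = [], []
    for ch in text:
        if keep(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# The vowel sign is a spacing combining mark, not a letter.
print(unicodedata.category(VOWEL_SIGN_II))

# Letters only: the word is cut at every vowel sign.
letters_only = split_words(SAMPLE_WORD, str.isalpha)

# Letters plus marks: the word stays in one piece.
letters_and_marks = split_words(SAMPLE_WORD, is_word_char)

print(len(letters_only), len(letters_and_marks))
```

With the letters-only predicate the single word comes back as two fragments with the vowel signs dropped, which is the same kind of breakage as the "Th" "is" "is" "c" "at" example.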
Qt's QChar::unicode() returns 42368 for the above character. If this is wrong, then file a Qt bug report with a small test case and indicate what Unicode number it should return. I spoke this morning with an Indian Qt developer who told me Qt does not misbehave. If you have proof to the contrary, please send it to the Qt bug tracker with a test case and information. What I was told: "42368 is 0xA580. that isn't a malayalam character (i am not a malayalam speaker though, but i am 99% sure)", with the reference http://www.fileformat.info/info/unicode/char/a580/index.htm If this is wrong, please give proof and explain what is expected.
I think Qt's QChar::unicode() requires UTF-16 as input, and it looks like it was mistakenly given the UTF-8 encoding of the character 'ी'. The UTF-8 encoding of 'ී' is not involved here; for 'ी' it is 0xe0 0xa5 0x80. Somehow the 'e0' is lost somewhere, and the function returns the decimal value of 0xa580, i.e. 42368, which matches what you see. Please check http://doc.trolltech.com/snapshot/qchar.html#unicode-2 As for the bug itself: Indic matras were being identified as punctuation (Punct) from the Unicode Character Database data, which is wrong. We have recently fixed this in glibc, and iswdigit and iswalpha now return correct values for Indic characters. Hopefully Sonnet uses glibc to get the character type, in which case testing with the latest version should fix the problem.
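The arithmetic behind the 42368 value can be checked with a few lines of Python 3. How exactly the leading byte was lost is not established in this thread; the 16-bit truncation below is only an assumption that happens to reproduce the reported number.

```python
# Byte-level arithmetic for the character discussed above (U+0940).
ch = "\u0940"  # DEVANAGARI VOWEL SIGN II

code_point = ord(ch)             # the value a UTF-16 code unit should hold
utf8_bytes = ch.encode("utf-8")  # three bytes: e0 a5 80

# Assumption: the three UTF-8 bytes were packed into one integer and then
# truncated to 16 bits, which drops the leading 0xe0 byte.
packed = int.from_bytes(utf8_bytes, "big")
truncated = packed & 0xFFFF

print(hex(code_point), utf8_bytes.hex(), truncated)
# prints: 0x940 e0a580 42368
```

So the expected value for this character is 0x0940 (2368 decimal), while 0xA580 (42368) is what remains of its UTF-8 encoding once the first byte is gone.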
Sorry, but it is very sad to see that after more than 10 years this bug still occurs in all KDE applications. I tested with Hunspell and a Malayalam dictionary; spellchecking is clearly broken and unusable for that language. So I'm reopening the bug, and I hope you can find out how to fix this. If it's a Qt bug, please forward it there. I'm adding a screenshot showing the issue, and the corresponding text file to reproduce it. Let us know if you need more info. (Reproduced using sonnet 5.57 and hunspell 1.7.0.)
Created attachment 120630: screenshot of Kate showing the issue. This screenshot clearly shows the issue. The Malayalam dictionary is selected.
Created attachment 120631: text file to reproduce the issue. You can use this file to troubleshoot the issue, as it contains perfectly valid Malayalam content that should not be flagged as incorrect by the spell checker.
Created attachment 123694: screenshot of frameworks+Kate from master from October. I can't reproduce this problem with the Malayalam text file above. Sonnet::TextBreaks was ported to QTextBoundaryFinder quite some time ago, so it shouldn't be suffering from the glibc problem; glibc was patched in ~2009 anyway. As the screenshot shows, the words are identified correctly and hunspell doesn't report misspellings.
I reproduced it by launching Kate from the terminal. Here is the error message: sonnet.core: Missing trigrams for languages: QSet("he_IL", "ml_IN") Note: I have the hunspell-ml dictionary installed. OS: Mageia 7, hunspell 1.7, sonnet 5.57
Missing trigrams could mean that automatic language detection failed. Maybe this is the problem. Can you try overriding it by disabling automatic language detection and setting Malayalam as the default language? If this works (as it did here), then the problem is in this code: https://cgit.kde.org/sonnet.git/tree/src/core/guesslanguage.cpp It's worth noting that trigram-based detection is a fallback code path. We'd first need to determine why the script-based detection here fails: https://cgit.kde.org/sonnet.git/tree/src/core/guesslanguage.cpp#n172 It's true that the trigram data set doesn't have trigrams for Malayalam, though, which could be worth contributing if you can: https://cgit.kde.org/sonnet.git/tree/data/trigrams
Indeed, when I set Malayalam as the default language it works. The problem still exists when just selecting the language from "choose dictionary". I will try to contribute trigrams.
> https://cgit.kde.org/sonnet.git/tree/src/core/guesslanguage.cpp > It's worth noting that trigram-based detection is a fallback code path. We'd first need to determine why the script-based detection here fails: As far as I can understand, trigram-based detection isn't a fallback. It is the first source, and if that fails, Sonnet brute-forces the dictionaries for a guess. The script detection is working fine, but if the detected script doesn't have any trigrams, no candidate languages are returned; and when no candidate languages are returned, Sonnet goes ahead and checks the sample text with the 'default language' and fails. The solution to this particular issue is quite simple. We just check the script of the text, then count the number of languages that script has. If it has only one, as in this case, we just return that language. There can be more than one language for a script, though, as with Latin-script languages.
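The proposed fix can be sketched roughly as follows. This is Python pseudocode-made-runnable rather than the actual Sonnet C++ code or the submitted patch; the tables `SCRIPT_LANGUAGES` and `TRIGRAM_LANGUAGES` and the function `candidate_languages` are hypothetical names with made-up sample data.

```python
# Hypothetical sample data: which languages are written in which script,
# and which languages ship a trigram model.
SCRIPT_LANGUAGES = {
    "Malayalam": ["ml_IN"],
    "Latin": ["en_US", "de_DE", "fr_FR"],
}
TRIGRAM_LANGUAGES = {"en_US", "de_DE", "fr_FR"}


def candidate_languages(script):
    """Return candidate languages for text written in the given script."""
    languages = SCRIPT_LANGUAGES.get(script, [])
    # Proposed shortcut: a script used by exactly one language identifies
    # that language directly, so no trigram model is needed.
    if len(languages) == 1:
        return list(languages)
    # Otherwise only languages with trigram data can be ranked; an empty
    # result means the caller has to fall back to another strategy.
    return [lang for lang in languages if lang in TRIGRAM_LANGUAGES]


print(candidate_languages("Malayalam"))
print(candidate_languages("Latin"))
```

Without the single-language shortcut, the Malayalam lookup in this sketch would return an empty list, which corresponds to the case described above where the checker ends up using the default language.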
Patch submitted at: https://phabricator.kde.org/D25495
This was fixed two months ago when that patch landed.