Bug 176537 - Sonnet fails to do spellcheck on Indian languages
Summary: Sonnet fails to do spellcheck on Indian languages
Status: RESOLVED FIXED
Alias: None
Product: frameworks-sonnet
Classification: Frameworks and Libraries
Component: general
Version: 5.57.0
Platform: Mageia RPMs Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Waqar Ahmed
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-11-30 12:55 UTC by Santhosh Thottingal
Modified: 2020-05-22 18:44 UTC
CC List: 9 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
screenshot of Kate showing the issue (56.94 KB, image/png)
2019-06-06 16:04 UTC, animtim
Text file to reproduce the issue (333 bytes, text/plain)
2019-06-06 16:06 UTC, animtim
Screenshot of frameworks+Kate from master from October (361.80 KB, image/png)
2019-11-03 11:54 UTC, Eike Hein

Description Santhosh Thottingal 2008-11-30 12:55:04 UTC
Version:            (using Devel)
OS:                Linux
Installed from:    Compiled sources

Sonnet fails to recognize word boundaries in a string and passes fragments of words to the spellchecker, which in turn marks almost all words as misspelled.
The problem is in the sentence-to-word splitting used in the filter kde4/kdelibs/kdecore/sonnet/filter.cpp. The isLetter() function fails for Indic languages because of this bug in glibc: https://bugzilla.redhat.com/show_bug.cgi?id=466912

Nothing to be done in the Sonnet code, I guess. Once glibc is fixed, this should be fixed too, but we could try workarounds in the meantime. This bug leaves KDE spellchecking for Indic languages in an unusable state.

How to reproduce:
The following code snippet can be used for reproducing the bug
QChar letter = 'ी';
fprintf(stdout,"%d\n", letter.isLetter());
Comment 1 Anne-Marie Mahfouf 2009-03-04 09:52:49 UTC
The letter you use in this example is not a letter.
Get its unicode and see that it is
http://www.fileformat.info/info/unicode/char/a580/index.htm
Comment 2 Santhosh Thottingal 2009-03-04 16:06:41 UTC
The Unicode code point you referred to is wrong.
Just copy that letter and use it in this code snippet (Python):
>>> print repr(u'ी')
u'\u0940'

It is U+0940 DEVANAGARI VOWEL SIGN II. I am referring to Hindi (hi_IN). This is the vowel sign of U+0908 DEVANAGARI LETTER II. If Sonnet (or any application) considers this vowel mark a word delimiter and splits a word there, it is buggy.
The bug is just like splitting the English sentence "This is cat" into "Th" "is" "is" "c" "at".
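
To make the point concrete, here is a minimal Python sketch (not Sonnet's code) contrasting a splitter that accepts only letter-category characters with one that also accepts combining marks. The function names `naive_words` and `mark_aware_words` are made up for illustration, and Python's `unicodedata` module stands in for whatever character classification Sonnet relies on.

```python
import unicodedata

def _split(text, is_word_char):
    """Split text into runs of characters accepted by is_word_char."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

def naive_words(text):
    # Only general categories L* count as word characters (like an isLetter() check).
    return _split(text, lambda ch: unicodedata.category(ch).startswith("L"))

def mark_aware_words(text):
    # Combining marks (M*), such as vowel signs, also belong to the word.
    return _split(text, lambda ch: unicodedata.category(ch)[0] in ("L", "M"))

# One Hindi word written with escapes: HA, VOWEL SIGN I, ANUSVARA, DA, VOWEL SIGN II
hindi = "\u0939\u093f\u0902\u0926\u0940"
print(unicodedata.category("\u0940"))  # general category of the vowel sign
print(naive_words(hindi))
print(mark_aware_words(hindi))
```

The naive splitter breaks the single word into fragments at each vowel sign, while the mark-aware one returns it whole.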
Comment 3 Anne-Marie Mahfouf 2009-03-04 19:05:19 UTC
Qt's QChar::unicode() returns 42368 for the above char.

If this is wrong then issue a Qt bug report with a small test case and indicate what unicode number it should return.
I spoke this morning with an Indian Qt guy who told me Qt does not misbehave. If you have proof of the contrary, please send to Qt bug tracker with test case and information.
what I was told: "42368 is 0xA580. that isn't a malayalam character (i am not a malayalam speaker though, but i am 99% sure)" with the reference
http://www.fileformat.info/info/unicode/char/a580/index.htm

If this is wrong, please give proof and explain what is expected.
Comment 4 Pravin S 2009-03-05 06:09:35 UTC
I think Qt's QChar expects UTF-16 as input, and it looks like it was mistakenly given the UTF-8 encoding of the character 'ी'.
'ी' in UTF-8 = 0xe0 0xa5 0x80
Somehow the 'e0' byte is lost along the way, and unicode() returns 42368, which is indeed the decimal value of 0xa580.
Please check this.

http://doc.trolltech.com/snapshot/qchar.html#unicode-2

About the bug: the Unicode Character Database data identifies Indic matras as punctuation, which is wrong. We recently fixed this in glibc,
and now iswdigit and iswalpha return the correct values for Indic characters.

Hopefully Sonnet uses glibc to get the character type, in which case testing with the latest version should fix the problem.
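
The byte arithmetic described in this comment can be checked with a short Python sketch. It assumes, as the comment suggests, that the three UTF-8 bytes were packed into one integer and then truncated to 16 bits; that mechanism is an assumption here, not something confirmed in this report.

```python
ch = "\u0940"  # DEVANAGARI VOWEL SIGN II

utf8 = ch.encode("utf-8")             # three bytes: e0 a5 80
packed = int.from_bytes(utf8, "big")  # bytes packed into one integer
truncated = packed & 0xFFFF           # keep the low 16 bits, dropping the e0 byte

# Compare the truncated value with the real code point of the character.
print(utf8.hex(), hex(packed), truncated, hex(ord(ch)))
```

The truncated value matches the 42368 (0xA580) reported in comment 3, while the actual code point is 0x0940.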
Comment 5 animtim 2019-06-06 16:02:54 UTC
Sorry, but it is very sad to see that after more than 10 years this bug still occurs in all KDE applications.

Tested with Hunspell and a Malayalam dictionary; it is clearly broken and unusable for that language.

So I'm reopening the bug, I hope you can find out how to fix this. If it's a Qt bug, please forward it there.

I'm adding a screenshot showing the issue, and the corresponding text file to reproduce. Let us know if you need more info.

(reproduced using sonnet 5.57, and hunspell 1.7.0)
Comment 6 animtim 2019-06-06 16:04:29 UTC
Created attachment 120630
screenshot of Kate showing the issue

This screenshot clearly shows the issue. The Malayalam dictionary is selected.
Comment 7 animtim 2019-06-06 16:06:20 UTC
Created attachment 120631
Text file to reproduce the issue

You can use this file to troubleshoot the issue, as it contains perfectly valid Malayalam content that should not be flagged as incorrect by the spell checker.
Comment 8 Eike Hein 2019-11-03 11:54:25 UTC
Created attachment 123694
Screenshot of frameworks+Kate from master from October

I can't reproduce this problem with the Malayalam text file above.

Sonnet::TextBreaks was ported to QTextBoundaryFinder quite some time ago, which shouldn't be suffering the glibc problem; glibc was patched in ~2009, too, though.

As per the screenshot, the words are identified correctly and hunspell doesn't return misspellings.
Comment 9 aiswarya 2019-11-04 17:43:01 UTC
I reproduced it by launching Kate from the terminal. Here is the error message:
sonnet.core: Missing trigrams for languages: QSet("he_IL", "ml_IN")

Note: I have hunspell-ml dictionary installed
OS: Mageia 7
hunspell 1.7
sonnet 5.57
Comment 10 Eike Hein 2019-11-08 20:16:09 UTC
Missing trigrams could mean that automatic language detection failed. Maybe this is the problem. Can you try overriding it by disabling automatic language detection and setting Malayalam as the default language?

If this works (as it did here), then the problem is in this code:

https://cgit.kde.org/sonnet.git/tree/src/core/guesslanguage.cpp

It's worth noting that trigram-based detection is a fallback codepath. We'd first need to determine why the script-based detection here fails:

https://cgit.kde.org/sonnet.git/tree/src/core/guesslanguage.cpp#n172

It's true that the trigram data set doesn't have trigrams for Malayalam, though, which could be worth contributing if you can:

https://cgit.kde.org/sonnet.git/tree/data/trigrams
Comment 11 aiswarya 2019-11-14 19:25:24 UTC
Indeed, when I set Malayalam as the default language, it works. The problem still exists when I only select the language from "choose dictionary".

I will try to contribute trigrams.
Comment 12 Waqar Ahmed 2019-11-23 18:01:34 UTC
> https://cgit.kde.org/sonnet.git/tree/src/core/guesslanguage.cpp

> It's worth noting that trigram-based detection is a fallback codepath. We'd first need to determine why the script-based detection here fails:

As far as I can understand, trigram-based detection isn't a fallback. It is the first source, and if that fails, Sonnet brute-forces the dictionaries for a guess.

The script detection is working fine, but if the detected script doesn't have any trigrams, no candidate languages are returned. When no candidate languages are returned, Sonnet goes ahead and checks the sample text with the 'default language' and fails.

The solution that can fix this particular issue is quite simple. We just check the script of the text, then we count the number of languages that script has. If it has only one, as in this case, we just return that language. There can be more than one language for Latin-script languages, though.
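
A rough Python sketch of the idea, not the submitted patch: the table `LANGUAGES_BY_SCRIPT`, the helper names, and the trick of reading the script from the first word of the Unicode character name are all assumptions made for illustration. Sonnet's real script and language tables differ.

```python
import unicodedata
from collections import Counter

# Hypothetical script -> dictionary-language table (illustrative only).
LANGUAGES_BY_SCRIPT = {
    "MALAYALAM": ["ml_IN"],
    "HEBREW": ["he_IL"],
    "LATIN": ["en_US", "fr_FR", "de_DE"],
}

def dominant_script(text):
    """Return the most frequent script among letters and marks, or None."""
    counts = Counter()
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M"):
            # Assumption: the first word of the character name names the script.
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None

def guess_language(text, default=None):
    """If the script maps to exactly one language, return it directly."""
    candidates = LANGUAGES_BY_SCRIPT.get(dominant_script(text), [])
    if len(candidates) == 1:
        return candidates[0]
    # Several candidates (or none): a real implementation would score
    # trigrams here; this sketch just falls back to the default language.
    return default

malayalam = "\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02"
print(guess_language(malayalam, default="en_US"))
print(guess_language("hello world", default="en_US"))
```

With a single-language script such as Malayalam the guess no longer depends on trigram data, whereas Latin text still needs a further disambiguation step.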
Comment 13 Waqar Ahmed 2019-11-23 18:26:34 UTC
Patch submitted at: https://phabricator.kde.org/D25495
Comment 14 Nate Graham 2020-05-22 18:44:45 UTC
This was fixed two months ago when that patch landed.