Bug 362647

Summary: Can't search with Chinese characters
Product: [Frameworks and Libraries] frameworks-baloo    Reporter: Ray Chen <swyear>
Component: general    Assignee: Pinak Ahuja <pinak.ahuja>
Status: RESOLVED FIXED    
Severity: major CC: guotao945, kazuspara, laichiaheng, leodream2008, nate, ottwolt, pinak.ahuja, renyuneyun, stefan.bruens, taocrismon, tianer2820, zwpwjwtz
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: openSUSE   
OS: Linux   
See Also: https://bugs.kde.org/show_bug.cgi?id=333037
Latest Commit:    Version Fixed In: 5.76.0

Description Ray Chen 2016-05-04 04:17:48 UTC
Files are indexed and you can search with the English part of a filename, but you get nothing when searching for the Chinese part of the filename.

Reproducible: Always

Steps to Reproduce:
1. Create some files with Chinese characters in the name, for example: touch 測試testcase.txt
(測試 means "test")
2. Run baloosearch testcase; the file appears in the list:
/home/******/test/測試testcase.txt
Elapsed: 0.706571 msecs
3. Run baloosearch 測試 or baloosearch "測試"; you only get:
Elapsed: 0.368735 msecs

Actual Results:  
baloosearch 測試
Elapsed: 0.368735 msecs
The test file is not found.

Expected Results:  
測試testcase.txt should be listed.

This bug affects Dolphin search, desktop search and KRunner, so no KF5 application can search for Chinese characters.
It may also affect searching for Japanese and Korean characters (maybe, not sure).
Comment 1 Tao Guo 2016-09-19 03:33:12 UTC
I can confirm this bug on Arch Linux with Baloo 5.24.
Comment 2 zwpwjwtz 2016-09-25 14:53:19 UTC
Confirmed, using Baloo 5.26.
Comment 3 kazuhirokunishi 2016-12-10 17:49:08 UTC
Same situation with baloo-5.26.0-r2 on Gentoo amd64. Yes, it also affects the Chinese characters (kanji) used in Japanese.
Comment 4 Rui Zhao 2017-01-13 22:03:58 UTC
I am on Arch with Baloo 5.29.0 and have the same problem.
Actually, this has been happening for a long time (if I remember correctly, I first encountered this issue no later than 2015).

I believe another bug report #356474 also refers to the same issue.

One more detail I should mention:

Since I have filenames with both Chinese and Latin characters: when searching for the full filename, only the Latin part is considered by baloosearch (the results contain other files that also match the Latin part).

e.g.
The result of `baloosearch abc測試` is the same as `baloosearch abc`.



I think many people just turn Baloo off because of the CPU consumption during indexing (which, as you all know, can take quite a long time), so they may not be aware of this issue.
Comment 5 Rui Zhao 2018-03-12 18:59:05 UTC
*** Bug 356474 has been marked as a duplicate of this bug. ***
Comment 6 Rui Zhao 2018-03-12 22:07:46 UTC
http://swyear.blogspot.co.uk/2016/05/cant-use-baloosearch-for-chinese.html
has a detailed description and examples (screenshots) of this problem.

The description is in English.


(Also mentioned in the page above)
https://bugs.kde.org/show_bug.cgi?id=333037#c25
has a quick patch, but it doesn't seem to have been merged into trunk yet. There is also some (possibly) in-depth discussion about why this happens.
Comment 7 Stefan Brüns 2018-03-20 20:34:38 UTC
I think a good start would be to create a database of testcases, so that even a developer not proficient in a specific script can test and improve the coverage.

One possible format could be:

# description of the testcase
! filename_1234.png
+ filename ;; match filename
+ png ;; match png
+ 1234 ;; match 1234
- file ;; do not match file
+ file* ;; match with wildcard

# chinese testcase 1: match 測試 ("test")
! 測試testcase.txt
+ 測試
+ testcase
+ txt
;- 試測 ;; expected to fail, requires dictionary
+ 測試 txt

Testcases should probably be split into one file per language (combination).
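
A minimal sketch of how such a testcase file could be read, assuming the line prefixes proposed above ('#' description, '!' filename, '+' must match, '-' must not match, ';' disabled, ';;' inline comment). The struct and function names are made up for illustration and are not part of Baloo:

// Sketch only: reader for the proposed testcase format.
#include <QFile>
#include <QString>
#include <QStringList>
#include <QTextStream>
#include <QVector>

struct SearchCase {
    QString description;    // line starting with '#'
    QString fileName;       // line starting with '!'
    QStringList mustMatch;  // lines starting with '+'
    QStringList mustMiss;   // lines starting with '-'
};

QVector<SearchCase> readCases(const QString &path)
{
    QVector<SearchCase> cases;
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) {
        return cases;
    }

    QTextStream in(&file);
    SearchCase current;
    while (!in.atEnd()) {
        QString line = in.readLine();
        // Strip the ";; comment" part and surrounding whitespace.
        line = line.section(QStringLiteral(";;"), 0, 0).trimmed();
        if (line.isEmpty() || line.startsWith(QLatin1Char(';'))) {
            continue;               // blank or disabled (";- ...") line
        }
        const QChar tag = line.at(0);
        const QString value = line.mid(1).trimmed();
        if (tag == QLatin1Char('#')) {
            if (!current.fileName.isEmpty()) {
                cases << current;   // '#' starts a new testcase
                current = SearchCase();
            }
            current.description = value;
        } else if (tag == QLatin1Char('!')) {
            current.fileName = value;
        } else if (tag == QLatin1Char('+')) {
            current.mustMatch << value;
        } else if (tag == QLatin1Char('-')) {
            current.mustMiss << value;
        }
    }
    if (!current.fileName.isEmpty()) {
        cases << current;
    }
    return cases;
}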

Correct handling of languages like Chinese, where words are not separated, is hard and, as far as I understand, requires a dictionary. The best one can currently do is to split at the grapheme level. This would likely create a lot of false positives when searching, but false positives are IMHO better than no results at all.
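
To illustrate, here is a minimal sketch of grapheme-level splitting using Qt's QTextBoundaryFinder, where Latin runs stay whole and every CJK grapheme becomes its own term. The tokenize() function and the CJK range check are illustrative assumptions, not Baloo's actual tokenizer:

// Sketch only (not Baloo's code): walk the grapheme clusters of a string,
// emit Latin/digit runs as whole terms and every CJK grapheme as its own
// term, so "測試testcase.txt" becomes ("測", "試", "testcase", "txt").
#include <QDebug>
#include <QString>
#include <QStringList>
#include <QTextBoundaryFinder>

static bool isCjk(const QString &cluster)
{
    // Rough range check (assumption: CJK Unified Ideographs only; a real
    // tokenizer would also cover Hiragana, Katakana, Hangul, extensions, ...).
    const uint u = cluster.isEmpty() ? 0 : cluster.at(0).unicode();
    return u >= 0x4E00 && u <= 0x9FFF;
}

QStringList tokenize(const QString &text)
{
    QStringList terms;
    QString current;            // pending run of Latin letters/digits
    auto flush = [&]() {
        if (!current.isEmpty()) {
            terms << current.toLower();
            current.clear();
        }
    };

    QTextBoundaryFinder finder(QTextBoundaryFinder::Grapheme, text);
    int start = 0;
    while (finder.toNextBoundary() != -1) {
        const QString cluster = text.mid(start, finder.position() - start);
        start = finder.position();

        if (isCjk(cluster)) {
            flush();
            terms << cluster;                       // one term per ideograph
        } else if (cluster.at(0).isLetterOrNumber()) {
            current += cluster;                     // extend the Latin run
        } else {
            flush();                                // separator: '.', ' ', ...
        }
    }
    flush();
    return terms;
}

int main()
{
    qDebug() << tokenize(QStringLiteral("測試testcase.txt"));
    // -> ("測", "試", "testcase", "txt")
    return 0;
}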
Comment 8 Rui Zhao 2018-03-21 23:40:37 UTC
Stefan, thanks for the suggestion of writing test cases.

Yes, searching at the character level (grapheme, if I understand that word correctly) is far better than nothing.
As far as I know (as a native Chinese speaker), many (or even most) Chinese users are happy enough if the software handles things at the character level.


Actually, using a dictionary is still not enough for Chinese -- it often happens that three (or more) characters can be split in two different ways and both make sense without context.
A simple example of this scenario could be "化學生": both "化學" (chemistry) and "學生" (student) make sense (moreover, sometimes "化學生" also makes sense, meaning "a student whose major is in chemistry"), so context is the only way we can tell how to correctly split them (e.g. "教化學生" will most likely be split into "教化" [enlighten/teach] and "學生" [student]).
Correctly handling Chinese words requires more sophisticated natural language processing techniques (e.g. machine learning), and I think that would be far beyond today's Baloo (or maybe even any search/index engine). (I studied machine learning and natural language processing during my master's, so it should be safe for me to say that today's NLP techniques [for Chinese word segmentation] are not yet good enough for production use [compared with character-level splitting, and judged from a user's point of view, i.e. a false positive is better than a false negative].)

Classical Chinese (a style of composing sentences and of understanding characters/words, not to be confused with the difference between "Traditional Chinese" and "Simplified Chinese") makes the situation more difficult. Almost all historical texts (e.g. historical records/books/poems, and there are quite a LOT) are written in Classical Chinese, and Chinese people still study Classical Chinese and read those texts today (though we usually don't write in Classical Chinese anymore). Even humans may need some effort to read a piece of text written in Classical Chinese (but Classical Chinese is very concise, which is one of the reasons it exists).
However, in Classical Chinese, characters "are" words in many cases. Splitting by characters is a very good choice.
Comment 9 Michael Heidelbach 2018-03-22 21:45:48 UTC
There is work in progress to tackle this. 
https://phabricator.kde.org/D11552
Comment 10 Toby 2022-07-08 14:19:53 UTC
I still encounter this on Fedora 36 in 2022. Has the fix been merged yet?