files are indexed, you can search with English part of filename, But you get nothing when searching the Chinese part of filename. Reproducible: Always Steps to Reproduce: 1.creat some files with Chinese characters, example: touch 測試testcase.txt 測試: means test 2.baloosearch testcase, you can see this file in list /home/******/test/測試testcase.txt Elapsed: 0.706571 msecs 3.baloosearch 測試 or baloosearch "測試" you only get Elapsed: 0.368735 msecs Actual Results: baloosearch 測試 Elapsed: 0.368735 msecs can't find my test file Expected Results: should list 測試testcase.txt this bug affect dolphin search , Desktop search and krunner so All KF5 system can't search any Chinese characters may affect Japanese , Korean characters search (maybe, not sure)
I can confirm this bug on archlinux with baloo 5.24
Confirmed, using Baloo 5.26.
The same situation with baloo-5.26.0-r2 on Gentoo amd64. Yes, it affects Chinese letters in Japanese.
I am on arch with baloo 5.29.0 and have the same situation. Actually, this is happening a long time ago (if I'm correct, I encountered this issue no latter than 2015). I believe another bug report #356474 also refers to the same issue. One more detail I should mention: Since I have filenames with both Chinese characters and latin characters, when searching with the full filename, only the latin part is considered by baloosearch (the result contains other files which also have the same term as the latin part) e.g. The result of `baloosearch abc測試` is the same as `baloosearch abc`. I think many people just turn baloo off because of the cpu consumption when indexing (which could take quite a long time as you all know) so they may not be aware of this issue.
*** Bug 356474 has been marked as a duplicate of this bug. ***
http://swyear.blogspot.co.uk/2016/05/cant-use-baloosearch-for-chinese.html has detailed description and examples (screenshots) of this problem. The description is in English. (Also mentioned in the page above) https://bugs.kde.org/show_bug.cgi?id=333037#c25 has a quick patch, but doesn't seem to be merged into the trunk yet. There are also some (possibily) in depth discussion about why this happens.
I think a good start would be to create a database of testcases, so even a developer not proficient in a specific script can test and improve the coverage. One possible format could be: # description of the testcase ! filename_1234.png + filename ;; match filename + png ;; match png + 1234 ;; match 1234 - file ;; do not match file + file* ;; match with wildcard # chinese testcase 1: match 測試 ("test") ! 測試testcase.txt + 測試 + testcase + txt ;- 試測 ;; expected to fail, requires dictionary + 測試 txt Testcases should probably be split into one file per language (combination). Correct handling of languages like chinese - where words are not separated - is hard, and requires a dictionary as far as I understand. The best one can do currently is to split at a grapheme level. This would likely create a lot of false positives when searching, but false positives are IMHO better than no results at all.
Stefan, thanks for the suggestion of writting test cases. Yes, searching in character level (grapheme, if I understand this word correctly) is far better than nothing. To what I know (as a native Chinese speaker), many (or even most) Chinese people are happy enough if the software can deal with things in character level. Actually, using a dictionary is still not enough for Chinese -- it often happens that three (or more) character can be split in two different ways and they both make sense without context. A simple example of this scenario could be "化學生": both "化學" (chemistry) and "學生" (student) make sense (moreover, sometimes "化學生" also makes sense, meaning "a student whose major is in chemistry"), so context is the only way we can tell how to correctly split them (e.g. "教化學生" will most likely be split into "教化" [enlighten/teach] and "學生" [student]). Correctly handling of Chinese words requires more sophisticated Natural Language Processing techniques (e.g. using machine learning), and I think that would be far beyond today's baloo (or maybe even any search / index engines). (I have studied machine learning and natural language processing during my masters, so it should be safe for me to say that today's NLP technique [for Chinese word-splitting] is not yet good enough to be used in production [compared with character level and judged in a user's sense, i.e. false positive is better than false negative].) Classical Chinese (this is a style of composing sentences and ways of understanding characters / words, not like the different between "Traditional Chinese" and "Simplified Chinese") makes the situation more difficult. Almost all historical texts (e.g. history recordings / books / poems) (there are quite a LOT) are written in Classical Chinese, and nowadays Chinese people still study Classical Chinese and read those things (though we usually don't write in Classical Chinese). Even humans may still need some effort to read a piece of text written in Classical Chinese (but Classical Chinese is very very consise, that's one of the reasons it exists). However, in Classical Chinese, characters "are" words in many cases. Splitting by characters is a very good choice.
There is work in progress to tackle this. https://phabricator.kde.org/D11552
I still encounter this in Fedora36 in 2022. Was the fix merged yet?