Bug 425020 - baloosearch's search support for Chinese, Japanese and Korean is very weak.
Summary: baloosearch's search support for Chinese, Japanese and Korean is very weak.
Status: RESOLVED FIXED
Alias: None
Product: frameworks-baloo
Classification: Frameworks and Libraries
Component: general
Version: unspecified
Platform: Arch Linux Linux
Importance: NOR major
Target Milestone: ---
Assignee: Stefan Brüns
URL:
Keywords:
Duplicates: 421385
Depends on:
Blocks:
 
Reported: 2020-08-05 04:55 UTC by logicinu
Modified: 2023-11-13 08:55 UTC

See Also:
Latest Commit:
Version Fixed In:


Attachments
Chinese (233.79 KB, image/png)
2020-08-05 04:55 UTC, logicinu
Korean (254.62 KB, image/png)
2020-08-05 04:55 UTC, logicinu
Japanese (338.53 KB, image/png)
2020-08-05 04:56 UTC, logicinu

Description logicinu 2020-08-05 04:55:16 UTC
Created attachment 130650 [details]
Chinese

SUMMARY
baloosearch's search support for Chinese, Japanese and Korean is very weak.


STEPS TO REPRODUCE
1. Create file
邓丽君 - 漫步人生路.mp3
いとうかなこ - スカイクラッドの観测者.mp3
량현량하 - 학교를 안갔어!.mp3

2. Do test

3. Screenshot

OBSERVED RESULT
Can't find anything

EXPECTED RESULT
Precise search

SOFTWARE/OS VERSIONS
Operating System: Arch Linux
KDE Plasma Version: 5.19.4
KDE Frameworks Version: 5.72.0
Qt Version: 5.15.0
Kernel Version: 5.7.9-zen1-1-zencjk
OS Type: 64-bit
Processors: 8 × AMD Ryzen 7 PRO 2700U w/ Radeon Vega Mobile Gfx
Memory: 14.6 GiB of RAM
Graphics Processor: AMD RAVEN

ADDITIONAL INFORMATION
Comment 1 logicinu 2020-08-05 04:55:47 UTC
Created attachment 130651 [details]
Korean
Comment 2 logicinu 2020-08-05 04:56:09 UTC
Created attachment 130652 [details]
Japanese
Comment 3 logicinu 2020-08-05 05:01:29 UTC
1. No response to Chinese, including Chinese characters in Japanese text
2. Whole words in Japanese and Korean are matched, but searching for a segment of a word gets no response
3. No response to the "-" symbol
Comment 4 Stefan Brüns 2020-08-05 19:23:58 UTC
Punctuation and whitespace are discarded, so '-' can never match.

The other part of your report is too vague. Do *not* use screenshots; they are pointless for a CLI program like baloosearch, and I will ignore them.

You can query the terms baloo has stored for a given file with 'balooshow -x <filename>'.
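
As a rough illustration of why '-' can never match (a sketch only, not Baloo's actual tokenizer): if terms are built solely from runs of letters and digits, punctuation and whitespace act as separators and never become terms themselves. The helper name filename_terms is hypothetical.

```python
import re


def filename_terms(name: str) -> list[str]:
    """Hypothetical helper: every run of letters/digits becomes a lowercase term.

    Punctuation and whitespace only separate terms and are then discarded,
    so a query for '-' has nothing to match against.
    """
    return sorted({term.lower() for term in re.findall(r"[^\W_]+", name)})


terms = filename_terms("2NE1 - Come Back Home.mp3.txt")
print(terms)         # ['2ne1', 'back', 'come', 'home', 'mp3', 'txt']
print("-" in terms)  # False
```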
Comment 5 logicinu 2020-08-06 04:44:31 UTC
logicinu@laptop: test » balooshow -x 邓丽君\ -\ 漫步人生路.mp3.txt                                                                                       1 [12:28:13]
1153977280582712835 65027 268681273 邓丽君 - 漫步人生路.mp3.txt [/home/logicinu/Music/test/邓丽君 - 漫步人生路.mp3.txt]
        Mtime: 1596688050 2020-08-06T12:27:30
        Ctime: 1596688050 2020-08-06T12:27:30
        Cached properties:
                行数: 1

内部信息
术语:Mplain Mtext T5 T8 Ttext X20-1 
文件名术语:Fmp3 Ftxt 
XAttr 个术语:
lineCount: 1
logicinu@laptop: test » balooshow -x 량현량하\ -\ 학교를\ 안갔어!.mp3.txt                                                                                  [12:28:17]
1153985544099790339 65027 268683197 량현량하 - 학교를 안갔어!.mp3.txt [/home/logicinu/Music/test/량현량하 - 학교를 안갔어!.mp3.txt]
        Mtime: 1596688141 2020-08-06T12:29:01
        Ctime: 1596688141 2020-08-06T12:29:01
        Cached properties:
                行数: 1

内部信息
术语:Mplain Mtext T5 T8 Ttext X20-1 
文件名术语:Fmp3 Ftxt F량현량하 F안갔어 F학교를 
XAttr 个术语:
lineCount: 1
logicinu@laptop: test » balooshow -x いとうかなこ\ -\ スカイクラッドの観测者.mp3.txt                                                                       [12:29:09]
1153981932032294403 65027 268682356 いとうかなこ - スカイクラッドの観测者.mp3.txt [/home/logicinu/Music/test/いとうかなこ - スカイクラッドの観测者.mp3.txt]
        Mtime: 1596688133 2020-08-06T12:28:53
        Ctime: 1596688133 2020-08-06T12:28:53
        Cached properties:
                行数: 1

内部信息
术语:Mplain Mtext T5 T8 Ttext X20-1 
文件名术语:Fmp3 Ftxt Fスカイクラット 
XAttr 个术语:
lineCount: 1
logicinu@laptop: test » balooshow -x 2NE1\ -\ Come\ Back\ Home.mp3.txt                                                                                     [12:29:17]
1153990079585254915 65027 268684253 2NE1 - Come Back Home.mp3.txt [/home/logicinu/Music/test/2NE1 - Come Back Home.mp3.txt]
        Mtime: 1596688188 2020-08-06T12:29:48
        Ctime: 1596688188 2020-08-06T12:29:48
        Cached properties:
                行数: 1

内部信息
术语:Mplain Mtext T5 T8 Ttext X20-1 
文件名术语:F2ne1 Fback Fcome Fhome Fmp3 Ftxt 
XAttr 个术语:
lineCount: 1
logicinu@laptop: test » balooshow -x Mark\ Ronson,Bruno\ Mars\ -\ Uptown\ Funk\(1\).mp3.txt                                                                [12:29:56]
1154019938197896707 65027 268691205 Mark Ronson,Bruno Mars - Uptown Funk(1).mp3.txt [/home/logicinu/Music/test/Mark Ronson,Bruno Mars - Uptown Funk(1).mp3.txt]
        Mtime: 1596688573 2020-08-06T12:36:13
        Ctime: 1596688573 2020-08-06T12:36:13
        Cached properties:
                行数: 1

内部信息
术语:Mplain Mtext T5 T8 Ttext X20-1 
文件名术语:F1 Fbruno Ffunk Fmark Fmars Fmp3 Fronson Ftxt Fuptown 
XAttr 个术语:
lineCount: 1
logicinu@laptop: test »
Comment 6 logicinu 2020-08-06 04:44:48 UTC
File search is for finding things. Baloo's word segmentation produces so few terms that nothing can be found. It would be better not to put this search function in Dolphin; it has no effect at all. For example, "(" finds nothing.
Comment 7 Stefan Brüns 2020-08-06 05:33:05 UTC
(In reply to logicinu from comment #6)
> File search is for finding things. Baloo's word segmentation produces so
> few terms that nothing can be found. It would be better not to put this
> search function in Dolphin; it has no effect at all. For example, "("
> finds nothing.

Ok, apparently you are only here to rant. Thanks and goodbye ...
Comment 8 logicinu 2020-08-06 05:40:02 UTC
So this is a rant? I don't want to say anything more either. Bye!
Comment 9 2wxsy58236r3 2020-08-06 07:43:07 UTC
@logicinu

Would you be willing to explain in detail, in Chinese, the problem you ran into?
Comment 10 logicinu 2020-08-06 07:59:59 UTC
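The default search in the Dolphin file manager uses baloosearch. As a search feature, isn't it rather useless? It is based on word segmentation, but its support is too limited: symbols are not supported, Chinese is not supported, and the segmentation is not fine-grained enough. For file search this is not what one wants at all; a search should be able to find any characters that occur in a file name. baloosearch cannot do the job, yet it is still the default search. Searching for "mp3" works, but searching for "3" does not. What is the point of it? I hope this can be improved.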
Comment 11 2wxsy58236r3 2020-08-06 09:18:46 UTC
Problem:
Baloo is unable to extract keywords when the filename contains Chinese (and Japanese) characters.

Example:
邓丽君 - 漫步人生路.mp3.txt

Expected result:
Baloo should at least extract "邓丽君", "漫步人生路", "mp3" and "txt".
For better results, it should extract "邓", "丽", "君", "漫", "步", "人", "生", "路", "mp3" and "txt".

Actual result:
Baloo only extracts "mp3" and "txt".
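
The two levels of extraction described above can be sketched as follows. This is an illustration under stated assumptions, not Baloo's code: coarse_terms and fine_terms are hypothetical names, and Han characters are detected with a simple check on the Unicode character name.

```python
import re
import unicodedata


def is_han(ch: str) -> bool:
    # Assumption: a character counts as Han if its Unicode name says so.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def coarse_terms(name: str) -> list[str]:
    """Split on punctuation/whitespace only; CJK runs stay whole."""
    return [term.lower() for term in re.findall(r"[^\W_]+", name)]


def fine_terms(name: str) -> list[str]:
    """Additionally split runs made up purely of Han characters into single characters."""
    result = []
    for term in coarse_terms(name):
        if all(is_han(ch) for ch in term):
            result.extend(term)  # one term per character
        else:
            result.append(term)
    return result


name = "邓丽君 - 漫步人生路.mp3.txt"
print(coarse_terms(name))  # ['邓丽君', '漫步人生路', 'mp3', 'txt']
print(fine_terms(name))    # ['邓', '丽', '君', '漫', '步', '人', '生', '路', 'mp3', 'txt']
```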
Comment 12 2wxsy58236r3 2020-08-06 09:29:22 UTC
@logicinu

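Regarding the segmentation not being fine-grained enough: I think it is reasonable to expect Baloo to split “漫步人生路” character by character into “漫” “步” “人” “生” “路”.

But expecting Baloo to recognize the words within it, for example:
“漫步” “人生” “路”
“漫步” “人生路”

may not be technically feasible.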
Comment 13 logicinu 2020-08-06 09:37:11 UTC
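No matter how the text is segmented in this way, it cannot satisfy the needs of search. I must have been mistaken: it is not Baloo's fault. The problem is that the Dolphin file manager uses Baloo for search, and it should be suggested that Dolphin drop Baloo as its default search.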
Comment 14 Christoph Feck 2020-08-06 10:00:58 UTC
Detecting CJK words is possible with ICU: http://userguide.icu-project.org/boundaryanalysis
Comment 15 Stefan Brüns 2020-08-06 10:11:54 UTC
(In reply to Christoph Feck from comment #14)
> Detecting CJK words is possible with ICU:
> http://userguide.icu-project.org/boundaryanalysis

Baloo uses Qt's boundary finder, which in turn uses ICU.
Comment 16 Christoph Feck 2020-08-06 10:21:08 UTC
Nope, it doesn't. It only treats all characters that have the "wordBreak" property set.

Source at https://code.qt.io/cgit/qt/qtbase.git/tree/src/corelib/text/qtextboundaryfinder.cpp
Comment 17 2wxsy58236r3 2020-08-06 10:26:01 UTC
> ICU BreakIterators can be used to locate the following kinds of text boundaries:
> 1. Character Boundary
> 2. Word Boundary
> 3. Line-break Boundary
> 4. Sentence Boundary

For Chinese and Japanese, I believe "Character Boundary" is applicable but "Word Boundary" is not.

Since these two languages do not use spaces to separate words [1], I believe word segmentation [2] is difficult unless dictionaries or AI are used.

Links:
[1] https://en.wikipedia.org/wiki/Word_divider
[2] https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation
Comment 18 Christoph Feck 2020-08-06 10:31:21 UTC
ICU data includes a dictionary of more than 300,000 CJK words (4 MB) and is automatically handled when detecting a CJK script.
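
To illustrate the general idea of dictionary-based segmentation, here is a toy greedy longest-match sketch. It is not ICU's algorithm, and the dictionaries below are made-up three-entry examples rather than ICU data; it only shows how a word list lets a segmenter find boundaries in text written without spaces.

```python
def segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward longest-match segmentation (toy example).

    At each position, take the longest dictionary word that starts there;
    if none matches, fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical dictionaries; the result depends on which words are listed.
print(segment("漫步人生路", {"漫步", "人生", "人生路"}))  # ['漫步', '人生路']
print(segment("漫步人生路", {"漫步", "人生"}))            # ['漫步', '人生', '路']
```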
Comment 19 Bug Janitor Service 2020-08-27 08:14:57 UTC
A possibly relevant merge request was started @ https://invent.kde.org/frameworks/baloo/-/merge_requests/11
Comment 20 Weng Xuetian 2020-08-27 08:18:16 UTC
I created a PR at https://invent.kde.org/frameworks/baloo/-/merge_requests/11 that uses ICU. Would you mind giving it a test?
Comment 21 Weng Xuetian 2020-08-27 08:55:59 UTC
Hmm, apart from the place changed in the PR, it looks like there is another place that needs to be fixed: Baloo::QueryParser::parseQuery. It applies a similar algorithm, but has some extra code for handling quotes.

The current implementation does not generate any terms for Chinese or Japanese characters, which also leads to an empty result.
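
A minimal sketch of why the query side matters as much as the indexing side. The names here are hypothetical and this is not Baloo's implementation: ascii_only_terms merely stands in for a parser that yields no terms for Chinese input, and unicode_terms for one that does.

```python
import re


def unicode_terms(text: str) -> set[str]:
    # Hypothetical tokenizer that keeps CJK runs as terms.
    return {term.lower() for term in re.findall(r"[^\W_]+", text)}


def ascii_only_terms(text: str) -> set[str]:
    # Hypothetical stand-in for a parser that produces nothing for CJK input.
    return {term.lower() for term in re.findall(r"[A-Za-z0-9]+", text)}


def search(index: dict[str, set[str]], query: str, tokenize) -> list[str]:
    """Return the names whose stored terms contain every query term."""
    query_terms = tokenize(query)
    if not query_terms:
        return []  # no terms generated from the query: nothing can match
    return sorted(name for name, terms in index.items() if query_terms <= terms)


files = ["邓丽君 - 漫步人生路.mp3", "2NE1 - Come Back Home.mp3"]
index = {name: unicode_terms(name) for name in files}

print(search(index, "漫步人生路", ascii_only_terms))  # []
print(search(index, "漫步人生路", unicode_terms))     # ['邓丽君 - 漫步人生路.mp3']
```

Even with a correctly built index, the first call returns nothing because the query itself produces no terms.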
Comment 22 Weng Xuetian 2020-08-27 19:11:45 UTC
I added the fix for the query parser to the PR, so I think it's now ready to test.

You'll need to purge the old data to trigger the new code.
Right now I'm able to search Chinese content/file name quite easily.

E.g.

$ baloosearch 恋爱
/home/csslayer/Downloads/Telegram Desktop/恋爱循环歌词-中文+罗马音+日文.odt
/home/csslayer/Downloads/Telegram Desktop/恋爱循环歌词-中文+罗马音+日文.doc
Comment 23 Ismael Asensio 2020-10-16 18:54:52 UTC
*** Bug 421385 has been marked as a duplicate of this bug. ***