Bug 442385 - Chinese manual search term segmentation issues round 2
Summary: Chinese manual search term segmentation issues round 2
Status: RESOLVED FIXED
Alias: None
Product: krita
Classification: Applications
Component: Documentation
Version: git master (please specify the git hash!)
Platform: Other Other
Importance: NOR normal
Target Milestone: ---
Assignee: Alvin Wong
 
Reported: 2021-09-13 14:07 UTC by Alvin Wong
Modified: 2021-09-20 14:34 UTC



Description Alvin Wong 2021-09-13 14:07:37 UTC
I noticed that searching does not work with the term "数位板" in zh_CN ("數位板" in zh_TW), likely because the dictionary that Jieba uses does not contain this term, so "数位" is indexed as a term and "板" is dropped.

We need a way to add custom terms to the Jieba dictionary for the manual build.
Comment 1 Bug Janitor Service 2021-09-15 08:00:59 UTC
A possibly relevant merge request was started @ https://invent.kde.org/documentation/docs-krita-org/-/merge_requests/254
Comment 2 Bug Janitor Service 2021-09-15 08:01:01 UTC
A possibly relevant merge request was started @ https://invent.kde.org/documentation/docs-krita-org/-/merge_requests/254
Comment 3 Alvin Wong 2021-09-15 16:07:43 UTC
Git commit 84c73ecb341e3c765ddb58b74642dc2453b52df4 by Alvin Wong.
Committed on 14/09/2021 at 14:35.
Pushed by alvinwong into branch 'master'.

Add Jieba user dict for zh_CN and zh_TW

This makes the term for "tablet" in Chinese searchable.

M  +10   -0    conf.py
A  +1    -0    jieba-dict-zh_CN.txt
A  +1    -0    jieba-dict-zh_TW.txt

https://invent.kde.org/documentation/docs-krita-org/commit/84c73ecb341e3c765ddb58b74642dc2453b52df4
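
The contents of the two new dictionary files are not shown in this thread. Jieba user dictionaries are UTF-8 text files with one entry per line (the term, optionally followed by a frequency and a part-of-speech tag, separated by spaces). A minimal sketch, assuming the zh_CN file holds just the tablet term from the description; `write_user_dict` is a hypothetical helper, not part of the commit:

```python
import os
import tempfile


def write_user_dict(path, terms):
    """Write one term per line, UTF-8 encoded (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for term in terms:
            handle.write(term + "\n")


# Write into a temporary directory so the sketch does not touch a real checkout.
dict_dir = tempfile.mkdtemp()
dict_path = os.path.join(dict_dir, "jieba-dict-zh_CN.txt")
write_user_dict(dict_path, ["数位板"])  # assumed content: the "tablet" term

with open(dict_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
```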
Comment 4 Alvin Wong 2021-09-15 16:08:06 UTC
Git commit 9fe299135461d57964547b52ad2b27a093f83d45 by Alvin Wong.
Committed on 15/09/2021 at 16:08.
Pushed by alvinwong into branch 'krita/5.0'.

Add Jieba user dict for zh_CN and zh_TW

This makes the term for "tablet" in Chinese searchable.


(cherry picked from commit 84c73ecb341e3c765ddb58b74642dc2453b52df4)

M  +10   -0    conf.py
A  +1    -0    jieba-dict-zh_CN.txt
A  +1    -0    jieba-dict-zh_TW.txt

https://invent.kde.org/documentation/docs-krita-org/commit/9fe299135461d57964547b52ad2b27a093f83d45
Comment 5 Alvin Wong 2021-09-16 07:08:19 UTC
Git commit 7a805c65d8333c394f705924bdc06aa012edcfdc by Alvin Wong.
Committed on 16/09/2021 at 07:07.
Pushed by alvinwong into branch 'master'.

Use absolute path to Jieba user dict

This should fix Jieba dict usage on binary-factory.

M  +6    -2    conf.py

https://invent.kde.org/documentation/docs-krita-org/commit/7a805c65d8333c394f705924bdc06aa012edcfdc
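
The diff for this commit is not shown in this thread. As a rough sketch of the idea, assuming the dictionary is passed to Sphinx through the `dict` key of `html_search_options`, the path can be built from the directory that contains conf.py instead of relying on the build's working directory. The helper name `jieba_search_options` is hypothetical:

```python
import os


def jieba_search_options(conf_dir, language):
    """Build Sphinx search options with an absolute user-dict path.

    Hypothetical helper: the file naming follows the commits in this
    bug (jieba-dict-<language>.txt next to conf.py).
    """
    dict_name = "jieba-dict-{}.txt".format(language)
    # Resolve to an absolute path so later lookups do not depend on
    # whatever the working directory happens to be.
    dict_path = os.path.join(os.path.abspath(conf_dir), dict_name)
    return {"dict": dict_path}


# In conf.py this might be used as (an assumption, not the committed code):
#   html_search_options = jieba_search_options(os.path.dirname(__file__), language)
options = jieba_search_options(".", "zh_CN")
```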
Comment 6 Alvin Wong 2021-09-16 07:08:38 UTC
Git commit bc03e95d3c9bbf910318bffa9d7a7ed0849ceab4 by Alvin Wong.
Committed on 16/09/2021 at 07:08.
Pushed by alvinwong into branch 'krita/5.0'.

Use absolute path to Jieba user dict

This should fix Jieba dict usage on binary-factory.


(cherry picked from commit 7a805c65d8333c394f705924bdc06aa012edcfdc)

M  +6    -2    conf.py

https://invent.kde.org/documentation/docs-krita-org/commit/bc03e95d3c9bbf910318bffa9d7a7ed0849ceab4
Comment 7 Tyson Tan 2021-09-17 01:39:08 UTC
I have a question:
Currently, of the Chinese translations for opacity (不透明度) and transparency (透明度), only 透明度 is being recognized. How do we handle this? If I add 不透明度 to the dictionary, will it make 透明度 fail?
Comment 8 Alvin Wong 2021-09-17 15:16:27 UTC
(In reply to Tyson Tan from comment #7)
> I have a question:
> Currently, of the Chinese translations for opacity (不透明度) and transparency
> (透明度), only 透明度 is being recognized. How do we handle this? If I add 不透明度
> to the dictionary, will it make 透明度 fail?

It depends, but the best way to tell is to build and test it locally. In theory, since Sphinx uses the `cut_for_search` method, it can produce multiple overlapping terms from a phrase. That is, if both "透明度" and "不透明度" exist in the dictionary, it should produce both terms for indexing. You may explicitly add both terms to the user dictionary to be on the safe side.
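
To illustrate the overlapping-terms behaviour described above, here is a toy sketch in plain Python. It is not Jieba's code: `expand_for_search` and `vocabulary` are hypothetical names, and the sketch assumes the first segmentation pass already returns "不透明度" as a single word (which is what adding it to the user dictionary is meant to achieve). It then emits every 2- and 3-character substring found in the vocabulary, followed by the word itself.

```python
def expand_for_search(words, vocabulary):
    """Expand already-segmented words into overlapping search terms.

    Toy model only: for each word, emit the shorter substrings that are
    known vocabulary entries, then the whole word.
    """
    terms = []
    for word in words:
        for size in (2, 3):
            # Only look for sub-terms strictly shorter than the word.
            if len(word) > size:
                for start in range(len(word) - size + 1):
                    gram = word[start:start + size]
                    if gram in vocabulary:
                        terms.append(gram)
        terms.append(word)
    return terms


# Assumed vocabulary: both terms from the question plus a shorter one.
vocabulary = {"透明", "透明度", "不透明度"}
terms = expand_for_search(["不透明度"], vocabulary)
print(terms)
```

With a real build, the equivalent check would be to load the user dictionary with `jieba.load_userdict` and inspect what `jieba.cut_for_search` returns for the phrase in question.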
Comment 9 Tyson Tan 2021-09-17 16:09:19 UTC
Thanks for the explanation!
Soooo... is there no downside for being excessive on the terms?
And is there a way to add comments to the dictionary?
Comment 10 Alvin Wong 2021-09-20 14:34:39 UTC
(In reply to Tyson Tan from comment #9)
> Thanks for the explanation!
> Soooo... is there no downside for being excessive on the terms?

I can't answer that. The algorithm of Jieba is mostly unknown to me.

> And is there a way to add comments to the dictionary?

I don't think there is any.