Bug 440246 - Chinese manual search term segmentation issues
Summary: Chinese manual search term segmentation issues
Status: RESOLVED FIXED
Alias: None
Product: krita
Classification: Applications
Component: Documentation (other bugs)
Version First Reported In: nightly build (please specify the git hash!)
Platform: Other Other
Importance: NOR normal
Target Milestone: ---
Assignee: Krita Bugs
Reported: 2021-07-25 06:05 UTC by Alvin Wong
Modified: 2021-08-25 14:43 UTC

Description Alvin Wong 2021-07-25 06:05:58 UTC
(In reply to Tyson Tan from bug 419321)
> Actually, I don't think the current Search box is working properly for CJK
> languages yet. For example, it can search "笔刷" and return some results, but
> searching "笔刷预设" returns nothing. There must be some issues with the word
> dividing logic.

- `jieba` split it into two separate terms "笔刷" and "预设" in the search index.
- The client-side search code can only split terms by whitespace, so "笔刷预设" (or any run of consecutive CJK characters, for that matter) is treated as one term.
- The searching code probably only finds exact matches.
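
The three points above can be sketched as follows. This is only an illustration, not the actual `searchtools.js` code: `indexTerms`, `splitQuery`, `lookup` and the page names are hypothetical, and the exact-match lookup is the assumption stated in the last bullet.

```javascript
// Hypothetical, simplified stand-in for the generated search index:
// jieba has already segmented the page text into separate terms.
const indexTerms = new Map([
  ["笔刷", ["brush_basics"]],
  ["预设", ["brush_presets"]],
]);

// Client-side splitting on whitespace only. CJK text has no spaces,
// so a whole phrase stays in one piece.
function splitQuery(query) {
  return query.split(/\s+/).filter((term) => term.length > 0);
}

// Assumed exact-match lookup: a term is either a key of the index
// or it finds nothing.
function lookup(term) {
  return indexTerms.get(term) || [];
}

console.log(splitQuery("笔刷").map(lookup));     // one term, present in the index
console.log(splitQuery("笔刷预设").map(lookup)); // one term, absent from the index
```

Under these assumptions the second query stays a single term that matches no index key, which is consistent with the empty result reported above.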

(In case you are interested, you can check the generated index at [1] -- paste its contents into a JS beautifier [2] and enable "Unescape printable chars encoded as \xNN or \uNNNN".)

There are ways to provide `jieba` with custom dictionary terms and segmentation rules (check its readme [3] for more info). I think we can initialize them from `conf.py` if you would like to add some.

However, *if* we make "笔刷预设" a full term in the index, then it seems likely that the search term "笔刷" will not be able to yield the results indexed with the term "笔刷预设", which might be worse than it currently is.

We can probably see if there are any improvements in the upstream `sphinx_rtd_theme` search code to be backported, but most likely we will have to hack together something for matching search terms to get the behaviour we want. 

[1]: https://docs.krita.org/zh_CN/searchindex.js
[2]: https://beautifier.io/
[3]: https://github.com/fxsjy/jieba
Comment 1 Alvin Wong 2021-08-25 14:18:33 UTC
Git commit f67b09964d2ba0547bdfa5e73fa11ddcb0c98fe6 by Alvin Wong.
Committed on 25/08/2021 at 14:10.
Pushed by alvinwong into branch 'master'.

Try to split search term into smaller parts for zh and ja

When Sphinx generates the search index, terms get split into the
smallest logical parts; for example, "笔刷预设介绍" is split into
three individual terms: "笔刷", "预设" and "介绍". The search page
JavaScript does not know how to do segmentation (which would not be
feasible anyway, since it needs a dictionary). Therefore we add extra
logic that attempts to further split the search terms according to the
terms available in the search index, to make the search function more
useful for Chinese and Japanese.

This logic requires every part of the search term to be an existing
term in the index, as in "笔刷预设介绍". If the search term is instead
"笔刷预设道路" and "道路" does not exist in the index, the split is
not applied and the search yields no results.
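
A minimal sketch of the idea described in this commit message, not the committed code itself: the function name `splitByIndexTerms`, the `knownTerms` set and the longest-prefix-first backtracking strategy are all assumptions made for illustration.

```javascript
// Hypothetical set of terms present in the search index.
const knownTerms = new Set(["笔刷", "预设", "介绍"]);

// Try to cut `query` into pieces that are all known index terms.
// Returns the list of pieces, or null when no complete split exists.
// Slicing works on UTF-16 code units, which is fine for the BMP
// characters used here.
function splitByIndexTerms(query, terms) {
  if (query.length === 0) {
    return [];
  }
  // Try longer prefixes first; backtrack if the rest cannot be split.
  for (let end = query.length; end > 0; end--) {
    const head = query.slice(0, end);
    if (!terms.has(head)) {
      continue;
    }
    const rest = splitByIndexTerms(query.slice(end), terms);
    if (rest !== null) {
      return [head, ...rest];
    }
  }
  return null;
}

console.log(splitByIndexTerms("笔刷预设介绍", knownTerms)); // every part is known
console.log(splitByIndexTerms("笔刷预设道路", knownTerms)); // "道路" is unknown
```

In this sketch a `null` result corresponds to the case where the split is not applied, so the original unsplit term would be searched as before.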

M  +52   -0    theme/static/searchtools.js_t

https://invent.kde.org/documentation/docs-krita-org/commit/f67b09964d2ba0547bdfa5e73fa11ddcb0c98fe6
Comment 2 Alvin Wong 2021-08-25 14:19:13 UTC
Git commit 2d63f23f4b21a182812fda8755ca94a58e661c01 by Alvin Wong.
Committed on 25/08/2021 at 14:19.
Pushed by alvinwong into branch 'krita/5.0'.

Try to split search term into smaller parts for zh and ja

When Sphinx generates the search index, terms get split into the
smallest logical parts; for example, "笔刷预设介绍" is split into
three individual terms: "笔刷", "预设" and "介绍". The search page
JavaScript does not know how to do segmentation (which would not be
feasible anyway, since it needs a dictionary). Therefore we add extra
logic that attempts to further split the search terms according to the
terms available in the search index, to make the search function more
useful for Chinese and Japanese.

This logic requires every part of the search term to be an existing
term in the index, as in "笔刷预设介绍". If the search term is instead
"笔刷预设道路" and "道路" does not exist in the index, the split is
not applied and the search yields no results.


(cherry picked from commit f67b09964d2ba0547bdfa5e73fa11ddcb0c98fe6)

M  +52   -0    theme/static/searchtools.js_t

https://invent.kde.org/documentation/docs-krita-org/commit/2d63f23f4b21a182812fda8755ca94a58e661c01
Comment 3 Alvin Wong 2021-08-25 14:43:11 UTC
Git commit f1068941ba2b3918630b2e03f99bdb92864b683c by Alvin Wong.
Committed on 25/08/2021 at 14:41.
Pushed by alvinwong into branch 'master'.

Add the split search terms (zh & ja) to be highlighted

M  +4    -2    theme/static/searchtools.js_t

https://invent.kde.org/documentation/docs-krita-org/commit/f1068941ba2b3918630b2e03f99bdb92864b683c
Comment 4 Alvin Wong 2021-08-25 14:43:57 UTC
Git commit f8dfe32226a6258a916c3dbc600928001ac96a3e by Alvin Wong.
Committed on 25/08/2021 at 14:43.
Pushed by alvinwong into branch 'krita/5.0'.

Add the split search terms (zh & ja) to be highlighted


(cherry picked from commit f1068941ba2b3918630b2e03f99bdb92864b683c)

M  +4    -2    theme/static/searchtools.js_t

https://invent.kde.org/documentation/docs-krita-org/commit/f8dfe32226a6258a916c3dbc600928001ac96a3e