Bug 433924 - regular expressions unicode support
Status: RESOLVED FIXED
Alias: None
Product: kate
Classification: Applications
Component: search
Version: 20.12.2
Platform: Other Microsoft Windows
Importance: NOR normal
Target Milestone: ---
Assignee: KWrite Developers
Reported: 2021-03-03 20:58 UTC by peter.verkinderen
Modified: 2021-03-08 08:20 UTC
Version Fixed In: 5.81.0

Description peter.verkinderen 2021-03-03 20:58:49 UTC
SUMMARY

Regular expressions in Kate's search do not handle non-Western languages well, or indeed any language that uses non-ASCII letters.

For example, /\w/ matches only ASCII letters (plus underscore), not all Unicode letters as it does in most modern programming languages and editors. Likewise, /\d/ covers only ASCII digits, not digits in other scripts. This makes it very hard to write a regex for languages with non-Latin scripts, or even for languages like French that use a fair number of non-ASCII characters.

POSIX character classes are implemented, but they also support only ASCII characters. It would be great if Unicode classes/properties were supported.



STEPS TO REPRODUCE
1. Write any text that contains non-ASCII letters, e.g., "café  القهوة"
2. Try to match words using /\w+/

(Obviously a very simplistic example, but imagine writing any regex without being able to rely on \w, \d, or [a-zA-Z].)

OBSERVED RESULT

Only the letters `caf` are matched.

EXPECTED RESULT

`café` and `القهوة` should be matched
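
The observed and expected results above can be reproduced outside Kate. This is a minimal Python sketch, not Kate's code. It assumes that Python's `re.ASCII` flag is a fair stand-in for an engine whose `\w` is ASCII-only, and that Python's default matching on `str` is a fair stand-in for the Unicode-aware behaviour requested here.

```python
import re

# Sample text from the steps to reproduce.
text = "café  القهوة"

# ASCII-only word class: roughly the behaviour described under OBSERVED RESULT.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware word class: roughly the behaviour described under EXPECTED RESULT.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # ['caf']
print(unicode_words)  # ['café', 'القهوة']
```

The ASCII-only run stops at `é` and never starts inside the Arabic word, which matches the `caf` result reported above.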

Comment 1 Christoph Cullmann 2021-03-06 21:56:32 UTC
Yeah, it seems we forgot to set a required flag during the QRegExp => QRegularExpression port.

https://invent.kde.org/frameworks/ktexteditor/-/issues/10
Comment 2 peter.verkinderen 2021-03-07 07:55:34 UTC
Thanks for the quick action!
Comment 3 Kåre Särs 2021-03-07 11:10:14 UTC
Git commit 675eaa6eebdbdf5437b7d150ae907283cb6ccb81 by Kåre Särs.
Committed on 07/03/2021 at 09:38.
Pushed by cullmann into branch 'master'.

S&R: Add UseUnicodePropertiesOption to regexps

To make regular expressions work properly with Unicode add
UseUnicodePropertiesOption option

(Search & Replace plugin)
Related: bug 433673

M  +4    -2    addons/search/plugin_search.cpp

https://invent.kde.org/utilities/kate/commit/675eaa6eebdbdf5437b7d150ae907283cb6ccb81
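
As a rough illustration of what the option changes (this is not the code from the commit): Qt documents `QRegularExpression::UseUnicodePropertiesOption` as switching `\w`, `\d` and related classes from ASCII-only matching to Unicode-property matching. The Python sketch below mimics that switch with `re.ASCII`. The helper `compile_search_pattern` and its `unicode_properties` parameter are hypothetical names made up for this example.

```python
import re


def compile_search_pattern(pattern: str, unicode_properties: bool = True) -> re.Pattern:
    """Hypothetical helper: compile a pattern, Unicode-aware unless told otherwise."""
    # re.ASCII restricts \w, \d, \s and friends to ASCII, which is used here
    # as an analogy for a regex built without the Unicode-properties option.
    flags = 0 if unicode_properties else re.ASCII
    return re.compile(pattern, flags)


# "room 42, " followed by the Arabic-Indic digits four and two (U+0664, U+0662).
sample = "room 42, \u0664\u0662"

digits_before = compile_search_pattern(r"\d+", unicode_properties=False).findall(sample)
digits_after = compile_search_pattern(r"\d+").findall(sample)

print(digits_before)  # only the ASCII run "42"
print(digits_after)   # the ASCII run and the Arabic-Indic run
```

Defaulting `unicode_properties` to `True` mirrors the intent of the fix: callers get Unicode-aware classes unless they explicitly opt out.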
Comment 4 Christoph Cullmann 2021-03-07 13:57:29 UTC
Fixes in KTextEditor are there now, too:

https://invent.kde.org/frameworks/ktexteditor/commit/fb35c7fd42ec6576121f3dc8cb59896133c4e433

Thanks for your report!

We really just missed that; we now have a unit test for it, too.
Comment 5 peter.verkinderen 2021-03-07 18:10:09 UTC
Impressive speed and teamwork on this issue - on a Sunday! Thank you very much to all involved!

Will these changes be reflected in the nightly installer for Windows? Or does it take a while before they are included?
Comment 6 Christoph Cullmann 2021-03-08 08:20:52 UTC
The next Frameworks release, 5.81, will contain the fixes to the part; that means Binary Factory will have them in a bit over a month.

The application fixes should be directly visible in the nightly builds there.