Bug 433467

Summary: doxygen.xml [and others] have invalid "[]" in regex
Product: [Frameworks and Libraries] frameworks-syntax-highlighting
Reporter: Gene Thomas <gene>
Component: syntax
Assignee: KWrite Developers <kwrite-bugs-null>
Status: RESOLVED NOT A BUG
Severity: normal
CC: jonathan.poelen, walter.von.entferndt
Priority: NOR    
Version First Reported In: unspecified   
Target Milestone: ---   
Platform: Other   
OS: Other   

Description Gene Thomas 2021-02-23 06:58:25 UTC
SUMMARY

[] appears in regexes. That denotes a single character, but with nothing between the [ and ] there is no character it is allowed to match. The ICU regex engine I am using rejects this.

STEPS TO REPRODUCE
1. Read doxygen.xml
2. It declares an entity wordsep as "(?:[][,?;()]|\.$|\.?\s)"
3. This entity is used in RegExpr rules

OBSERVED RESULT

The file is accepted as valid.

EXPECTED RESULT

This should be reported as an error and the .xml file corrected.

SOFTWARE/OS VERSIONS

head of https://github.com/KDE/syntax-highlighting

ADDITIONAL INFORMATION
Comment 1 Jonathan Poelen 2021-02-27 23:57:39 UTC
[]] is valid in PCRE (the regex engine used): a ] as the first character of a character class is a literal and does not close the class (the same applies to [^]]).

ICU regex does not seem to support all PCRE syntax. For example, it lacks (?|...) and \R, which are also used.
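
The leading-] rule described above can be checked with a short sketch. It uses Python's re module as a stand-in, on the assumption that it treats a leading ] in a character class the same way PCRE does; it is not the engine used by syntax-highlighting, and the helper name is_wordsep is made up for illustration. The pattern is the wordsep entity quoted in the report.

```python
import re

# Entity value quoted in the report (the "wordsep" entity in doxygen.xml).
WORDSEP = re.compile(r"(?:[][,?;()]|\.$|\.?\s)")


def is_wordsep(text: str) -> bool:
    """Hypothetical helper: True if the whole of `text` is one word separator."""
    return WORDSEP.fullmatch(text) is not None


# A leading "]" is a literal member of the class, so "[]]" matches "]".
assert re.fullmatch(r"[]]", "]")

# The same holds after a negation: "[^]]" matches any character except "]".
assert re.fullmatch(r"[^]]", "a")
assert re.fullmatch(r"[^]]", "]") is None

# Every character listed in the class is accepted, both brackets included.
assert all(is_wordsep(ch) for ch in "][,?;()")
assert not is_wordsep("a")
```

Read this way, [][,?;()] is a single class containing ] [ , ? ; ( and ), not an empty class followed by more pattern text.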
Comment 2 Gene Thomas 2021-03-03 00:00:22 UTC
Thanks, I've switched from ICU to PCRE, which is much faster. Part of the problem is that ICU jumps through hoops to be correct. For example, in German the case-insensitive regex "^ẞ$" matches "SS" [2 code points]; no other regex implementation that I have seen does this. ICU was getting into an internal infinite loop and throwing a "regex out of stack space" error after 0.5 sec, many times over, which made a .sh file take 30 seconds to syntax highlight!
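
The ẞ example rests on Unicode full case folding, under which one code point can fold to two. A minimal sketch in Python illustrates only that folding rule; it does not reproduce ICU's matcher or its stack-space error, and caseless_equal is a hypothetical helper name.

```python
import re

SHARP_S = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S


def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings under full case folding."""
    return a.casefold() == b.casefold()


# Full case folding maps the single code point ẞ to the two characters "ss".
assert SHARP_S.casefold() == "ss"
assert caseless_equal(SHARP_S, "SS")

# Python's re matches a literal against exactly one character, so a
# one-code-point pattern cannot match the two-code-point string "SS".
ss_match = re.fullmatch(SHARP_S, "SS", flags=re.IGNORECASE)
assert ss_match is None
```

An engine that honors full case folding has to consider matches of differing lengths for a single pattern character, which is one plausible source of the extra work described in the comment.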