Bug 303654

Summary: Optimize RegExpCache
Product: nepomuk Reporter: Vishesh Handa <me>
Component: generalAssignee: Nepomuk Bugs Coordination <nepomuk-bugs>
Status: RESOLVED UNMAINTAINED    
Severity: task CC: lukiasz
Priority: NOR    
Version: git master   
Target Milestone: ---   
Platform: unspecified   
OS: Linux   
Latest Commit: Version Fixed In:
Attachments: Patch includes a new RegExpCache class, can be applied to master branch.

Description Vishesh Handa 2012-07-17 09:07:13 UTC
The RegExpCache is a list of regular expressions. It works by iterating over all the regular expressions and checking if any matched. Most of the regular expressions are generally file urls.

Find a way to combine the regular expressions together in order to avoid iterating over the entire list.

Relevant code: nepomuk-core/common/regexpcache.*
Priority: Low

Ideally one should write some benchmarks to see if there is any actual speedup.
Comment 1 Lukasz Olender 2013-04-07 10:41:11 UTC
Created attachment 78701 [details]
Patch includes a new RegExpCache class, can be applied to master branch.

Proposed solution transforms all filters from wildcard syntax to full regexp syntax and combines them into one big expression. For example this five filters: "*.pyc, *.po, *.tmp, *.orig, autom4te" will be converted to: "(.*\.(pyc|po|tmp|orig)|autom4te)".

For default nepomuk filters and almost all files on my computer (fresh opensuse distribution with kde sources), testing one big regexp is about five times faster than comparing file names with every filter. For small amount of filters (about 4) old regexp cache is noticeably faster, so in that case this approach is in use.

Provided patch is tested and works well with full wildcard syntax ('?', "*" and "[ ]" operators). Special full regexp characters not supported in wildcard format are treated like normal characters and escaped properly. Just in case if something went wrong and output regexp is not valid, it switch to old behaviour and logs error to kDebug.

Isolated testcases can be downloaded from http://treadstone.sh.dug.net.pl/regexp_cache.zip . I've also included a small list of files (about 2.8MB), but if you want to get more valuable speedup result, you should provide much bigger list (I've tested it with 25MB file).
Comment 2 Vishesh Handa 2013-04-08 20:18:00 UTC
Hey Lucas

Wow! This is quite amazing. However, I think it would be better if you uploaded the patch on the reviewboard. That way the other developers can also have a look at it as well.

It'll be a little hard, but it would be nice to commit your tests as well. That way everyone will know why this version is faster.

Do you want any help in organizing the files properly?
Comment 3 Lukasz Olender 2013-04-09 07:25:42 UTC
Hi Vishesh,

Yes, I surely need some informations about those tests. How should they look like and what should be compared? I can also make some kind of unit tests and put them somewhere in repo. Also, part of code responsible for regexp generation is quite complicated and someone might need some time to get involved into - should I put more inline comments to make itmore understandable at first glance?

Reviewboard's url for nepomuk is: https://git.reviewboard.kde.org?
Comment 4 Vishesh Handa 2015-01-23 16:25:07 UTC
Thank you for taking the time to file a bug report.

The Nepomuk project is no longer included in the KDE Software Compilation. With Plasma 5, we have replaced most of the underlying technology with Baloo and other components. Hopefully this will have addressed your concern.

We encourage you to try to Plasma 5 (+Baloo) and let us know if your problem persists.