Bug 303654 - Optimize RegExpCache
Summary: Optimize RegExpCache
Status: RESOLVED UNMAINTAINED
Alias: None
Product: nepomuk
Classification: Miscellaneous
Component: general (show other bugs)
Version: git master
Platform: unspecified Linux
: NOR task
Target Milestone: ---
Assignee: Nepomuk Bugs Coordination
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-17 09:07 UTC by Vishesh Handa
Modified: 2015-01-23 16:25 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments
Patch includes a new RegExpCache class, can be applied to master branch. (11.50 KB, patch)
2013-04-07 10:41 UTC, Lukasz Olender
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Vishesh Handa 2012-07-17 09:07:13 UTC
The RegExpCache is a list of regular expressions. It works by iterating over all the regular expressions and checking if any matched. Most of the regular expressions are generally file urls.

Find a way to combine the regular expressions together in order to avoid iterating over the entire list.

Relevant code: nepomuk-core/common/regexpcache.*
Priority: Low

Ideally one should write some benchmarks to see if there is any actual speedup.
Comment 1 Lukasz Olender 2013-04-07 10:41:11 UTC
Created attachment 78701 [details]
Patch includes a new RegExpCache class, can be applied to master branch.

Proposed solution transforms all filters from wildcard syntax to full regexp syntax and combines them into one big expression. For example this five filters: "*.pyc, *.po, *.tmp, *.orig, autom4te" will be converted to: "(.*\.(pyc|po|tmp|orig)|autom4te)".

For default nepomuk filters and almost all files on my computer (fresh opensuse distribution with kde sources), testing one big regexp is about five times faster than comparing file names with every filter. For small amount of filters (about 4) old regexp cache is noticeably faster, so in that case this approach is in use.

Provided patch is tested and works well with full wildcard syntax ('?', "*" and "[ ]" operators). Special full regexp characters not supported in wildcard format are treated like normal characters and escaped properly. Just in case if something went wrong and output regexp is not valid, it switch to old behaviour and logs error to kDebug.

Isolated testcases can be downloaded from http://treadstone.sh.dug.net.pl/regexp_cache.zip . I've also included a small list of files (about 2.8MB), but if you want to get more valuable speedup result, you should provide much bigger list (I've tested it with 25MB file).
Comment 2 Vishesh Handa 2013-04-08 20:18:00 UTC
Hey Lucas

Wow! This is quite amazing. However, I think it would be better if you uploaded the patch on the reviewboard. That way the other developers can also have a look at it as well.

It'll be a little hard, but it would be nice to commit your tests as well. That way everyone will know why this version is faster.

Do you want any help in organizing the files properly?
Comment 3 Lukasz Olender 2013-04-09 07:25:42 UTC
Hi Vishesh,

Yes, I surely need some informations about those tests. How should they look like and what should be compared? I can also make some kind of unit tests and put them somewhere in repo. Also, part of code responsible for regexp generation is quite complicated and someone might need some time to get involved into - should I put more inline comments to make itmore understandable at first glance?

Reviewboard's url for nepomuk is: https://git.reviewboard.kde.org?
Comment 4 Vishesh Handa 2015-01-23 16:25:07 UTC
Thank you for taking the time to file a bug report.

The Nepomuk project is no longer included in the KDE Software Compilation. With Plasma 5, we have replaced most of the underlying technology with Baloo and other components. Hopefully this will have addressed your concern.

We encourage you to try to Plasma 5 (+Baloo) and let us know if your problem persists.