Bug 339908 - baloo_file_extractor ignoring files it should not ignore because of regexp
Status: RESOLVED FIXED
Alias: None
Product: Baloo
Classification: Unmaintained
Component: Baloo File Daemon (other bugs)
Version First Reported In: 5.0.1
Platform: Compiled Sources Linux
Importance: NOR major
Target Milestone: ---
Assignee: Dominik Cermak
Reported: 2014-10-12 18:10 UTC by Dominik Cermak
Modified: 2014-10-15 12:05 UTC (History)

Version Fixed In: 5.1.1

Description Dominik Cermak 2014-10-12 18:10:13 UTC
I discovered that I couldn't find some of my videos with baloo because they weren't indexed.
I ran baloo_file_extractor on one of the files in Konsole, and it told me
"<file> should not be indexed. Ignoring". Looking at the source code, I found the reason:

commit 282c8dff201d19fd6dbaf42a07cb561b644c5b18
Author: Vishesh Handa <me@vhanda.in>
Date:   Tue Jun 17 16:04:49 2014 +0200

    RegExpCache: Use 'QRegularExpression' instead of "QRegExp"
    
    This results in a performance increase of almost 10x. This is especially
    important because with this we will now consume less cpu when checking
    which files should be indexed, and we will be faster.

The problem with QRegularExpression is that it doesn't support wildcards (see http://qt-project.org/doc/qt-5/qregularexpression.html#wildcard-matching), so the exclude filters now match far too much.
Example: The exclude filters contain "*.o". This was fine with QRegExp, because there it meant "every file/folder ending with .o". As a regular expression, however, the dot is no longer a literal character, so the filter means "any character (except newline) zero or more times, followed by o". As a result, every file/folder ending with o is ignored (which is the case for my videos).
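
The over-matching can be reproduced outside of Baloo. The following is a minimal sketch, not Baloo's code: it uses std::regex as a stand-in for QRegularExpression, assumes the filter is turned into a regular expression by expanding only '*' and '?', and assumes the whole file name has to match. The helper names naiveWildcardToRegex and isIgnored are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: expand only the wildcard characters '*' and '?'.
// The dot is copied unchanged, so it stays a regex metacharacter.
std::string naiveWildcardToRegex(const std::string& wildcard) {
    std::string out;
    for (char c : wildcard) {
        if (c == '*') {
            out += ".*";   // any run of characters
        } else if (c == '?') {
            out += '.';    // exactly one character
        } else {
            out += c;      // everything else, including '.', as-is
        }
    }
    return out;
}

// Hypothetical helper: true if the whole file name matches the filter.
bool isIgnored(const std::string& fileName, const std::string& wildcard) {
    const std::regex re(naiveWildcardToRegex(wildcard));
    return std::regex_match(fileName, re);
}

// Small check of the behaviour described in this report.
void checkOverMatching() {
    assert(isIgnored("main.o", "*.o"));          // intended: object file
    assert(isIgnored("holiday_video", "*.o"));   // unintended: merely ends with "o"
    assert(!isIgnored("notes.txt", "*.o"));      // does not end with "o"
}
```

Under these assumptions "*.o" becomes ".*.o", which accepts any name of at least two characters that ends with o.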

So we could either revert to QRegExp or change the exclude filters into correct regular expressions. What's your opinion, Vishesh?

Reproducible: Always

Steps to Reproduce:
1. Create a file or a folder ending with o
2. Try to index the file with baloo_file_extractor

Actual Results:  
Because the last character is o it matches the exclude filter part "*.o" and is ignored.

Expected Results:  
It should get indexed.
Comment 1 Dominik Cermak 2014-10-13 07:30:35 UTC
Please see https://git.reviewboard.kde.org/r/120570/ for my proposed fix.
Comment 2 Dominik Cermak 2014-10-15 12:05:46 UTC
Git commit 863ccc6f7901528338efabfef78098fc72cbd94f by Dominik Cermak.
Committed on 14/10/2014 at 11:36.
Pushed by cermak into branch 'Plasma/5.1'.

Escape dots in exclude filters

In regular expressions a dot (.) matches any character (except newline)
but in the exclude filters we use wildcard syntax (*) and want a dot (.)
to be interpreted as a character.

Example: With "*.o" in the exclude filters the user expects object
files (ending with .o) to be excluded. Without escaping this would match
every file and folder ending with o though. This is the case for all
entries of that form in the exclude filters ("*.moc", "*.la", etc.)

So just escape every dot we find in exclude filters with a backslash
while building the regexp.
FIXED-IN: 5.1.1
REVIEW: 120570

M  +1    -0    src/file/regexpcache.cpp

http://commits.kde.org/baloo/863ccc6f7901528338efabfef78098fc72cbd94f
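
The idea behind the commit above, escaping every dot before the wildcard characters are expanded, can be sketched as follows. This is an illustration under stated assumptions and not the actual change to src/file/regexpcache.cpp: std::regex stands in for QRegularExpression, whole-name matching is assumed, and the helper names wildcardToRegex and matchesFilter are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: translate an exclude filter into a regular expression,
// escaping the dot so that it only matches a literal '.'.
std::string wildcardToRegex(const std::string& wildcard) {
    std::string out;
    for (char c : wildcard) {
        switch (c) {
        case '.': out += "\\."; break;  // literal dot
        case '*': out += ".*";  break;  // any run of characters
        case '?': out += '.';   break;  // exactly one character
        default:  out += c;     break;
        }
    }
    return out;
}

// Hypothetical helper: true if the whole file name matches the filter.
bool matchesFilter(const std::string& fileName, const std::string& wildcard) {
    const std::regex re(wildcardToRegex(wildcard));
    return std::regex_match(fileName, re);
}

// Small check of the behaviour the commit message describes.
void checkEscapedFilters() {
    assert(matchesFilter("main.o", "*.o"));          // object file is excluded
    assert(!matchesFilter("holiday_video", "*.o"));  // no longer excluded
    assert(matchesFilter("libfoo.la", "*.la"));      // same for other entries
}
```

With the dot escaped, "*.o" becomes ".*\.o", so a name has to end with the two characters ".o" to be excluded. The sketch escapes only the dot, as the commit message describes; other regex metacharacters in a filter would still need separate handling.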