Bug 100107 - backward searching a regexp does not work properly
Summary: backward searching a regexp does not work properly
Status: RESOLVED WORKSFORME
Alias: None
Product: kate
Classification: Applications
Component: general (show other bugs)
Version: unspecified
Platform: unspecified Linux
: NOR normal
Target Milestone: ---
Assignee: KWrite Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-02-23 19:11 UTC by Jeroen Wijnhout
Modified: 2005-02-26 15:06 UTC (History)
1 user (show)

See Also:
Latest Commit:
Version Fixed In:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jeroen Wijnhout 2005-02-23 19:11:05 UTC
Version:           2.4 (using KDE 3.3.89 (CVS >= 20041129), compiled sources)
Compiler:          gcc version 3.3.3 (SuSE Linux)
OS:                Linux (i686) release 2.6.5-7.111.30-default

Type something like

\begin{document}

and search backwards for the regular expression 
'\\begin\{([A-Za-z]+)\}'. 

The LaTeX environment will be found, when it starts in column 0. In all other cases it won't be found. That is
it will find

\begin{document}

but it won't find 

test\begin{document}

Forward search works in all cases.

It is very important for Kile to get this to work again, hopefully before KDE 3.4.

best,
Jeroen
Comment 1 Jeroen Wijnhout 2005-02-23 19:13:21 UTC
Note that KTextEditor::SearchInterface::searchText(...) suffers from the same
problem. So perhaps the bug is in the part.
Comment 2 Anders Lund 2005-02-23 20:16:08 UTC
On Wednesday 23 February 2005 19:11, Jeroen Wijnhout wrote:
> It is very important for Kile to get this to work again, hopefully before
> KDE 3.4.

It works completely fine here.
Anyone else?

-anders
Comment 3 Jeroen Wijnhout 2005-02-23 20:29:19 UTC
There is no KDE_3_4 branch yet, right? So we must be using the same version. Unless the Qt version matters. I will do some more testing tomorrow.
Comment 4 Dominik Haumann 2005-02-23 20:30:55 UTC
I have Qt 3.3.4.
in KDE 3.3.2 (Gentoo) it does not work.
in HEAD it *works* for me.
Comment 5 Anders Lund 2005-02-23 20:35:29 UTC
On Wednesday 23 February 2005 20:29, Jeroen Wijnhout wrote:
> There is no KDE_3_4 branch yet, right? So we must be using the same
> version. Unless the Qt version matters. I will do some more testing
> tomorrow.

I'm using cvs HEAD which is unchanged since I committed my last changes 
yesterday. The Qt version should not matter. I have no idea what the problem 
was, or when it was fixed, even it is probably done by me. Maybe I can find 
it in the commit log. But it should work in cvs HEAD no matter what.

-anders

Comment 6 Jeroen Wijnhout 2005-02-24 11:11:21 UTC
I can confirm that it works in HEAD (my build system was a bit messed up, fixed that now). So the bug is only in KDE 3.3.x. I would appreciate if you could give a hint on how to workaround this, since Kile should work with KDE 3.2,3.3 and 3.4.

best,
Jeroen
Comment 7 Anders Lund 2005-02-24 12:32:09 UTC
On Thursday 24 February 2005 11:11, Jeroen Wijnhout wrote:
> I would appreciate if you could give a hint on how to workaround this,
> since Kile should work with KDE 3.2,3.3 and 3.4.

Well, it seems to be a bug. 

I can't see how it can be worked around, except from fixing the code. But we 
don't release those branches, so even after doing that, it would require your 
users built from the branch.

The change that does the difference is in kdelibs/kate/part/katetextline.cpp. 
The following diff is the result of me fixing the issue for regex search in 
hte 3.3 branch locally:

Index: katetextline.cpp
===================================================================
RCS file: /home/kde/kdelibs/kate/part/katetextline.cpp,v
retrieving revision 1.77.4.4
diff -u -b -r1.77.4.4 katetextline.cpp
--- katetextline.cpp    28 Sep 2004 08:21:10 -0000      1.77.4.4
+++ katetextline.cpp    24 Feb 2005 11:25:58 -0000
@@ -253,6 +253,9 @@
   if (backwards)
   {
     int col = startCol;
+
+    // allow finding the string ending at eol
+    if ( col == m_text.length() ) startCol++;
     do {
       index = regexp.searchRev (m_text, col);
       col--;


There is a similar fix for text search, which should be applied as well to 
backport the whole fix, it's marked with the same comment in katetextline.cpp 
in HEAD. The methods to fix are both versions of KateTextLine::searchText().

Would it be of any value to you if I backported the fix to both 3_3 and 3_2 
branches?

-anders
Comment 8 Jeroen Wijnhout 2005-02-24 13:57:29 UTC
On Thursday 24 February 2005 12:32 pm, Anders Lund wrote:
> There is a similar fix for text search, which should be applied as well to
> backport the whole fix, it's marked with the same comment in
> katetextline.cpp in HEAD. The methods to fix are both versions of
> KateTextLine::searchText().
>
> Would it be of any value to you if I backported the fix to both 3_3 and 3_2
> branches?

No, don't bother. At least now I know where it is coming from. Guess I will 
have to encourage KDE 3.3 users to upgrade to KDE 3.4 ;-) Thanks for looking 
into it.

best,
Jeroen

Comment 9 Philippe Rigault 2005-02-25 15:33:54 UTC
QRegExp::searchRev() does not _have_ a bug, the function _is_ itself a bug and should be avoided altogether (explaination below, sorry if it is a bit lengthy).

QRegExp::searchRev() confuses a lot of people, who wrongly think that is searches a regexp backwards from the end of a string, and therefore is symmetrical to QRegExp::search().

This is *NOT* what QRegExp::searchRev() does.

What it does is both complicated and counterintuitive, two reasons making it worse than just useless.

Here is how I learned exactly how QRegExp::searchRev() works. I filed a bug to Trolltech about this in December ([Issue N62377]) because of another bug: http://bugs.kde.org/show_bug.cgi?id=94726 

> Hello,
>
> Qt version: qt-3.3.3
> Compiled from sources.
> Plarform: Linux, i686
> Compiler: gcc-3.4.2 (Fedora Core 3) and gcc-3.3.3 (Fedora Core 1)
>
> I think that QRegExp::searchRev() has a bug, demonstrated by the following
> code:
>
> #include <iostream>
> #include <qregexp.h>
> #include <qstring.h>
>
> using namespace std;
>
> int main (int argc, char *argv[]) {
>     QString name = "foo123.png";
>     QRegExp rx("\\d+");
>
>     int i = rx.search(name);
>     int j = rx.matchedLength();
>
>     cout << "search i=" << i << " j=" << j << endl;
>
>     i = rx.searchRev(name);
>     j = rx.matchedLength();
>
>     cout << "searchRev i=" << i << " j=" << j << endl;
>
> }
>
> $ ./test_regexp
> search i=3 j=3
> searchRev i=5 j=1
>
> It shows that the QRegExp::searchRev() function matches only the last digit
> of the regexp ("3") instead of the whole thing ("123").
> QRegExp::search() behaves correctly (i.e matches "123").
>

And here is their answer:

> Hi Philippe,
>
> On Sunday, 12. Dec 2004 00:06 Philippe Rigault wrote:
>
> [snip]
>
> > It shows that the QRegExp::searchRev() function matches only the last
> > digit of  the regexp ("3") instead of the whole thing ("123").
> > QRegExp::search() behaves correctly (i.e matches "123").
>
> This is actually by design, QRegExp::searchRev() will still search
> forward, but it will just start at the end, and work it's way
> backwards.
>
> Hope this helps.
>
> --
> Jan Erik Hanssen
> Trolltech AS, Waldemar Thranes gate 98, NO-0175 Oslo, Norway

You can now figure out what QRegExp::searchRev() does in plain english:
Given a string, it searches _forward_ for a regexp in successive substrings consisting first of the last character of the string, then the last two characters, and so on.
Contrast it with QRegExp::search(), which also searches forward for a regexp in successive substrings, but these substrings consist first of the entire string, then the substring starting at the second character, and so on.

This design of QRegExp::searchRev(), in addition to being counterintuitive, poses a major problem with regards to greediness, as I objected to Trolltech:

> Hi Jan,
>
> Thank you for your rapid response.
>
> This renders searchRev() directly incompatible with the very notion of
> greediness in regular expressions with quantifiers.
> What people expect is that:
> - with setMinimal(FALSE), the search is greedy
> - with setMinimal(TRUE), the search is notgreedy
>
> Your definition of QRegExp::searchRev() _guarantees_ that the search can
> never be greedy, and that setMinimal() has no effect on
> QRegExp::searchRev(), which is unlike any other function and absolutely
> counterintuitive.
>
> If this is really the case (and then QRegExp::searchRev() would be a pretty
> useless function because it cannot use quantifiers properly), this needs to
> be _heavily_ documented.
>
> Now I think that the design part is the bug.
>
> Best regards,

My opinion is therefore that QRegExp::searchRev() should always be avoided:
  - because in most cases, a proper use of QRegExp::search() can do the job (for example, if one really wants the symmetrical action of QRegExp::search(), then use it on the reversed string)
  - because it is incompatible with greediness, as demonstrated above
  - because each inclusion of this function in the code would require _heavy_ warning comments to remind people that this is not a straightforward function.

Best regards,

Philippe Rigault

Comment 10 Anders Lund 2005-02-25 20:11:17 UTC
On Friday 25 February 2005 15:33, Philippe Rigault wrote:
> What it does is both complicated and counterintuitive, two reasons making
> it worse than just useless.

I'm not sure why you suddenly attack qregexp. The problem is solved, it does 
work in current kate.

-anders
Comment 11 Philippe Rigault 2005-02-25 21:16:31 UTC
> I'm not sure why you suddenly attack qregexp. The problem is solved, it does
> work in current kate. 

Please read my comment correctly. I am _encouraging_ people to use qregexp correctly by avoiding the only one of its member functions (searchRev) that is problematic, and I have explained why.

I wish people used regexes more, and more correctly :-)

The whole problem with this bogus function is that it gives the false impression that it brings useful functionality, whereas in fact it dramatically contrasts with some fundamental properties expected by regexp users, greediness being a _very_ fundamental one. For example, you will never by able with this function to capture something as simple as the last consecutive string of digits in a string (something easily done with search()). How is that for a regexp member function called searchRev() ?

Problem solved or not for this bug, in my opinion any use of QRegExp::searchRev() is a time bomb, ready to cause havoc by the mere addition of quantifiers (because it scans forward). People start using it with a trivial regexp just because it avoids a loop or crafting a correct regexp, but when the regexp becomes a real one, with quantifiers and greediness, bugs will inevitably arise because QRegExp::searchRev() by its design does not work as a classical regexp engine, or as its name would suggest.

Fortunately, I have seen in the whole KDE only a few uses of QRegExp::searchRev(), all of them easily replaceable with search(), but all of them will cause problems in the future should the quantifiers require greediness.

Comment 12 Anders Lund 2005-02-25 21:31:12 UTC
You need to come up with a suggestion as to what to do instead. And why 
doesn't trolltech fix the bug in their code instead of us having to work 
around it?
Comment 13 Richard Smith 2005-02-26 14:41:56 UTC
On Friday 25 February 2005 14:33, Philippe Rigault wrote:
> QRegExp::searchRev() does not _have_ a bug, the function _is_
> itself a bug and should be avoided altogether (explaination below, sorry if
> it is a bit lengthy).
[...]

I do agree that Kate's behaviour is incorrect here, but your objection to 
searchRev() doesn't seem right to me. Imagine I have the following regular 
expression:

abcdef|[b-e]+

And I'm matching against the string:

foobarabcdefg

Now, where search() match this? Well, it matches at (in the order that the 
string is found):

start:end (matching letters in capitals)
3:4 (fooBarabcdefg)
6:12 (foobarABCDEFg)
7:11 (foobaraBCDEfg)
8:11 (foobarabCDEfg)
9:11 (foobarabcDEfg)
10:11 (foobarabcdEfg)

Right? So, what can a reverse search for this regexp possibly do? If you want 
it symmetric with search(), searchRev() *must* find 10:11 first, like it 
currently does. Then it must find 9:11. Then 8:11. And so on. These are the 
latest-starting greedy matches.

If you want the latest-finishing greedy matches, it'd have to match in this 
order:

6:12 (foobarABCDEFg)
7:11 (foobaraBCDEfg)
7:10 (foobaraBCDefg)
7:9 (foobaraBCdefg)
7:8 (foobaraBcdefg)
3:4 (fooBarabcdefg)

which is not a reverse search: the starting point of the match is sometimes 
moving *forwards*: it finds 'abcdef' then finds 'bcde'. And it's certainly 
not symmetric with search(): it finds different matches!

How would you like reverse search with regexps to work at the Qt API level?

Now, how Kate deals with regexp searching is another matter. Using the same 
regexp and search string, it matches:

3:4 (fooBarabcdefg)
6:12 (foobarABCDEFg)

and no other places. Reverse search should clearly find the same things in the 
opposite order, so should match:

6:12 (foobarABCDEFg)
3:4 (fooBarabcdefg)

but actually matches:

10:11 (foobarabcdEfg)
6:12 (fooBarabcdefg)

Interestingly, in this case, the last-finishing greedy matches are exactly 
what's wanted here: if we look for nonintersecting matches backwards using 
last-finishing matching, we match at 6:12 then at 3:4, like we should (and 
this works in general too).

Unfortunately, Qt doesn't provide a function to do this kind of searching, 
which makes it non-trivial for Kate to implement. There is a way, however 
(but it is nasty to implement): you can reverse the regexp (which is hard) 
and reverse the search string (which is much easier):

Regexp: [b-e]+|fedcba
string: gfedcbaraboof

With these, Kate now matches:
gFEDCBAraboof
gfedcbaraBoof

which, when reversed back, give us the correct matches in the correct order. 
This reversing is equivalent to doing a last-finishing greedy match. Clearly, 
this isn't going to be implemented for KDE3.4, but does seem to be a valid 
Kate bug.
Comment 14 Anders Lund 2005-02-26 15:06:40 UTC
There is actually no way to change the greediness of the regex search through 
the GUI, and it looks like our search code blindly accepts Qts defaults.

Of course we'd have to manually retri untill matching failed when searching 
backwards no matter what. I'm not sure if it's worth it. I see this a a bug 
in QRegExp mostly, except maybe we do wrong in limiting the string the way we 
do when searching backward (that must be why we have less matches than a bare 
regex).

In any way, getting it to work would be nice indeed.

-anders