175735 – regex '^' matching is incorrect

Status:	RESOLVED INTENTIONAL

Alias:	None

Product:	kate
Classification:	Applications
Component:	general
Version:	unspecified
Platform:	unspecified Unspecified

Importance:	NOR wishlist
Target Milestone:	---
Assignee:	KWrite Developers

Reported:	2008-11-21 09:18 UTC by Maciej Pilichowski
Modified:	2015-10-08 08:59 UTC
CC List:	2 users

Description Maciej Pilichowski 2008-11-21 09:18:23 UTC

Version: (using KDE 4.1.3)

Do not count a non-selected line as partially selected, or provide an appropriate visual indication.

This is _not_ a duplicate of:
https://bugs.kde.org/show_bug.cgi?id=149872

The other one is about whole-line actions; everybody agreed that the behaviour of line actions should be consistent and that the action pointed out (namely sort) is inconsistent.

This report goes in a different direction, so I don't want to merge them: that would make one report with two issues.

Here, the problem is not technical, but rather about usability. Just a quick reminder.

In KDE3, given this text:

|aa
|

where | denotes selection boundaries meant:
selection starts _after_ (bug) the beginning of the first line and ends after the end of the first line (i.e. before the beginning of the second one)

in KDE4 it means something different:
selection starts _before_ the beginning of the first line and ends after the beginning of the second line

So the bug is fixed, but a new "flaw" has been introduced.

The problem is that the line "aa" really is selected -- I mean the user can see it. When the user selects text from top to bottom, well, she/he can see the rest too (mentally; I don't literally "see" it).
However, when you select from bottom to top, the selection is technically exactly the same, but there is no way the user can see all of it -- visually it is only the line "aa".

So my points are (some personal, some design):
a) for me it feels much more intuitive/natural to have the beginning of the second line excluded (due to previous experience)
b) there is a WYSIWYG violation -- it would be much more consistent if the user got reliable feedback and actions were performed in accordance with that feedback

ad b) it is wrong if, as in this case, the user sees one thing and kate "sees" something different

So, either those two should mean the same:

|aa|

and

|aa
|

I would think of the latter as analogous to "syntactic sugar" in programming languages -- it can be expressed another way, but this way is faster. And I opt for that approach.

Or -- another solution would be providing correct visual feedback.

Comment 1 Maciej Pilichowski 2008-11-21 09:28:45 UTC

PS. Just to add something for consideration -- another inconsistency:

|aa|

selects the end of the line, but this

|aa
|

(second line is empty) does not select the end of the second line.


Hmm, it seems that in the first part the visual feedback is missing, in the second -- here -- it is a bug.

Comment 2 Andreas Pakulat 2008-11-21 14:48:20 UTC

Hmm, I think I don't understand what your problem is with the current selection behaviour and what the suggested fixes mean.

|aa
|

and 

|aa|

cannot "mean" the same thing, as they select different numbers of characters. One includes the end-of-line, the other doesn't.

Is your problem that the cursor itself is on the second line? But that's correct as well, because you hit "down", so the cursor should go down one line.

Maybe I'm just too used to plain text editors, but to me the current behaviour feels quite natural.

Comment 3 Maciej Pilichowski 2008-11-21 15:06:33 UTC

There are two problems, and those are facts:
* the visual indication is not "correct" (it is misleading)
* the behaviour of selection boundaries is inconsistent (see #1)

Now, what should we do about it:
a) improve visual feedback
b) change selection behaviour

ad.a) I hope there is no vote against this

ad.b) and here is the big question -- how?

=====================================================
(b)
Now, although it feels completely unnatural to me personally, I tend to think that placing the cursor in an empty line means the whole line, with its beginning (^) and its end ($).

Which means that placing the cursor next to ^ in any line means a match for ^, and the same goes for $.

So the result may seem awkward, because this

|aa
|b

would mean
^aa$^
in selection

and this
aa
|b|

would mean
^b$

so note that there are two distinct sets, yet ^ (from b line) is included twice.

However, if I am not mistaken, it is the only logical way. Why? Consider having a file and pressing ctrl+a. Are ^ and $ (of the first and last line) included? They have to be -- now, where are the boundaries of the selection placed? Next to ^ and $. Now, note this also applies to an empty file or a one-line file. Hence my description above.

For now, I don't see another way to make this logical.

Comment 4 Andreas Pakulat 2008-11-21 15:52:08 UTC

I think you're simply missing out that kate doesn't show you the <eol> (end-of-line) but instead transforms it into a proper visual representation (which is putting the character after <eol> onto the next line). 

However a given line does not have to include an <eol>. For example if you create a new file and type "abc" into it, you don't have an <eol> in that. So Ctrl+A rightfully selects "abc", which is indicated by

|abc|

The <eol> part is important here: selecting a line means selecting not only the text in the line, but also the <eol>. And vice versa, a line whose <eol> is not part of the selection is not selected completely (including the <eol> means the cursor is at the beginning of the next line), and deleting such a selection will leave an empty line.
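
To make the <eol> point concrete, here is a minimal Python sketch. It is not Kate code: it assumes the document is a plain string with "\n" as <eol> and a selection is a pair of character offsets, and `delete_selection` is a hypothetical helper used only for illustration.

```python
def delete_selection(text, start, end):
    """Remove the characters in the half-open range [start, end)."""
    return text[:start] + text[end:]

text = "aa\nbb\n"

# |aa|     -> offsets 0..2, the <eol> is NOT part of the selection
print(repr(delete_selection(text, 0, 2)))  # '\nbb\n' (an empty line is left behind)

# |aa      -> offsets 0..3, the <eol> IS part of the selection,
# |           so the cursor sits at the beginning of the next line
print(repr(delete_selection(text, 0, 3)))  # 'bb\n'
```

The two selections differ by exactly one character, the <eol>, which is why they cannot mean the same thing under this model.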

I fail to see where this doesn't work or is visually indicated in the wrong way in kate.

Comment 5 Maciej Pilichowski 2008-11-21 16:03:42 UTC

> I think you're simply missing out that kate doesn't
> show you the <eol> (end-of-line) but instead transforms it into a
> proper visual representation (which is putting the character after
> <eol> onto the next line).

No, I am not.

But from now on, if you agree, let's talk in regexp language, so it is clear who means what. So let's use ^ and $.

> I fail to see where this doesn't work or is visually indicated in
> the wrong way in kate.

Ok, once again :-)

Misleading visual indication

|aaa
|bbb

Make this selection from a to b, and then from b to a. In the second case you have absolutely no indication that anything from b is selected.

And about wrong behaviour, select this

|aaa
bbb|

how many $ matches? --> 2. Now select also the third line (empty):

|aaa
bbb
|

how many $ matches? --> according to Kate: 2.

And this is where Kate's logic fails. In my previous comment I described how to fix this -- it would be (for me) unnatural, but logical. And I prefer logic after all :-)

Comment 6 Andreas Pakulat 2008-11-21 17:17:25 UTC

I don't see why regex would help anything, because you're still assuming that the <eol> character == $, which is just wrong. "^" and "$" are artificial delimiters, where "^" is _before_ the first character of a line and "$" is behind the last character of a line. A line in kate includes the <eol> marker. Hence "$" means after the <eol> marker. So to be 100% correct kate should have:

|aaa |

But that's pretty awkward to most people using a plain-text editor I think (vim supports this in visual mode). Hence the cursor ends up in the next line.

Comment 7 Maciej Pilichowski 2008-11-21 17:25:54 UTC

> I don't see why regex would help anything, 

It is faster to type $ than EOL. Or <eol>. That's all. We could write B and E as well, I don't mind.

> because 
> you're still assuming that the <eol> character == $, 

I am not.

> which is just 
> wrong. "^" and "$" are artificial delimiters, where "^" is _before_
> the first character of a line and "$" is behind the last character
> of a line. A line in kate includes the <eol> marker.

I know.


The previously described limitations still hold -- Kate behaves inconsistently and the indication of the selection is sometimes misleading.

I also suggested how to fix the behaviour so that it would always be logical, no matter if you select the entire file, one line, an empty line, etc. You could always predict the behaviour purely logically, without relying on known flaws in Kate or on the changes from KDE3 to KDE4.

Comment 8 Andreas Pakulat 2008-11-21 18:30:33 UTC

Kate's behaviour is not inconsistent. To select a complete line you have to select: "aaa<eol>", inclusive. And the only way to present that fact to the user is to move the cursor _after_ the <eol>, which means at the beginning of the next line. Else Kate would need to introduce an <eol> special character that's shown at the end of each line, which is kinda awkward and seldom needed.

If you select from line 1 to line 2 via Shift+Down that's what happens. If you select from line 2 to line 1 via Shift+Up, then you get the whole line 1 selected and the cursor at its front, indicating that the whole line is selected. As <eol> is not a visible character in this case you won't see any selection on line 2 and there's also no direct indication possible inside of line 1.

So I still don't see where kate is inconsistent.

Comment 9 Maciej Pilichowski 2008-11-21 18:50:50 UTC

> So I still don't see where kate is inconsistent.

Ok, let's start from the basics. Please answer this, but don't use Kate, just think.

You have a file, kate, and replace with regexp. How can you add "." at the end of each line through entire file?

Please, write the steps (don't cheat -- no using Kate for real ;-D).

Comment 10 Matthew Woehlke 2008-11-21 20:08:57 UTC

> You have a file, kate, and replace with regexp. How can you add "." at the end
> of each line through entire file? 

find: $
replace: .

...and it works just fine (which is to say, it works as it would if I used sed, i.e. '$' matches every occurrence of before-a-\n, as well as EOF /iff the last character is not \n/).
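
A minimal Python sketch of the sed-like '$' semantics described here (a match before every "\n", plus one at EOF only if the last character is not "\n"). It assumes "\n" line endings; `append_to_each_line` is a hypothetical helper for illustration, not anything from Kate or sed.

```python
def append_to_each_line(text, suffix):
    """Insert suffix where a sed-like '$' would match on each line."""
    out = []
    for line in text.splitlines(keepends=True):
        if line.endswith("\n"):
            # '$' sits just before the newline
            out.append(line[:-1] + suffix + "\n")
        else:
            # last line without a trailing newline: '$' sits at EOF
            out.append(line + suffix)
    return "".join(out)

print(repr(append_to_each_line("aaa\nbbb\n", ".")))  # 'aaa.\nbbb.\n'
print(repr(append_to_each_line("aaa\nbbb", ".")))    # 'aaa.\nbbb.'
```

Both inputs get exactly two insertions: the EOF after a trailing "\n" does not count as an extra line end.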

I'm not following this at all. Your report mentions selection, but there is nothing wrong with selection*. You also mention 'line-based actions' but don't give any concrete example what you are trying to do.

(*As the self-appointed master of selection, you're going to have to either explain an actual bug so that I understand you :-), or else work really, really hard to convince me that a behavior should be changed. But I'm not sure that's relevant, as I fixed selection in KDE3 also, and you mentioned something that is changed from KDE3, which is why I don't think you're really talking about selection, but something else entirely.)

The only thing I can find that feels at all fishy is this. Given this document ([] marks selection):

aaa
[bbb
]ccc
ddd

...s/^/./ affects both the 'bbb' and 'ccc' lines, which seems wrong (replacing $ works as expected). In fact, ^ in general seems to match EOF, which a: seems wrong, and b: is not how sed behaves.

Comment 11 Maciej Pilichowski 2008-11-21 22:15:26 UTC

> > You have a file, kate, and replace with regexp. How can you add
> > "." at the end of each line through entire file?
>
> find: $
> replace: .
>
> ...and it works just fine 

Matthew, it is about selection -- just a reminder, ok? ;-)

> (which is to say, it works as it would if 
> I used sed, i.e. '$' matches every occurrence of before-a-\n, as
> well as EOF /iff the last character is not \n/).

It is better not to compare because it should be logical, not the same (I found the differences in 3 seconds, before reading to the end of your comment ;-)).

Ok, so let's get back to our test.

== marks the document boundaries; any line which looks empty really is empty

This one works as expected.
==
hello world
==
1 match

this too
==
hello 
world
==
2 matches

but this one not:
==
hello 
world

==
2 matches

Although there are three lines, kate treats this as if there were two. Now, it is even more odd if you do the same test with ^ instead of $. All of the examples work -- respectively (1 match, 2 matches, 3 matches). So -- in the last case -- are there 2 or 3 lines?

Logical fix -- when the selection includes an empty line it should work as if the whole line were selected (so both ^ and $ match). More examples below.

> I'm not following this at all. Your report mentions selection, but
> there is nothing wrong with selection*. You also mention
> 'line-based actions' but don't give any concrete example what you
> are trying to do.

A line-based action is, for example, indent.

> (*As the self-appointed master of selection, you're going to have
> to either explain an actual bug so that I understand you :-), or
> else work really, really hard to convince me that a behavior should
> be changed. 

The problem is that understanding what is really selected should rely on logic. Otherwise with each action user will be surprised that she/he cannot do reliable find/replace/something-else.

> But I'm not sure that's relevant, as I fixed selection 
> in KDE3 also, and you mentioned something that is changed from
> KDE3, which is why I don't think you're really talking about
> selection, but something else entirely.)

When you work with replace/find on selected text the behaviour of Kate differs in KDE3 and in KDE4. 

> The only thing I can find that feels at all fishy is this. Given
> this document ([] marks selection):
>
> aaa
> [bbb
> ]ccc
> ddd
>
> ...s/^/./ affects both the 'bbb' and 'ccc' lines, which seems wrong
> (replacing $ works as expected). 

With #3 I came to the conclusion that this is logical behaviour, because a selection should be interpreted in the same way no matter whether it is a strict subset of the file or the entire file.

> In fact, ^ in general seems to 
> match EOF, which a: seems wrong, and b: is not how sed behaves.

I don't agree -- (a) it is "correct". I mean -- it is logical.

aaa
[bbb]
ccc

should match ^ and $ for bbb

aaa
[bbb
]ccc

^ for bbb and ccc, $ for bbb

aaa
[bbb
ccc]

^ and $ for both bbb and ccc

Now, note what happens when bbb or ccc are empty lines, for example

aaa
[bbb
]

^ and $ for both bbb and ccc (now empty) -- the expected behaviour

the same (I mean, symmetric) -- again, expected behaviour

aaa
[
ccc]

But Kate does not work that way. So I am opting for logical, symmetric behaviour (let's ignore for now visual indication).

Comment 12 Matthew Woehlke 2008-11-22 01:09:54 UTC

I agree that '^' is not matching correctly. As for the rest, I argue that your concept of what '^' and '$' match is wrong. They do NOT match '\n'. They DO match the start/end of lines. A line does not necessarily have a '\n' at the end.

Further, in your example:

=
aaa
bbb

=

...there are two lines. They are 'aaa\n' and 'bbb\n'. The EOF following a '\n' is not a line. (It is perhaps unfortunate that kate works this way, rather than like vi, but that's for another discussion.)

Or, put another way, the file whose exact contents are 'aaa\nbbb\n' contains two lines, as agreed upon by sed, vi, wc, and incidentally by kate. (wc differs in that it does not recognize at all lines that do not end with '\n', while sed, vi, and kate do.)
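
The two counting conventions can be sketched in Python. This only illustrates the definitions and is not how any of the tools are implemented: `count_lines` and `count_newlines` are hypothetical names, and Python's `str.splitlines` is used as a stand-in for "a final line without '\n' still counts; the EOF after a '\n' does not" (it also splits on a few other separators such as "\r", which do not occur in these samples).

```python
def count_lines(text):
    """Lines as in the sed/vi/kate view described above."""
    return len(text.splitlines())

def count_newlines(text):
    """Number of '\\n' characters, the wc-style count described above."""
    return text.count("\n")

for sample in ("aaa\nbbb\n", "aaa\nbbb", "abc"):
    print(repr(sample), count_lines(sample), count_newlines(sample))
# 'aaa\nbbb\n' 2 2
# 'aaa\nbbb' 2 1
# 'abc' 1 0
```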

> The problem is that understanding what is really selected should rely on
> logic. 

Again, you have utterly failed to convince me that there is anything wrong with selection. IMNSHO /selection/ is working exactly as it is intended to work.

> selection should be interpreted in the same way no matter if part of file is
> a strict subset of a file or it is entire file.

I agree. However, I fail to see where this is not happening.

> Logical fix -- when selection includes empty line it should work as the whole
> line is selected (so both ^ and $ matches).

Again, I agree. Luckily this is how kate behaves :-).

I will repeat: there is no line between '\n' and EOF (or EOS).

Comment 13 Maciej Pilichowski 2008-11-22 07:54:53 UTC

> Further, in your example:
>
> =
> aaa
> bbb
>
> =
>
> ...there are two lines. They are 'aaa\n' and 'bbb\n'. The EOF
> following a '\n' is not a line. (It is perhaps unfortunate that
> kate works this way, rather than like vi, but that's for another
> discussion.)
>
> Or, put another way, the file whose exact contents are 'aaa\nbbb\n'
> contains two lines, as agreed upon by sed, vi, wc, and incidentally
> by kate. (wc differs in that it does not recognize at all lines
> that do not end with '\n', while sed, vi, and kate do.)

I don't like the idea of backing this up with other apps, but I found out that I could think of it as two lines and space made by kate to edit the third line.

Ok with me. Thank you for enlightenment :-)

> Again, you have utterly failed to convince me that there is
> anything wrong with selection. IMNSHO /selection/ is working
> exactly as it is intended to work.

Ok, here are the explanations. New notation, more compact

^[aaa

this means that with this selection ^ was not matched by find/replace

[^aaa

this means it was.

There are 4 possibilities

aaa
^[bbb$
^ccc]$
ddd

aaa
[^bbb$
^ccc$]
ddd

aaa
^[bbb$
^ccc$]
ddd

aaa
[^bbb$
^ccc]$
ddd

The first two are symmetric, the latter two are asymmetric. Kate (both KDE 3 and 4) uses variant 2.

But -- if ccc is empty, kate uses variant 4, yet if bbb is empty it still uses 2.

This is not consistent and it is illogical. Kate in KDE3 used a different approach, still illogical, but more natural: if ccc was empty it matched only ^bbb$ -- it was consistent with whole-line actions (like indent).

Another choice would be to behave 100% consistently -- using variant 4 each time.

Honestly speaking I prefer the Kate3 behaviour -- it is illogical, yet it is what users are used to (hey, it is 100% compatible with sed by design ;-D), and it feels more natural.

And of course this is a rather tiny advantage, but the Kate3 behaviour does not require any change in selection indication.

Did I explain the problem? I hope if illogical behaviour didn't convince you, sed compatibility did :-)).

Comment 14 Maciej Pilichowski 2008-11-22 10:38:39 UTC

PS. Correction -- I was so focused on the ccc line that I said incorrectly that Kate3 worked like this (ccc line is empty)

aaa
[^bbb$
]^$
ddd

but it works that way

aaa
^[bbb$
]^$
ddd

So just for clarification, when I say I opt for Kate3-way I really meant the first way (in this comment).

Comment 15 Matthew Woehlke 2008-11-24 17:12:49 UTC

> I could think of it as two lines and space made by kate to edit the third
> line.

Yes, that's exactly how it works :-).

> Did I explain the problem?

Sorry, either I'm not following, or else I'm not seeing the behavior you are seeing. Is it '^' or '$' that is not matching as you expect? (IMO '$' is fine, but '^' is matching 'null lines' which I think is wrong. So far that's the only consistent 'iffy' behavior I've spotted.)

Comment 16 Maciej Pilichowski 2008-11-24 20:26:10 UTC

> > Did I explain the problem?
> Sorry, either I'm not following, 

Ok, but do you understand the "syntax" I used in those examples?

Assuming "yes"...

This is the expected/wished behaviour (I can explain why this is maybe not fully logical, but it is sed compatible, which you like, right :-), and it is really natural). Here we go:

aaa
[^bbb$
^ccc$]
ddd

aaa
[^$
^ccc$]
ddd

aaa
[^bbb$
]^ccc$
ddd

aaa
[^bbb$
]^$
ddd

For Kate (KDE4) only the first two cases work as expected. Kate didn't fully implement this behaviour either (both Kate3 and Kate4 work differently).

The shown behaviour is consistent with current visual indication of selection.
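
The four wished-for cases can be written down as a small Python model. This is only a sketch of the expected behaviour, not Kate's implementation: it assumes "\n" line endings and selections given as character offsets, and `anchors_in_selection` is a hypothetical helper. It treats the selected text as a small document of its own, so a line contributes ^ and $ only if at least part of it, or its "\n", lies inside the selection.

```python
def anchors_in_selection(text, start, end):
    """Return (caret_offsets, dollar_offsets) for the selection text[start:end]."""
    carets, dollars = [], []
    pos = start
    for line in text[start:end].splitlines(keepends=True):
        carets.append(pos)                      # '^' before the first selected char
        body = line[:-1] if line.endswith("\n") else line
        dollars.append(pos + len(body))         # '$' after the last non-newline char
        pos += len(line)
    return carets, dollars

doc = "aaa\nbbb\nccc\nddd\n"
print(anchors_in_selection(doc, 4, 11))  # ([4, 8], [7, 11])  bbb and ccc
print(anchors_in_selection(doc, 4, 8))   # ([4], [7])         bbb only, ccc untouched

print(anchors_in_selection("aaa\n\nccc\nddd\n", 4, 8))  # ([4, 5], [4, 8])  empty line and ccc
print(anchors_in_selection("aaa\nbbb\n\nddd\n", 4, 8))  # ([4], [7])        bbb only
```

In the second and fourth calls the selection ends right after the "\n" of bbb, so the following line (ccc or the empty line) gets neither ^ nor $.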

Comment 17 Matthew Woehlke 2008-11-24 21:52:40 UTC

> do you understand the "syntax" I used in those examples? 

Honestly I am not sure ;-). But if I do, then what you are describing is the behavior I would also expect. Do you agree that the current behavior looks like this?

aaa
^bbb$
^]ccc$
ddd

aaa
[^bbb$
^]$
ddd 

(If "yes", then hopefully that means we are on the same wavelength :-).)

The behavior for the most part seems to be consistent between selection and whole-document (though I may have missed something), which is one reason I say this isn't about selection. Also, I don't think you are complaining about how /selecting things/ works (which is what I call "selection"), but rather how regex matching happens with respect to selections (though, again, it seems to have the same bug in whole-document mode). Specifically, '^' matches the not-line after a '\n' and before EOS/EOF, which (IMO, but I think we just decided we agree on this) it should not.

Put another way, replacing '^' with some text - that does not include a '\n', of course - should not change the number of lines in the file. Right now, it does!
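
The line-count invariant can be checked with a short Python sketch. Python's `re` module is used purely as a stand-in regex engine (it is not what Kate uses): in `re.MULTILINE` mode its '^' also matches right after a trailing "\n", which reproduces the same symptom. The pattern `^(?!\Z)` is one possible way to skip that position, shown as an assumption about what "correct" should look like rather than as Kate's fix.

```python
import re

def line_count(text):
    return len(text.splitlines())

text = "aaa\nbbb\n"

# '^' matches at offsets 0, 4 and 8 (after the final newline)
naive = re.sub(r"^", ".", text, flags=re.MULTILINE)
print(repr(naive), line_count(naive))    # '.aaa\n.bbb\n.' 3

# Refuse to match when nothing follows, i.e. at the "potential" line
strict = re.sub(r"^(?!\Z)", ".", text, flags=re.MULTILINE)
print(repr(strict), line_count(strict))  # '.aaa\n.bbb\n' 2
```

The first replacement turns a two-line document into a three-line one; the second leaves the line count unchanged.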

That being the case, I've changed the description and confirmed this, though I must admit we've made enough noise I have serious thoughts about opening a fresh bug with a clearer description from the get-go.

Comment 18 Maciej Pilichowski 2008-11-24 22:11:41 UTC

> aaa
> ^bbb$
> ^]ccc$
> ddd
>
> aaa
> [^bbb$
> ^]$
> ddd
>
> (If "yes", then hopefully that means we are on the same wavelength
> :-).)

Yes, we are :-) This is the current behaviour.

> The behavior for the most part seems to be consistent between
> selection and whole-document (though I may have missed something),

Yes, with Kate interpretation of whole document (which differs from sed).

> which is one reason I say this isn't about selection. 

I see it as selection -> whole document, but if you see it in reverse, no problem.

> Specifically, '^' matches the
> not-line after a '\n' and before EOS/EOF, which (IMO, but I think
> we just decided we agree on this) it should not.

Yes.

> That being the case, I've changed the description and confirmed
> this, though I must admit we've made enough noise I have serious
> thoughts about opening a fresh bug with a clearer description from
> the get-go.

Ok, if I can be of any help, please let me know.

But! Since we agreed on selection, one important thing about it. This case

aaa
[^bbb$
]^ccc$
ddd

means in natural language -- "line ccc is not selected (at all)" (reason in technical language: if it were, ^ would be matched).

Comment 19 Matthew Woehlke 2008-11-24 22:50:18 UTC

> Since we agreed on selection, one important thing about it. This case
>
> aaa
> [^bbb$
> ]^ccc$
> ddd
>
> means in natural language -- "line ccc is not selected (at all)" (reason in
> technical language: if it were ^ would be matched).

Correct. 0 characters of the line 'ccc\n' are selected => that line is not selected. This matches the whole-document behavior that there is no line between a '\n' and an EOF (or EOS) immediately following it.

Basically, it's the same "potential for a line" versus "an actual line" situation. There exists a potential for that line to become selected, but it isn't [selected] yet :-).

...and '^' shouldn't match "potential" lines, only actual lines (currently it does match both; herein lies the bug).

Comment 20 L_V 2010-08-23 16:04:37 UTC

Kate 4:4.5.0b-0ubuntu1~lucid1 is broken with regex.
No way to remove or replace end of lines (\n ? or another one ? $ ?)
Is there any progress since 2008-11-24 ?

Comment 21 Christoph Cullmann 2015-10-08 08:59:39 UTC

Dear user,

this wish list item is now closed, as it wasn't touched in the last two years and no contributor stepped up to implement it.

The Kate/KTextEditor team is very small and we can just try to keep up with fixing bugs. Therefore wishes that show no activity for two years or more will be closed from now on to keep at least a bit of an overview of the 'current' wishes of the users.

If you want your feature to be implemented, please step up to provide some patch for it. If you think it is really needed, you can reopen your request, but keep in mind, if no new good arguments are made and no people get attracted to help out to implement it, it will expire in two years again.

We have a nice website kate-editor.org that provides all the information needed to contribute, please make use of it. For highlighting improvements our user manual shows how to write syntax definition files.

Greetings
Christoph