504749 – "<" counts as start of tag, so colours following text, regardless of the text that follows

Status:	RESOLVED FIXED

Alias:	None

Product:	lokalize
Classification:	Applications
Component:	editor (other bugs)
Version First Reported In:	unspecified
Platform:	Other Linux

Importance:	NOR normal
Target Milestone:	---
Assignee:	Simon Depiets

URL:
Keywords:

Depends on:
Blocks:

Reported:	2025-05-24 22:52 UTC by Finley Watson
Modified:	2025-07-09 13:57 UTC
CC List:	2 users

See Also:
Latest Commit:	https://invent.kde.org/sdk/lokalize/-/commit/c05bb1d914725d894c4a60c5c159375f1e6d94b2
Version Fixed/Implemented In:
Sentry Crash Report:

Attachments
Lokalize editor window showing translation source and target, with a <= incorrectly marked as the start of a tag, and all text following that coloured as a tag (114.78 KB, image/png) 2025-05-24 22:52 UTC, Finley Watson

Description Finley Watson 2025-05-24 22:52:27 UTC

Created attachment 181718
Lokalize editor window showing translation source and target, with a <= incorrectly marked as the start of a tag, and all text following that coloured as a tag

I realise this is not an easy thing to fix, but Lokalize colours tags (HTML, etc.) in the editor panes, and it treats every "<" character as the beginning of a tag.

However, "<" is also a valid character in ordinary text that needs translating (see the attached screenshot). In such cases all the text following it is coloured as a tag even though it is not a real or valid tag.

One way to fix this is to be smarter about parsing the string: after "<" there should (very probably) only ever be a character matching [a-zA-Z], so for example < em > is not a valid HTML tag while <em> is. We could also give up on marking a section of the string as a tag when the < and > are mismatched, i.e. walk the string looking for the matching >, and if it is not found by the end of the string, or another < is found first, reject that first < and don't colour it as a tag.
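
To make the two heuristics above concrete, here is a minimal sketch in plain C++. It is not taken from Lokalize: the function names `looksLikeTagStart` and `findTagSpans` are hypothetical, and allowing a `/` directly after the `<` (for closing tags) is an assumption added on top of the [a-zA-Z] rule.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not Lokalize code: true if the '<' at `pos` is
// directly followed by an ASCII letter, optionally after a '/'.
bool looksLikeTagStart(const std::string &text, std::size_t pos)
{
    assert(pos < text.size() && text[pos] == '<');
    std::size_t next = pos + 1;
    if (next < text.size() && text[next] == '/')
        ++next; // assumption: accept closing tags such as </em>
    if (next >= text.size())
        return false;
    const char c = text[next];
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Hypothetical helper: returns the [start, end) spans that would be
// coloured as tags under the two heuristics.
std::vector<std::pair<std::size_t, std::size_t>> findTagSpans(const std::string &text)
{
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != '<' || !looksLikeTagStart(text, i)) {
            ++i;
            continue;
        }
        // Walk forward to the matching '>'; stop early at another '<'.
        std::size_t j = i + 1;
        while (j < text.size() && text[j] != '>' && text[j] != '<')
            ++j;
        if (j < text.size() && text[j] == '>') {
            spans.emplace_back(i, j + 1);
            i = j + 1;
        } else {
            i = j; // reject this '<'; resume at the next '<' or at the end
        }
    }
    return spans;
}
```

With this sketch, `findTagSpans("a <= 3 and b > 2")` and `findTagSpans("< em >")` return no spans, while `findTagSpans("<em>hi</em>")` returns one span for each of the two tags.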

Comment 1 Bug Janitor Service 2025-07-02 22:47:37 UTC

A possibly relevant merge request was started @ https://invent.kde.org/sdk/lokalize/-/merge_requests/246

Comment 2 Finley Watson 2025-07-09 10:23:07 UTC

Git commit ee40ecd67e3ba95ab75aac433a81663f958e2d6f by Finley Watson.
Committed on 09/07/2025 at 10:23.
Pushed by finw into branch 'master'.

Improve HTML tag matching to reduce the false-positive colouring

By searching for tags with a regex, we can reduce the number of
false-positive matches where text in the translation source / target
is coloured as though it were an HTML tag when it isn't. Previously,
any text with a `<` char in it would be coloured as HTML from that
char; now colouring only starts at a `<` char followed by an
alphabetic char, potentially with a `/` char between, i.e. it matches
`<s` in `<strong>` and `</s` in `</strong>` but not something like
`< b` or `<= 3`. In my experience this is in line with how web engines
parse HTML files.

Before:

![image](/uploads/5b7909579002204a6a94f3e798928831/image.png){width=675 height=590}

After:

![image](/uploads/ca5dfdde0decc98603231d41807439a7/image.png){width=681 height=557}

Original HTML tag highlighting is not changed:

![image](/uploads/910835acca0f003710547f87d6cc8585/image.png){width=232 height=92}

M  +5    -2    src/syntaxhighlighter.cpp

https://invent.kde.org/sdk/lokalize/-/commit/ee40ecd67e3ba95ab75aac433a81663f958e2d6f
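
The matching rule described in the commit message above can be sketched as a small regex check. This is an illustration only: the pattern `</?[a-zA-Z]` is inferred from the commit message, not copied from src/syntaxhighlighter.cpp, the function name `tagStartOffsets` is hypothetical, and `std::regex` is used here only to keep the example self-contained (Lokalize is a Qt application, so the real code presumably uses Qt's own classes).

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the offsets at which tag colouring would start.
// Assumed pattern: '<', an optional '/', then one ASCII letter.
std::vector<std::size_t> tagStartOffsets(const std::string &text)
{
    static const std::regex tagStart("</?[a-zA-Z]");
    std::vector<std::size_t> offsets;
    for (std::sregex_iterator it(text.begin(), text.end(), tagStart), end; it != end; ++it)
        offsets.push_back(static_cast<std::size_t>(it->position()));
    return offsets;
}
```

For `<strong>bold</strong>` this reports two start offsets, one for `<s` and one for `</s`, while `x < b` and `n <= 3` produce none.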

Comment 3 Finley Watson 2025-07-09 13:57:32 UTC

Git commit c05bb1d914725d894c4a60c5c159375f1e6d94b2 by Finley Watson.
Committed on 09/07/2025 at 10:23.
Pushed by finw into branch 'release/25.08'.

Improve HTML tag matching to reduce the false-positive colouring

(cherry picked from commit ee40ecd67e3ba95ab75aac433a81663f958e2d6f)

M  +5    -2    src/syntaxhighlighter.cpp

https://invent.kde.org/sdk/lokalize/-/commit/c05bb1d914725d894c4a60c5c159375f1e6d94b2