Bug 468484

Summary:	request: spellchecking to be HTML aware
Product:	[Frameworks and Libraries] frameworks-syntax-highlighting	Reporter:	Richard Neill <kde>
Component:	syntax	Assignee:	KWrite Developers <kwrite-bugs-null>
Status:	CONFIRMED ---
Severity:	wishlist	CC:	christoph, jonathan.poelen, walter.von.entferndt
Priority:	NOR
Version First Reported In:	unspecified
Target Milestone:	---
Platform:	Ubuntu
OS:	Linux
Latest Commit:		Version Fixed/Implemented In:
Sentry Crash Report:

Description Richard Neill 2023-04-13 23:48:19 UTC

When writing an HTML or PHP page, it would be great if the spellchecker checked the English-language text but did not complain about tags and class names, and were entity-aware.

So for example:
    <h2 >grey&nbsp;elephant</h2>
currently flags "h2" and "nbsp" as misspelled, whereas I think it should only check the words "grey" and "elephant".

Incidentally, I saw that bug #321593 is considered too complex to implement, but perhaps we could at least handle the (relatively) common case of web programming and pick the low-hanging fruit by teaching it to ignore   <[^>]+>  and   &[[:alnum:]]+;
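
To illustrate the filtering idea, here is a minimal sketch in Python (not the editor's own code). The function name checkable_words and the WORD pattern are assumptions made for the example. Python's re module has no POSIX [:alnum:] class, so the entity pattern is spelled out, and it also accepts numeric entities.

```python
import re

# Tags and character entities, following the two patterns suggested above.
TAG = re.compile(r"<[^>]+>")
ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")
# Assumption: a "word" is a run of ASCII letters, optionally with one apostrophe.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def checkable_words(line):
    """Return the words left after blanking out tags and entities."""
    text = TAG.sub(" ", line)
    text = ENTITY.sub(" ", text)
    return WORD.findall(text)


print(checkable_words("<h2 >grey&nbsp;elephant</h2>"))
```

For the example line this leaves only "grey" and "elephant" to be checked. Attribute values disappear together with their tags, so nothing inside a tag is checked.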

Thanks :-)

Comment 1 Christoph Cullmann 2023-04-14 19:14:31 UTC

One needs to mark the stuff that shall not be spell checked in the syntax definition.

Comment 2 Jonathan Poelen 2023-07-10 00:59:15 UTC

This has been done since 2016, and on my side h2 and nbsp are ignored. I don't understand: which editor are you using? I tried with Kate.

Comment 3 Richard Neill 2023-07-10 12:47:32 UTC

Hi - and thanks for your reply. 
I've realised what's happening - there is an issue here, but it's not quite what I thought it was.

Test case 1. 
Create a file with a .php extension and, without entering PHP mode (i.e. no <?php), put just this line in it:

--- BEGIN ---
<h5>This is a non&eacute;breakzing title <a href='http://examplze.com/'>link</a> </h5>
--- END ---

This is handled correctly, identifying the misspelled 'breakzing' but nothing else. 


Test case 2.
In the same file, enter PHP mode and echo it.

--- BEGIN ---
<?php 
echo "<h5>This is a non&nbsp;breakzing title with a $var[keytzypo] embedded and a <a href='http://examplze.com/'>link</a> </h5>";
?>
--- END ---

In this case, the spellchecker triggers on:
* h5 - this is a legal tag; shouldn't trigger.
* nbsp - this is a legal entity; shouldn't trigger.
* breakzing - this IS a typo which we wanted to find - correct behaviour.
* keytzypo - this is an array key; shouldn't trigger. (Normal, non-array variables are OK.)
* examplze - part of a URL; shouldn't trigger.
* h5 - again.


=> So, this bug report should really be about spellchecking text inside a quoted string in PHP.
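
As a rough sketch of the behaviour I would hope for inside a PHP string (Python, purely illustrative; the names SKIP and words_in_php_string are made up here, and the variable pattern only covers the simple $var and $var[key] forms):

```python
import re

# Spans inside a PHP string body that should not reach the spell checker.
SKIP = re.compile(
    r"""
    <[^>]+>                                                     # HTML tag, attributes included
  | &(?:\#[0-9]+|\#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);      # character entity
  | \$[A-Za-z_][A-Za-z0-9_]*(?:\[[^\]]*\])?                     # $var or $var[key]
    """,
    re.VERBOSE,
)
WORD = re.compile(r"[A-Za-z]+")


def words_in_php_string(body):
    """Return the words to spell-check from the body of a quoted PHP string."""
    return WORD.findall(SKIP.sub(" ", body))


body = ("<h5>This is a non&nbsp;breakzing title with a $var[keytzypo] embedded "
        "and a <a href='http://examplze.com/'>link</a> </h5>")
print(words_in_php_string(body))
```

With the string from test case 2, the typo "breakzing" is still offered for checking, while h5, nbsp, keytzypo and examplze are skipped.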

This way of writing code is so common (i.e. switching in and out of PHP mode by using echo, rather than with ?>...<?php) that I didn't notice the echo was a critical part of the bug report. KWrite's normally helpful behaviour of highlighting multiple instances of the same string also meant that, when it highlighted the h5 within the echo, it highlighted the one in the normal HTML too, which may be why I didn't notice. My error - sorry.

* The behaviour is the same in KWrite and in Kate (as we would expect).

* The same behaviour occurs for any way of quoting a string: single, double, or heredoc.

* A minor unrelated point I spotted: in HTML mode, Kate/KWrite correctly ignores everything in a tag. But perhaps it should check title and alt attributes.

Thanks for your help.

Comment 4 Jonathan Poelen 2023-07-12 09:08:01 UTC

I understand better now, but this is rather complicated to do and has undesirable side effects.

For example, in the simple case of checking title and alt in HTML, this involves adding a "color" (style) whose distinguishing property is that it is spell-checkable.

<img class="..." alt="bla bla">
with
<img is Element
class= is Attribute
"..." is Value
alt= is Attribute
"bla bla" is spellCheckableValue

Value and spellCheckableValue are both attribute values, but the syntax must expose them as 2 distinct styles, which means modifying 2 colors if you want to change the color of attribute values. Although this can be done, I think the resulting behavior is strange from the user's point of view.
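
The split described above can be sketched in Python. This is only an illustration of the classification, not how a syntax definition is actually written; classify_tag and PROSE_ATTRIBUTES are hypothetical names, and the set of prose attributes is an assumption.

```python
import re

# Assumption: only these attributes carry human-readable prose.
PROSE_ATTRIBUTES = {"alt", "title"}

ELEMENT = re.compile(r"<\s*([A-Za-z][A-Za-z0-9]*)")
ATTR = re.compile(r"""([A-Za-z][\w:-]*)\s*=\s*("[^"]*"|'[^']*')""")


def classify_tag(tag):
    """Label the parts of one opening tag with the style names used above."""
    parts = []
    element = ELEMENT.match(tag)
    if element:
        parts.append(("<" + element.group(1), "Element"))
    for name, value in ATTR.findall(tag):
        parts.append((name + "=", "Attribute"))
        # Same kind of token, but two styles: this is the duplication in question.
        style = "spellCheckableValue" if name.lower() in PROSE_ATTRIBUTES else "Value"
        parts.append((value, style))
    return parts


for text, style in classify_tag('<img class="..." alt="bla bla">'):
    print(text, "is", style)
```

The sketch makes the cost visible: the only difference between the two value styles is the spell-check flag, yet a user theming attribute values would have to change both.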

The same goes for PHP with String, Heredoc and Nowdoc, which, in addition to creating false positives, is a real pain when you have to juggle a lot of interleaved highlighting.

A plugin could do it, but I think it would end up duplicating a lot of code, both the code that handles spell checking and the code that detects the useful parts of the syntax.

I'm wondering whether it wouldn't be possible to add alternative syntaxes dedicated to spell checking, ideally with a way of selecting them when checking is activated. Since this kind of "syntax" eliminates the need for highlighting, the detection of language elements could be simplified.
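
To give an idea of how small such a spell-check-only "syntax" could be, here is a rough Python sketch. It is an assumption-laden illustration, not a proposal for the actual format: all names are hypothetical, heredoc/nowdoc are not handled, and an apostrophe in a PHP comment would confuse the string pattern. It only needs to know two things: HTML text outside PHP blocks, and quoted string bodies inside them.

```python
import re

PHP_BLOCK = re.compile(r"<\?php(.*?)(?:\?>|\Z)", re.DOTALL)
PHP_STRING = re.compile(
    r"""
    "((?:[^"\\]|\\.)*)"      # double-quoted string body
  | '((?:[^'\\]|\\.)*)'      # single-quoted string body
    """,
    re.VERBOSE | re.DOTALL,
)
# Tags, entities and simple variable interpolation are never checked.
MARKUP = re.compile(r"<[^>]+>|&#?\w+;|\$\w+(?:\[[^\]]*\])?")
WORD = re.compile(r"[A-Za-z]+")


def prose_words(fragment):
    """Words left in a fragment after markup is blanked out."""
    return WORD.findall(MARKUP.sub(" ", fragment))


def checkable_words_in_file(source):
    """Collect checkable words from HTML text and from PHP string bodies."""
    words = []
    pos = 0
    for block in PHP_BLOCK.finditer(source):
        words += prose_words(source[pos:block.start()])   # HTML before the block
        for string in PHP_STRING.finditer(block.group(1)):
            words += prose_words(string.group(1) or string.group(2) or "")
        pos = block.end()
    words += prose_words(source[pos:])                     # HTML after the last block
    return words
```

PHP code outside of strings (keywords such as echo, function names) never reaches the checker, and no highlighting styles are involved at all.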