Bug 83284

Summary:	Konqueror chokes on <!-- without //--> [testcase]
Product:	[Applications] konqueror	Reporter:	Dik Takken <kde>
Component:	khtml parsing	Assignee:	Konqueror Bugs <konqueror-bugs-null>
Status:	RESOLVED FIXED
Severity:	normal	CC:	aiacovitti, germain, kde
Priority:	NOR
Version First Reported In:	unspecified
Target Milestone:	---
Platform:	Compiled Sources
OS:	Linux
Latest Commit:		Version Fixed In:
Sentry Crash Report:
Attachments:	Testcase regression testcase OT patch, ignore

Description Dik Takken 2004-06-13 00:37:55 UTC

Version:            (using KDE KDE 3.2.2)
Installed from:    Compiled From Sources
OS:                Linux

The attached testcase shows strange behavior when no closing //--> is present. I don't know if this is supposed to work, someone with access to Mozilla and IE should their behavior first.

Comment 1 Dik Takken 2004-06-13 00:38:39 UTC

Created attachment 6338 [details]
Testcase

Comment 2 Timothy D. Ginn 2004-06-23 22:28:44 UTC

The Mozilla behaviour is to just display "Test".

The Konqueror behaviour is to display:
Test " document.write(content);

Comment 3 Dik Takken 2004-07-23 13:49:09 UTC

Problem still present on KDE 3.3 Beta 1.

Comment 4 Dik Takken 2005-12-10 20:31:50 UTC

No improvements in KDE 3.5.

Comment 5 Allan Sandfeld 2006-10-29 21:17:22 UTC

Nasty..

Okay the "right" thing is to ignore comments inside the <script> tag. This is what Firefox does.

MSIE on the other hand parses comments inside scripts in some weird manor. Basically:
<script>
<!--
  myjscode()
</script>
-->

Will not execute anything in either MSIE or opera, but if you don't close the comments it will.

Comment 6 Michael Leupold 2008-04-06 08:49:32 UTC

Confirmed on trunk r794020.
Not sure what the behaviour is supposed to be.

Comment 7 Allan Sandfeld 2009-12-11 18:47:19 UTC

The defacto standard behavior depends on parsing the entire document.

Essentially: If a comment is never closed, the parser moves back and autocloses the comment immediately, and then reparses the rest of the document.  We already have something similar for some other cases.

Comment 8 Jaime Torres 2009-12-15 21:34:50 UTC

SVN commit 1062741 by jtamate:

BUG: 83284
BUG: 161409

&lt;-- is no longer a comment start.

http://reviewboard.kde.org/r/2358/

 M  +1 -1      htmltokenizer.cpp  


WebSVN link: http://websvn.kde.org/?view=rev&revision=1062741

Comment 9 Andrea Iacovitti 2009-12-18 15:38:23 UTC

Created attachment 39152 [details]
regression testcase

I observed a regression due to revision 1062741.
Attached a reduced testcase.
Have to open a new bug report?

Comment 10 Germain Garand 2010-01-05 03:58:47 UTC

as noted by Andrea, this patch is bound to cause regressions unfortunately.

IIUC, it is based on the assumption that the comment detection code was wrongly looking for '<!-' instead of '<!--'.

But that is not what the original code was doing : if you may observe, the current character is compared to '-' too, in addition to the buffer, so as to have the correct '<!--' start sequence. The patch makes that '<!---' and thus unwillingly disables comment detection within some special tags.

I believe the original problem is the tokenizer is trying to do some half-assed second-level tokenizing within some tags (title, script, etc) to try to detect comments in unusual situations.
But it does that naïvely, without switching to any special state, so it is blind to any ordinary entity escaping.

Comment 11 Germain Garand 2010-01-06 05:55:40 UTC

SVN commit 1070587 by ggarand:

change r1062741 to rather skip comment detection in |title| altogether,
in order to avoid some regressions in |script| tag parsing.

This should be functionally equivalent (see comments on #83284)
and is actually closer to the behaviour of Gecko.

CCBUG: 83284
CCBUG: 161409

 M  +1 -1      htmltokenizer.cpp  


WebSVN link: http://websvn.kde.org/?view=rev&revision=1070587

Comment 12 Germain Garand 2010-01-06 07:16:30 UTC

Some notes : it turns out the lax tokenizing of comments in parseSpecial was on purpose.

<title> and some other have a special legacy and must absorb anything until more or less properly closed.

Consider:
data:text/html,<title>some <silly &lt;&gt;> demonstration<!-- this won't be parsed as a comment node --></title <div>Lorem

As you can see, the title representation of this mess is very interoperable between browsers as:

"some <silly <>> demonstration<!-- this won't be parsed as a comment node -->"

Anyway, as the result of comment tokenizing was discarded in the context of title, it is indeed alright to skip the comment detection for it, as did the original patch by accident.

---
Note to self : need to make a proper ecma wrapper object for comments. Right now they are [object CharacterImp] instead of [object Comment].

Comment 13 Germain Garand 2010-01-19 21:34:57 UTC

while I still think about that : this fix still isn't right.

data:text/html,<title>testcase <!-- </title> --> failed, not in title!</title>

must keep track of last encountered entity when checking for comment start.

> ---
> Note to self : need to make a proper ecma wrapper object for comments. Right
> now they are [object CharacterImp] instead of [object Comment].
>

oh, that too... 
@Maksim : is the (to be attached) correct or is there a simpler way to do that? I couldn't seem to make it more minimal, even though the interface for Comment is empty...

Comment 14 Germain Garand 2010-01-19 21:36:46 UTC

Created attachment 40057 [details]
OT patch, ignore

Comment 15 Germain Garand 2010-01-22 22:13:00 UTC

SVN commit 1078786 by ggarand:

fix many issues in the parser/tokenizer:

.correctly fix tricky issue mentioned in #83284 /wrt to wrong tokenizing of
 lt entity + '!--' as a comment start, by tracking if we are far enough
 from the last encountered entity to be able to look back with confidence
 (also fix similar issue with end-tags:
     <title>&lt;title>should be in title</title>)
.fix crash in some instances of <script .../> due to |script| bool member
 not being set (and thus triggering wrong code path in notifyFinished)
.remove handling of unclosed script elements with 'src' attribute, as it is
 no longer compatible with other UAs and prevents us from correctly
 handling broken comments in |script| elements' content (WONTFIX #58420).
.fix handling of unclosed comments in script elements(#83284 per se -> #c1)
.fix handling of unclosed title element:
  <title>One<p>Two.
.rename |parseSpecial| to more telling |parseRawContent|, along with all
 its dependent members ; add some comments explaining what's
 going on, as well as the purpose of the finish() hackery

CCBUG: 83284
CCBUG: 58420

 M  +3 -0      htmlparser.cpp  
 M  +116 -104  htmltokenizer.cpp  
 M  +15 -15    htmltokenizer.h  


WebSVN link: http://websvn.kde.org/?view=rev&revision=1078786