Bug 92670

Summary: DOCTYPE not correctly parsed
Product: [Applications] konqueror Reporter: Alex Hermann <gaaf>
Component: khtml parsingAssignee: Konqueror Developers <konq-bugs>
Status: RESOLVED FIXED    
Severity: normal CC: deletesoftware, kde
Priority: NOR    
Version: 3.3   
Target Milestone: ---   
Platform: unspecified   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:
Attachments: Adds doctype node parsing
Support for parsing doctype
faulty testcase
Another doctype parsing support

Description Alex Hermann 2004-11-04 10:35:08 UTC
Version:           3.3 (using KDE 3.3.1,  (3.1))
Compiler:          gcc version 3.3.5 (Debian 1:3.3.5-2)
OS:                Linux (i686) release 2.6.9

The following document is not parsed correctly:

--------
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/tr/xhtml11/DTD/xhtml11.dtd"
  [ <!ATTLIST a target CDATA #IMPLIED> ]
>

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="nl">

<head>
<title>Test DocType</title>
</head>

<body>
<p>Test link: <a href="http://www.google.com" target="_blank">Google</a></p>

</body>

</html>
---------

Before the 'Test...' text, a line with '] >' is printed, which shouldn't. The 'target' attribute for the 'a' tag is working as expected.

Mozilla and Opera parse/render this correctly.

(The testdocument is available on: http://home.student.utwente.nl/a.hermann/testdoctype.html )


Alex.
Comment 1 Stephan Kulow 2004-11-04 13:52:47 UTC
Hmm, if it's served as XML it works fine. But not if parsed as XHTML.
Comment 2 Alex Hermann 2004-11-04 14:17:57 UTC
Well, i can confirm that, but is that intentional? Shouldn't khmtl ignore the (server provided) MIME type and just use what's specified in the file?
Comment 3 Stephan Kulow 2004-11-04 16:10:02 UTC
No. The mimetype is very important actually. Read through http://www.w3.org/TR/xhtml-media-types/

We "just" need to fix the HTML parser to ignore XML constructs.
Comment 4 Alex Hermann 2004-11-04 21:24:01 UTC
If it is served as 'application/xml' the same error occurs.

The error would be less visible if khtml would put more mime types in its Accept: header. Now khtml never receives application/xhtml+xml from most servers, as the Accept: header does not contain it and a lot of servers don't serve it then. Firefox does get the xml/xhtml related mime types from those servers because of their more specific Accept: header.

khtml:
Accept: text/html, image/jpeg, image/png, text/*, image/*, */*

firefox:
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5

Comment 5 Allan Sandfeld 2006-06-16 14:15:25 UTC
The problem is that XHTML is currently parsed using the HTML parser. Once fixed this problem will go away.
Comment 6 Gustaw Smolarczyk 2008-04-10 18:58:00 UTC
Created attachment 24290 [details]
Adds doctype node parsing

Currently KHTML omits doctype declaration when parsing (x)html. I've done a
patch with some WebCore code which fixes that. Note that it also fixes Acid3
18th test - now konqueror passes 74 tests.

It's my first patch to KDE, so I don't know it's good or not :)
Comment 7 Maksim Orlovich 2008-04-10 19:28:00 UTC
Thanks for the patch. This is actually more important than it seems due to http://bugs.kde.org/show_bug.cgi?id=153827#c13 --- the current parsing spot means that the mode switching detection can fail. Allan, could you perhaps take a look?
Comment 8 Germain Garand 2008-04-15 14:16:35 UTC
I think the patch is a great intermediary step... It doesn't try to fully replace the previous logic, just makes parallel parsing, so it is low risk and may be immediately backported.

Then we may improve upon and get read entirely of parseDocTypeDeclaration in trunk, which is simple once the tokenizing logic is in place.

Comment 9 Gustaw Smolarczyk 2008-04-15 15:33:38 UTC
There is one problem when removing parseDocTypeDeclaration. It is called before parsing has began, so parse mode should be determinated not in determinateParseMode, but in parseDoctypeToken, and it'll be not known before doctype declaration.
Comment 10 Germain Garand 2008-04-15 16:59:04 UTC
> There is one problem when removing parseDocTypeDeclaration. It is called before parsing has began, 

yes... determineParseMode call in KHTMLPart should be removed 

> so parse mode should be determinated not in determinateParseMode, but in parseDoctypeToken

certainly... that, or you could call determineParseMode from there... but having all the logic in the tokenizer would be cleaner indeed.

> it'll be not known before doctype declaration.

right, and when you think about it, it is exactly the way it should be :)
Comment 11 Gustaw Smolarczyk 2008-04-15 23:01:40 UTC
Created attachment 24349 [details]
Support for parsing doctype

Ok, I've done another patch (my previous patch should be omitted).
HTMLTokenizer::parseDoctype() code is quite nasty (adding comment support was
not easy) but it works perfectly. I also fix debugDOMTree (earlier it dumps
only firstChild of document, not whole document) and of course I moved parse
mode determining to HTMLParser::parseDoctypeToken() (I moved few lines of code
to new HTMLDocumentImpl method, modesChanged() and I don't know it's good
resolution or not).
Comment 12 Germain Garand 2008-04-16 15:52:18 UTC
wow, great work Gustaw... the patch is extremely nice.
I'll give it a testregression run later today - got to cleanup my baseline first.

Good catch about reimplementing the intra-doctype comments : they are only really supported by Gecko - but as most doctype parsing errors are ignored by MSIE6, the net effect is that sgml comments inside the doctype don't make the detection fail in either of those browsers, so it's great we have that in place.

> (my previous patch should be omitted)

it still looks to me like a perfect fit for the stable 4.0.x branch :)
Comment 13 Gustaw Smolarczyk 2008-04-16 21:52:12 UTC
> > (my previous patch should be omitted) 
>
> it still looks to me like a perfect fit for the stable 4.0.x branch :) 

Right, but some changes from my new patch (e.g. htmltokenizer.*) should be backported too (it's comment handling + some clean).
Comment 14 Maksim Orlovich 2008-04-16 22:18:51 UTC
I would encourage the move of docType parsing to the tokenizer to go into 4.0.x. We have a bad bug in that we sometimes don't get enough data for it to work, and it /probably/ affects CNN.com, though I am not sure due to reproducibility problems.
Comment 15 Germain Garand 2008-04-17 19:47:45 UTC
the regression test output is mostly OK provided the
following is changed (otherwise our htmlHacks() style flag won't update)

@@ -528,10 +291,8 @@ void HTMLDocumentImpl::determineParseMode( const QString &str )
     else
         kDebug(6030) << " using transitional parseMode";

-    // not sure this is needed
     if ( pMode != oldPMode && styleSelector() )
-        recalcStyleSelector();
-
+        updateStyleSelector(true/*shallow*/);


I see a remaining issue though.
In the below testcase, the empty table cell should be rendered without borders
in quirks mode, and that's not the case with your patch + the above fix.

<body style=margin:0><h3>
<script>
if (document.compatMode)
    document.write( document.compatMode == "CSS1Compat" ? "strict mode":"quirks mode" )
</script>
<table border=1><tr><td>a</td><td>b</td><td></td></tr><tr><td>a</td><td>b</td><td>c</td></tr></table>

Maksim: mmh, right, I'd have preferred we fix the bug instead though. It's going to hit us back at some point ;-/
Comment 16 Maksim Orlovich 2008-04-17 20:04:09 UTC
Well, Allan pointed out that doing the parseMode detection in the tokenizer is the right approach, and I agree; and that'd remove the entire dependency on the buffer size, hence fixing the non-crash portion of #153827. Reading the patch it seems to do that, so...
Comment 17 Gustaw Smolarczyk 2008-04-18 15:20:51 UTC
> I see a remaining issue though.
> In the below testcase, the empty table cell should be rendered without borders
> in quirks mode, and that's not the case with your patch + the above fix.

In HTMLDocumentImpl::determinateParseMode() we should call modesChanged() if m_doctype is null (with old parse mode like in HTMLParser::parseDoctypeToken()).
Comment 18 Gustaw Smolarczyk 2008-04-19 11:22:11 UTC
In other words:

@@ -266,8 +266,10 @@ void HTMLDocumentImpl::determineParseMod
 {
     if (m_doctype == 0) {
         // Currently we haven't got any doctype, so default to quirks mode and Html4
+        ParseMode oldPMode = pMode;
         pMode = Compat;
         hMode = Html4;
+        modesChanged(oldPMode);
     }
 }
 

This causes testcase be rendrered good.
Comment 19 Germain Garand 2008-04-19 22:15:30 UTC
Re #18: yes, nice, it does seem to make testregression happy :)

there are two new things I noticed however, that maybe the same bug:

a) some sort of race condition on a few testcases, that make it sometime ignore the strict mode on first load
See for instance parser/compatmode_xhtml_strict.xhtml in the tests (I'll attach it here so you don't have to checkout the whole test suite)

b) I discovered the konqueror information page looks seriously broken (Help->Konqueror Introduction). Very strange. Can you reproduce that?
Comment 20 Germain Garand 2008-04-19 22:21:20 UTC
Created attachment 24415 [details]
faulty testcase

baseline output is:

"IMG: 3 elements found (3 in Mozilla-1.6 - Mozilla BUG 264609)
 img: 3 elements found
 tagname: IMG
 CSS1Compat (CSS1Compat in Mozilla)
 
 No blue background, due to strict mode"

now it will sometime (on forced reload, putting cursor in address bar and
pressing enter) display as:

"IMG: 0 elements found (3 in Mozilla-1.6 - Mozilla BUG 264609)
 img: 3 elements found
 tagname: img
 CSS1Compat (CSS1Compat in Mozilla)
 
 No blue background, due to strict mode"
Comment 21 Gustaw Smolarczyk 2008-04-19 22:55:04 UTC
> a) some sort of race condition on a few testcases, that make it sometime ignore the strict mode on first load
> See for instance parser/compatmode_xhtml_strict.xhtml in the tests (I'll attach it here so you don't have to checkout the whole test suite)

I can't reproduce that. I reload this testcase page, using refresh button or pressing enter in address bar, and it's always displayed correctly. I've even downloaded page and tried open it locally - nothing changes.

> now it will sometime (on forced reload, putting cursor in address bar and
> pressing enter) display as:
> 
> "IMG: 0 elements found (3 in Mozilla-1.6 - Mozilla BUG 264609)
>  img: 3 elements found
>  tagname: img
>  CSS1Compat (CSS1Compat in Mozilla)
>   
>  No blue background, due to strict mode"

It's now parsed as xhtml... But why?

> b) I discovered the konqueror information page looks seriously broken (Help->Konqueror Introduction). Very strange. Can you reproduce that?

Yes... But I really don't know why it's rendered wrong...
Comment 22 Gustaw Smolarczyk 2008-04-20 00:17:24 UTC
It's very complicated... Some parts of code in KHTMLPart call determineParseMode before parsing the document, but some not. I don't know how to fix it... Maybe we should delete determineParseMode at all?

When I remove doctype from $KDEDIR/share/apps/konqeror/about/launch.html then information page is rendered correctly... Without my patch determineParseMode is not called at all, so all stuff in modesChanged is omitted (and maybe it should be omitted when parsing information page?).
Comment 23 Gustaw Smolarczyk 2008-04-26 14:11:37 UTC
Created attachment 24515 [details]
Another doctype parsing support

I've done a patch that fixes konqueror information page wrong rendering. But I
still don't know why KHTMLPart::write(const char*, int) calls
determineParseMode, and KHTMLPart::write(const QString&) assumes strict
parsemode without calling determineParseMode...
Comment 24 Germain Garand 2008-05-19 20:49:50 UTC
sorry to get back to you only now.. I have been fairly busy and couldn't run the regression suite for your latest patch.
Its currently running. If the output is satisfying, I'll commit your patch this evening.

>  But I still don't know why KHTMLPart::write(const char*, int) calls 
> determineParseMode, and KHTMLPart::write(const QString&) assumes strict
> parsemode without calling determineParseMode... 

it looks like an 8 year old misguided decision nobody ever complained about. Will need to add a write(const QString&, bool honourDoctype) method to allow changing that behaviour without breaking backward compatibility.

Comment 25 Germain Garand 2008-05-20 05:11:05 UTC
SVN commit 810117 by ggarand:

Full parsing of doctype by the HTML Tokenizer

patch by Gustaw Smolarczy <kwielkiegie at gmail dt com>,
improving upon a webkit patch

slightly modified to provide a turnaround for broken behaviour of
KHTMLPart::write() noticed by Gustaw.

Fixing Acid 3's 18th test.

BUG: 92670
CCBUG: 156947


 M  +14 -247   html/html_documentimpl.cpp  
 M  +5 -1      html/html_documentimpl.h  
 M  +76 -0     html/htmlparser.cpp  
 M  +6 -0      html/htmlparser.h  
 M  +324 -3    html/htmltokenizer.cpp  
 M  +58 -0     html/htmltokenizer.h  
 M  +19 -9     khtml_part.cpp  
 M  +21 -4     khtml_part.h  
 M  +2 -0      khtmlpart_p.h  
 M  +5 -18     xml/dom_docimpl.cpp  
 M  +2 -1      xml/dom_docimpl.h  


WebSVN link: http://websvn.kde.org/?view=rev&revision=810117