Summary: | DOCTYPE not correctly parsed | ||
---|---|---|---|
Product: | [Applications] konqueror | Reporter: | Alex Hermann <gaaf> |
Component: | khtml parsing | Assignee: | Konqueror Developers <konq-bugs> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | deletesoftware, kde |
Priority: | NOR | ||
Version: | 3.3 | ||
Target Milestone: | --- | ||
Platform: | unspecified | ||
OS: | Linux | ||
Latest Commit: | Version Fixed In: | ||
Sentry Crash Report: | |||
Attachments: |
Adds doctype node parsing
Support for parsing doctype faulty testcase Another doctype parsing support |
Description
Alex Hermann
2004-11-04 10:35:08 UTC
Hmm, if it's served as XML it works fine. But not if parsed as XHTML. Well, i can confirm that, but is that intentional? Shouldn't khmtl ignore the (server provided) MIME type and just use what's specified in the file? No. The mimetype is very important actually. Read through http://www.w3.org/TR/xhtml-media-types/ We "just" need to fix the HTML parser to ignore XML constructs. If it is served as 'application/xml' the same error occurs. The error would be less visible if khtml would put more mime types in its Accept: header. Now khtml never receives application/xhtml+xml from most servers, as the Accept: header does not contain it and a lot of servers don't serve it then. Firefox does get the xml/xhtml related mime types from those servers because of their more specific Accept: header. khtml: Accept: text/html, image/jpeg, image/png, text/*, image/*, */* firefox: Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5 The problem is that XHTML is currently parsed using the HTML parser. Once fixed this problem will go away. Created attachment 24290 [details]
Adds doctype node parsing
Currently KHTML omits doctype declaration when parsing (x)html. I've done a
patch with some WebCore code which fixes that. Note that it also fixes Acid3
18th test - now konqueror passes 74 tests.
It's my first patch to KDE, so I don't know it's good or not :)
Thanks for the patch. This is actually more important than it seems due to http://bugs.kde.org/show_bug.cgi?id=153827#c13 --- the current parsing spot means that the mode switching detection can fail. Allan, could you perhaps take a look? I think the patch is a great intermediary step... It doesn't try to fully replace the previous logic, just makes parallel parsing, so it is low risk and may be immediately backported. Then we may improve upon and get read entirely of parseDocTypeDeclaration in trunk, which is simple once the tokenizing logic is in place. There is one problem when removing parseDocTypeDeclaration. It is called before parsing has began, so parse mode should be determinated not in determinateParseMode, but in parseDoctypeToken, and it'll be not known before doctype declaration. > There is one problem when removing parseDocTypeDeclaration. It is called before parsing has began, yes... determineParseMode call in KHTMLPart should be removed > so parse mode should be determinated not in determinateParseMode, but in parseDoctypeToken certainly... that, or you could call determineParseMode from there... but having all the logic in the tokenizer would be cleaner indeed. > it'll be not known before doctype declaration. right, and when you think about it, it is exactly the way it should be :) Created attachment 24349 [details]
Support for parsing doctype
Ok, I've done another patch (my previous patch should be omitted).
HTMLTokenizer::parseDoctype() code is quite nasty (adding comment support was
not easy) but it works perfectly. I also fix debugDOMTree (earlier it dumps
only firstChild of document, not whole document) and of course I moved parse
mode determining to HTMLParser::parseDoctypeToken() (I moved few lines of code
to new HTMLDocumentImpl method, modesChanged() and I don't know it's good
resolution or not).
wow, great work Gustaw... the patch is extremely nice.
I'll give it a testregression run later today - got to cleanup my baseline first.
Good catch about reimplementing the intra-doctype comments : they are only really supported by Gecko - but as most doctype parsing errors are ignored by MSIE6, the net effect is that sgml comments inside the doctype don't make the detection fail in either of those browsers, so it's great we have that in place.
> (my previous patch should be omitted)
it still looks to me like a perfect fit for the stable 4.0.x branch :)
> > (my previous patch should be omitted) > > it still looks to me like a perfect fit for the stable 4.0.x branch :) Right, but some changes from my new patch (e.g. htmltokenizer.*) should be backported too (it's comment handling + some clean). I would encourage the move of docType parsing to the tokenizer to go into 4.0.x. We have a bad bug in that we sometimes don't get enough data for it to work, and it /probably/ affects CNN.com, though I am not sure due to reproducibility problems. the regression test output is mostly OK provided the following is changed (otherwise our htmlHacks() style flag won't update) @@ -528,10 +291,8 @@ void HTMLDocumentImpl::determineParseMode( const QString &str ) else kDebug(6030) << " using transitional parseMode"; - // not sure this is needed if ( pMode != oldPMode && styleSelector() ) - recalcStyleSelector(); - + updateStyleSelector(true/*shallow*/); I see a remaining issue though. In the below testcase, the empty table cell should be rendered without borders in quirks mode, and that's not the case with your patch + the above fix. <body style=margin:0><h3> <script> if (document.compatMode) document.write( document.compatMode == "CSS1Compat" ? "strict mode":"quirks mode" ) </script> <table border=1><tr><td>a</td><td>b</td><td></td></tr><tr><td>a</td><td>b</td><td>c</td></tr></table> Maksim: mmh, right, I'd have preferred we fix the bug instead though. It's going to hit us back at some point ;-/ Well, Allan pointed out that doing the parseMode detection in the tokenizer is the right approach, and I agree; and that'd remove the entire dependency on the buffer size, hence fixing the non-crash portion of #153827. Reading the patch it seems to do that, so... > I see a remaining issue though.
> In the below testcase, the empty table cell should be rendered without borders
> in quirks mode, and that's not the case with your patch + the above fix.
In HTMLDocumentImpl::determinateParseMode() we should call modesChanged() if m_doctype is null (with old parse mode like in HTMLParser::parseDoctypeToken()).
In other words: @@ -266,8 +266,10 @@ void HTMLDocumentImpl::determineParseMod { if (m_doctype == 0) { // Currently we haven't got any doctype, so default to quirks mode and Html4 + ParseMode oldPMode = pMode; pMode = Compat; hMode = Html4; + modesChanged(oldPMode); } } This causes testcase be rendrered good. Re #18: yes, nice, it does seem to make testregression happy :) there are two new things I noticed however, that maybe the same bug: a) some sort of race condition on a few testcases, that make it sometime ignore the strict mode on first load See for instance parser/compatmode_xhtml_strict.xhtml in the tests (I'll attach it here so you don't have to checkout the whole test suite) b) I discovered the konqueror information page looks seriously broken (Help->Konqueror Introduction). Very strange. Can you reproduce that? Created attachment 24415 [details] faulty testcase baseline output is: "IMG: 3 elements found (3 in Mozilla-1.6 - Mozilla BUG 264609) img: 3 elements found tagname: IMG CSS1Compat (CSS1Compat in Mozilla) No blue background, due to strict mode" now it will sometime (on forced reload, putting cursor in address bar and pressing enter) display as: "IMG: 0 elements found (3 in Mozilla-1.6 - Mozilla BUG 264609) img: 3 elements found tagname: img CSS1Compat (CSS1Compat in Mozilla) No blue background, due to strict mode" > a) some sort of race condition on a few testcases, that make it sometime ignore the strict mode on first load > See for instance parser/compatmode_xhtml_strict.xhtml in the tests (I'll attach it here so you don't have to checkout the whole test suite) I can't reproduce that. I reload this testcase page, using refresh button or pressing enter in address bar, and it's always displayed correctly. I've even downloaded page and tried open it locally - nothing changes. > now it will sometime (on forced reload, putting cursor in address bar and > pressing enter) display as: > > "IMG: 0 elements found (3 in Mozilla-1.6 - Mozilla BUG 264609) > img: 3 elements found > tagname: img > CSS1Compat (CSS1Compat in Mozilla) > > No blue background, due to strict mode" It's now parsed as xhtml... But why? > b) I discovered the konqueror information page looks seriously broken (Help->Konqueror Introduction). Very strange. Can you reproduce that? Yes... But I really don't know why it's rendered wrong... It's very complicated... Some parts of code in KHTMLPart call determineParseMode before parsing the document, but some not. I don't know how to fix it... Maybe we should delete determineParseMode at all? When I remove doctype from $KDEDIR/share/apps/konqeror/about/launch.html then information page is rendered correctly... Without my patch determineParseMode is not called at all, so all stuff in modesChanged is omitted (and maybe it should be omitted when parsing information page?). Created attachment 24515 [details]
Another doctype parsing support
I've done a patch that fixes konqueror information page wrong rendering. But I
still don't know why KHTMLPart::write(const char*, int) calls
determineParseMode, and KHTMLPart::write(const QString&) assumes strict
parsemode without calling determineParseMode...
sorry to get back to you only now.. I have been fairly busy and couldn't run the regression suite for your latest patch.
Its currently running. If the output is satisfying, I'll commit your patch this evening.
> But I still don't know why KHTMLPart::write(const char*, int) calls
> determineParseMode, and KHTMLPart::write(const QString&) assumes strict
> parsemode without calling determineParseMode...
it looks like an 8 year old misguided decision nobody ever complained about. Will need to add a write(const QString&, bool honourDoctype) method to allow changing that behaviour without breaking backward compatibility.
SVN commit 810117 by ggarand: Full parsing of doctype by the HTML Tokenizer patch by Gustaw Smolarczy <kwielkiegie at gmail dt com>, improving upon a webkit patch slightly modified to provide a turnaround for broken behaviour of KHTMLPart::write() noticed by Gustaw. Fixing Acid 3's 18th test. BUG: 92670 CCBUG: 156947 M +14 -247 html/html_documentimpl.cpp M +5 -1 html/html_documentimpl.h M +76 -0 html/htmlparser.cpp M +6 -0 html/htmlparser.h M +324 -3 html/htmltokenizer.cpp M +58 -0 html/htmltokenizer.h M +19 -9 khtml_part.cpp M +21 -4 khtml_part.h M +2 -0 khtmlpart_p.h M +5 -18 xml/dom_docimpl.cpp M +2 -1 xml/dom_docimpl.h WebSVN link: http://websvn.kde.org/?view=rev&revision=810117 |