Bug 135505 - CSS attribute selectors should be case-insensitive in a HTML document
Summary: CSS attribute selectors should be case-insensitive in a HTML document
Status: RESOLVED FIXED
Alias: None
Product: konqueror
Classification: Applications
Component: khtml parsing
Version: unspecified
Platform: Fedora RPMs Linux
Importance: NOR normal
Target Milestone: ---
Assignee: Konqueror Developers
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-10-12 13:55 UTC by Niels Leenheer (rakaz)
Modified: 2006-10-20 22:26 UTC

See Also:
Latest Commit:
Version Fixed In:


Attachments
testing some attributes - quirkmode (915 bytes, text/html)
2006-10-13 00:41 UTC, Germain Garand
testing some attributes - strict mode (1006 bytes, text/html)
2006-10-13 00:49 UTC, Germain Garand
patch (3.45 KB, patch)
2006-10-17 03:09 UTC, Germain Garand

Description Niels Leenheer (rakaz) 2006-10-12 13:55:14 UTC
Version:            (using KDE KDE 3.5.4)
Installed from:    Fedora RPMs
OS:                Linux

The following CSS attribute selectors are currently case-sensitive: 

- E[attribute~=value], 
- E[attribute|=value], 
- E[attribute^=value], 
- E[attribute$=value] and 
- E[attribute*=value]

This is not the correct behavior for plain HTML documents. 
In HTML documents the values of most attributes are case-insensitive.

Testcases: 
http://www.css3.info/selectors-test/
http://www.css3.info/selectors-test/test-attribute-space.html
http://www.css3.info/selectors-test/test-attribute-hyphen.html
http://www.css3.info/selectors-test/test-attribute-begin.html
http://www.css3.info/selectors-test/test-attribute-end.html
http://www.css3.info/selectors-test/test-attribute-contains.html
Comment 1 Allan Sandfeld 2006-10-12 14:35:16 UTC
If true this is a regression. We have the code to select between case sensitivity. 
Comment 2 Niels Leenheer (rakaz) 2006-10-12 15:21:52 UTC
I looked at the code for 3.5.4 yesterday and I did notice there is support for selecting between case sensitivity. For the E[attribute=value] selector this works properly. I'm not sure if it was present for the hyphen and space selectors. These could be a regression.

For the begin, end and contains selectors I am sure it was not used at all. These last three use a simple QString::startsWith, QString::endsWith and a QString::contains check. With Qt 3 these are all case-sensitive. The ability to choose whether it should be case sensitive or not was added in Qt 4.
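
As a minimal sketch of what a case-sensitivity switch for the begin, end and contains selectors could look like, here is a standalone C++ version that uses only the standard library instead of QString. The helper names (sameChar, beginsWith, endsWith, containsText) are hypothetical and not taken from KHTML, and only ASCII case folding is handled.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: compare two characters, optionally ignoring ASCII case.
static bool sameChar(char a, char b, bool caseSensitive) {
    if (caseSensitive) return a == b;
    return std::tolower(static_cast<unsigned char>(a)) ==
           std::tolower(static_cast<unsigned char>(b));
}

// E[attribute^=value]: the attribute value starts with the selector value.
bool beginsWith(const std::string& value, const std::string& sel, bool caseSensitive) {
    if (sel.size() > value.size()) return false;
    return std::equal(sel.begin(), sel.end(), value.begin(),
                      [caseSensitive](char a, char b) { return sameChar(a, b, caseSensitive); });
}

// E[attribute$=value]: the attribute value ends with the selector value.
bool endsWith(const std::string& value, const std::string& sel, bool caseSensitive) {
    if (sel.size() > value.size()) return false;
    return std::equal(sel.rbegin(), sel.rend(), value.rbegin(),
                      [caseSensitive](char a, char b) { return sameChar(a, b, caseSensitive); });
}

// E[attribute*=value]: the selector value occurs somewhere in the attribute value.
bool containsText(const std::string& value, const std::string& sel, bool caseSensitive) {
    auto it = std::search(value.begin(), value.end(), sel.begin(), sel.end(),
                          [caseSensitive](char a, char b) { return sameChar(a, b, caseSensitive); });
    return it != value.end();
}
```

The same flag is passed to all three checks, so the caller decides once per selector whether case matters.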
Comment 3 Germain Garand 2006-10-12 16:12:51 UTC
Though it's true we aren't case insensitive at all for some exotic selectors (and that should be fixed), we would need to check again what the specification exactly says with regard to case significance of an attribute's value in HTML.

All the failing tests I can see state:
"The CSS selector should match the HTML fragment because the case of the value should not matter in a HTML document "

Well, this is not true for the class attribute, whose value is case sensitive in HTML in any non-loose mode in any browser, as you can verify:

data:text/html,
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<style>
  .green { color:green; font-weight: bold;}
</style>
<div class=GREEN>Not so green..

(in non-css-impaired browsers, you can of course replace .green with [class~="green"] with identical results)

So I'm not quite sure who is right or if the class attribute is special cased for compatibility reasons...
Comment 4 Maksim Orlovich 2006-10-12 16:17:20 UTC
The implementation of the class attribute is actually -partly- case insensitive, ditto for list. Ick.

Comment 5 Maksim Orlovich 2006-10-12 16:19:51 UTC
Actually, no, it's fully case insensitive, missed the strictParsing argument to find...
Comment 6 Maksim Orlovich 2006-10-12 16:32:41 UTC
Re-tested: @Germain: this is a strict vs. quirks mode thing for HTML. In quirks classname (and ID, etc.) act case-insensitively. Hence the dependence on the strictParsing flag
Comment 7 Niels Leenheer (rakaz) 2006-10-12 23:19:11 UTC
I believe there are three different modes:

a) Quirks mode 
Treat everything case-insensitive

b) Strict mode
All attributes should be considered case-insensitive, with the exception of the following:

- title
- name
- content
- scheme
- id
- class
- datetime
- summary
- headers
- abbr
- standby
- code
- object
- alt
- value
- label
- prompt
- for

c) XHTML documents
Treat everything case-sensitive.
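
The three modes proposed in this comment could be sketched roughly as follows. This is only an illustration of the proposal, not KHTML code: the DocMode enum and valueIsCaseSensitive function are hypothetical names, and the attribute name is assumed to be already lowercased.

```cpp
#include <set>
#include <string>

// Hypothetical document modes, following the a/b/c list above.
enum class DocMode { Quirks, StrictHtml, Xhtml };

// The strict-mode exceptions listed in the comment above.
static const std::set<std::string> kStrictCaseSensitive = {
    "title", "name", "content", "scheme", "id", "class",
    "datetime", "summary", "headers", "abbr", "standby", "code",
    "object", "alt", "value", "label", "prompt", "for"};

// Returns true when the attribute *value* should be compared case-sensitively.
bool valueIsCaseSensitive(DocMode mode, const std::string& attr) {
    switch (mode) {
    case DocMode::Quirks:
        return false;  // a) everything case-insensitive
    case DocMode::Xhtml:
        return true;   // c) everything case-sensitive
    case DocMode::StrictHtml:
        return kStrictCaseSensitive.count(attr) != 0;  // b) only the exceptions
    }
    return true;  // not reached; keeps compilers quiet
}
```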
Comment 8 Germain Garand 2006-10-13 00:41:08 UTC
Created attachment 18100 [details]
testing some attributes - quirkmode
Comment 9 Germain Garand 2006-10-13 00:49:32 UTC
Created attachment 18101 [details]
testing some attributes -  strict mode

Testing HTML:
apparently, for Gecko and Opera browsers, strict/loose doctype only makes a
difference for the "." selector (case sensitive for trans/strict doctype, case
insensitive otherwise) which is then not consistent in behaviour with the
[class~=] selector (!!)

@Niels: from what did you infer that list, and what's the rationale w.r.t. the
specification? from the few attributes I tested, your list seems consistent
with Opera 9 behaviour, but not with Gecko.
Comment 10 Niels Leenheer (rakaz) 2006-10-13 08:29:38 UTC
The list is based on the attributes marked with [CS] or [CA] in the HTML 4 specification:
http://www.w3.org/TR/html4/types.html#case-sensitive

Apart from these, which should definitely be treated case-sensitive, there is also a category for which it is a bit unclear how they should be treated. I suppose they should be treated case-sensitive also - just to be on the safe side.

Marked with [CT]: profile, background, cite, href, src, longdesc, classid, codebase, data, archive, usemap, action, onload, onunload, onclick, ondblclick, onmousedown, onmouseup, onmouseover, onmousemove, onmouseout, onfocus, onblur, onkeypress, onkeydown, onkeyup, onsubmit, onreset, onselect, onchange

All of these attributes can contain a script, uri or uri list. 

http://www.w3.org/TR/html4/types.html#type-uri:
"URIs in general are case-sensitive. There may be URIs, or parts of URIs, where case doesn't matter (e.g., machine names), but identifying these may not be easy. Users should always consider that URIs are case-sensitive (to be on the safe side)." 

http://www.w3.org/TR/html4/types.html#type-script:
"The case-sensitivity of script data depends on the scripting language."
Comment 11 Allan Sandfeld 2006-10-16 11:32:02 UTC
The HTML attributes should only be special-cased by the HTML parser. It can uppercase or lowercase them before passing them on. 

 So I think for the selector implementation the safest assumption would be to make case-sensitive matching for strict HTML and all XML. This just needs some hacking in the parser then.
Comment 12 Allan Sandfeld 2006-10-16 12:01:33 UTC
SVN commit 595962 by carewolf:

Make sure advanced attribute selectors match case-insensitive in HTML and that
list selectors match case-sensitive in XHTML.
This means attribute selection is still sligtly wrong in Strict HTML, but
matches Firefox behaviour.
CCBUG: 135505


 M  +21 -22    cssstyleselector.cpp  


--- branches/KDE/3.5/kdelibs/khtml/css/cssstyleselector.cpp #595961:595962
@@ -1084,30 +1084,29 @@
 
     if(sel->attr)
     {
+        // attributes are always case-sensitive in XHTML
+        // attributes are sometimes case-sensitive in HTML Strict
+        // ### for now we only treat id and class selector as case-sensitive in HTML strict
+        bool caseSensitive = e->getDocument()->htmlMode() == DocumentImpl::XHtml;
+        bool caseSensitive_alt = strictParsing || caseSensitive;
+
         DOMStringImpl* value = e->getAttributeImpl(sel->attr);
         if(!value) return false; // attribute is not set
 
         switch(sel->match)
         {
-        case CSSSelector::Exact:
-            /* attribute values are case insensitive in all HTML modes,
-               even in the strict ones */
-            if ( e->getDocument()->htmlMode() != DocumentImpl::XHtml ) {
-                if ( strcasecmp(sel->value, value) )
-                    return false;
-            } else {
-                if ( strcmp(sel->value, value) )
-                    return false;
-            }
+        case CSSSelector::Set:
+            // True if we make it this far
             break;
         case CSSSelector::Id:
-	    if( (strictParsing && strcmp(sel->value, value) ) ||
-                (!strictParsing && strcasecmp(sel->value, value)))
-                return false;
+            caseSensitive = caseSensitive_alt;
+            // no break
+        case CSSSelector::Exact:
+            return (caseSensitive && !strcmp(sel->value, value)) ||
+                   (!caseSensitive && !strcasecmp(sel->value, value));
             break;
-        case CSSSelector::Set:
-            break;
         case CSSSelector::Class:
+            caseSensitive = caseSensitive_alt;
             // no break
         case CSSSelector::List:
         {
@@ -1118,8 +1117,8 @@
             // Selector string may not contain spaces
             if ((sel->attr != ATTR_CLASS || e->hasClassList()) && sel->value.find(' ') != -1) return false;
             if (sel_len == val_len)
-                return (strictParsing && !strcmp(sel->value, value)) ||
-		       (!strictParsing && !strcasecmp(sel->value, value));
+                return (caseSensitive && !strcmp(sel->value, value)) ||
+		       (!caseSensitive && !strcasecmp(sel->value, value));
             // else the value is longer and can be a list
             if ( sel->match == CSSSelector::Class && !e->hasClassList() ) return false;
 
@@ -1131,7 +1130,7 @@
 
             int pos = 0;
             for ( ;; ) {
-                pos = val_str.string().find(sel_str.string(), pos, strictParsing);
+                pos = val_str.string().find(sel_str.string(), pos, caseSensitive);
                 if ( pos == -1 ) return false;
                 if ( pos == 0 || val_uc[pos-1].isSpace() ) {
                     int endpos = pos + sel_len;
@@ -1147,21 +1146,21 @@
             //kdDebug( 6080 ) << "checking for contains match" << endl;
             QConstString val_str(value->unicode(), value->length());
             QConstString sel_str(sel->value.unicode(), sel->value.length());
-            return val_str.string().contains(sel_str.string());
+            return val_str.string().contains(sel_str.string(), caseSensitive);
         }
         case CSSSelector::Begin:
         {
             //kdDebug( 6080 ) << "checking for beginswith match" << endl;
             QConstString val_str(value->unicode(), value->length());
             QConstString sel_str(sel->value.unicode(), sel->value.length());
-            return val_str.string().startsWith(sel_str.string());
+            return val_str.string().startsWith(sel_str.string(), caseSensitive);
         }
         case CSSSelector::End:
         {
             //kdDebug( 6080 ) << "checking for endswith match" << endl;
             QConstString val_str(value->unicode(), value->length());
             QConstString sel_str(sel->value.unicode(), sel->value.length());
-            return val_str.string().endsWith(sel_str.string());
+            return val_str.string().endsWith(sel_str.string(), caseSensitive);
         }
         case CSSSelector::Hyphen:
         {
@@ -1172,7 +1171,7 @@
             const QString& selStr = sel_str.string();
             if(str.length() < selStr.length()) return false;
             // Check if str begins with selStr:
-            if(str.find(selStr, 0, strictParsing) != 0) return false;
+            if(str.find(selStr, 0, caseSensitive) != 0) return false;
             // It does. Check for exact match or following '-':
             if(str.length() != selStr.length()
                 && str[selStr.length()] != '-') return false;
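
For readers who do not want to follow the QString calls in the diff, the list (~=) and hyphen (|=) matches with a single caseSensitive flag can be sketched in standalone C++ as below. This is a simplified illustration under stated assumptions, not the KHTML implementation: the function names are hypothetical, words are split on any whitespace, and only ASCII case folding is handled.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper: whole-string comparison with optional ASCII case folding.
static bool equalValues(const std::string& a, const std::string& b, bool caseSensitive) {
    if (a.size() != b.size()) return false;
    for (std::string::size_type i = 0; i < a.size(); ++i) {
        unsigned char x = a[i], y = b[i];
        bool differ = caseSensitive ? (x != y) : (std::tolower(x) != std::tolower(y));
        if (differ) return false;
    }
    return true;
}

// E[attribute~=value]: sel must equal one whitespace-separated word of value.
bool matchesList(const std::string& value, const std::string& sel, bool caseSensitive) {
    // As in the diff above, a selector string containing a space never matches.
    if (sel.empty() || sel.find(' ') != std::string::npos) return false;
    std::istringstream words(value);
    std::string word;
    while (words >> word)
        if (equalValues(word, sel, caseSensitive)) return true;
    return false;
}

// E[attribute|=value]: value equals sel, or starts with sel followed by '-'.
bool matchesHyphen(const std::string& value, const std::string& sel, bool caseSensitive) {
    if (value.size() < sel.size()) return false;
    if (!equalValues(value.substr(0, sel.size()), sel, caseSensitive)) return false;
    return value.size() == sel.size() || value[sel.size()] == '-';
}
```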
Comment 13 Niels Leenheer (rakaz) 2006-10-16 12:34:47 UTC
The method that Firefox uses is slightly more than using case-sensitive for XHTML and case-insensitive for HTML.

During the parsing of the CSS, it will compare the tagname with a built-in list (a subset of the list above). If the tagname is present in the list, it will set a bit that determines whether the selector is evaluated with a case-sensitive or a case-insensitive comparison method.

The reason for comparing during parsing is simple: it avoids having to iterate the list and compare the tagname every time the selector is evaluated. During the evaluation of the selector it only needs to look at the bit and the type of document (XHTML or HTML) to determine which method it should use for the comparison.

The only problem that Firefox has is that its list is not complete. A simple patch will solve that and make Firefox fully compatible. 

As already noted in the current patch for KHTML, it only solves some of the symptoms. It does not make KHTML any more compatible than it currently is.
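
The parse-time approach described in this comment could be sketched as below. This is not Firefox or KHTML code. It assumes that the name looked up in the list is the attribute name (the list referred to is a list of attributes), and the struct, the function names and the short example list are all hypothetical.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical parsed attribute selector with a precomputed flag.
struct AttrSelector {
    std::string attr;
    std::string value;
    bool htmlCaseSensitive;  // set once, when the stylesheet is parsed
};

// Hypothetical, deliberately incomplete list of case-sensitive attributes.
static const std::unordered_set<std::string> kCaseSensitiveAttrs = {
    "id", "class", "title", "alt"};

// Parse time: do the list lookup once and store the result as a single bit.
AttrSelector parseAttrSelector(const std::string& attr, const std::string& value) {
    return AttrSelector{attr, value, kCaseSensitiveAttrs.count(attr) != 0};
}

// Match time: only the stored bit and the document type are consulted.
bool useCaseSensitiveCompare(const AttrSelector& sel, bool documentIsXhtml) {
    return documentIsXhtml || sel.htmlCaseSensitive;
}
```

The list lookup happens once per selector rather than once per element tested, which is the saving the comment describes.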
Comment 14 Niels Leenheer (rakaz) 2006-10-16 19:16:26 UTC
I wrote an article about case sensitivity: There are a lot of new details that may impact this bug.

http://rakaz.nl/item/css_selector_bugs_case_sensitivity

The article contains a complete list of how each element should be treated, for both HTML and XHTML.
Comment 15 Allan Sandfeld 2006-10-16 20:06:07 UTC
You are over interpreting the standard. Standards are buggy and shouldn't be read literally. If you are in contact with most of the browsers developers, then propose an acceptable and SIMPLE suggestion on how to treat attribute selectors.

Compared to your suggestion in comment #7, it seems Firefox treats case-sensitive attributes case-sensitively even in quirks mode (except if you use the class selector). 

As for all the special rules in your article, I would guess it's best to just ignore them and just treat CA,CT and CN as CS.
Comment 16 Niels Leenheer (rakaz) 2006-10-16 21:07:27 UTC
The rules in the article are the literal interpretation of the standard. I do realize it is not a workable solution to the problem. 

What we need is a single simple list that determines how the attribute should be treated - without any exceptions. I believe such a list can be quite simple to derive from the lists in the article.

The problematic attributes are name and type. Simply treat name as case-sensitive. It might not be strictly compatible with the standard, but at least it is workable. The same applies to type, except treat it as case-insensitive. It will give problems for the ol element, but since it is deprecated and replaced by CSS it should not give any real-world problem.

To make sure the list is as small as possible we need to treat the neutral elements the same as the case-sensitive elements.

This leaves us with a list of 45 attributes for HTML (both Quirksmode and Strict) that should be treated as case-insensitive. Every other attribute can be treated in a case-sensitive way.

lang, dir, http-equiv, text, link, vlink, alink, compact, align, frame, rules, valign, scope, axis, nowrap, hreflang, rel, rev, charset, codetype, declare, valuetype, shape, nohref, media, bgcolor, clear, color, face, noshade, noresize, scrolling, target, method, enctype, accept-charset, accept, checked, multiple, selected, disabled, readonly, language, defer, name

The list of attributes is only 22 elements for XHTML. Other attributes can be treated in a case-sensitive way.

http-equiv, text, link, vlink, alink, lang, axis, hreflang, rel, rev, charset, codetype, media, bgcolor, color, face, target, enctype, accept-charset, accept, language, name

I think this is as compatible as we can get without going overboard with rules.
Comment 17 Niels Leenheer (rakaz) 2006-10-16 21:44:10 UTC
ARG... still a mistake in the list.
I think the list below is the final and correct version.
Sorry for the bugspam.

------------------

This leaves us with a list of 45 attributes for HTML (both Quirksmode and Strict) that should be treated as case-insensitive. Every other attribute can be treated in a case-sensitive way.

lang, dir, http-equiv, text, link, vlink, alink, compact, align, frame, rules, valign, scope, axis, nowrap, hreflang, rel, rev, charset, codetype, declare, valuetype, shape, nohref, media, bgcolor, clear, color, face, noshade, noresize, scrolling, target, method, enctype, accept-charset, accept, checked, multiple, selected, disabled, readonly, language, defer, type

The list of attributes is only 22 elements for XHTML. Other attributes can be treated in a case-sensitive way.

http-equiv, text, link, vlink, alink, lang, axis, hreflang, rel, rev, charset, codetype, media, bgcolor, color, face, target, enctype, accept-charset, accept, language, type

I think this is as compatible as we can get without going overboard with rules.
Comment 18 Germain Garand 2006-10-17 03:05:47 UTC
I tend to agree with Allan that the described logic, though very thorough is rather on the complicated side for a matter that is, in the end, pretty much non-specified..
How could it be a win for document writers to have such a complicated decision process (even with a "workable list", this is still one more list)?

I'd rather not make premature choices arbitrarily raising the complexity bar..

Anyway, I'll attach my uber simple approach to the problem, based on attr list obtained with

grep -E '\[C(S|T)\]' ~/html40.txt  | perl -pe 's/\s+(\w+).*/$1/' - | sort | uniq

(beware that you need to run "perl misc/makeattrs")

Advantage is speed and lack of complexity.
Allan, what do you think of it?
Comment 19 Germain Garand 2006-10-17 03:09:10 UTC
Created attachment 18144 [details]
patch
Comment 20 Maksim Orlovich 2006-10-17 03:16:19 UTC
Should it be case-sensitive for custom attributes?
Comment 21 Germain Garand 2006-10-17 03:43:23 UTC
Good question. Either way seems equally arguable to me.
Case-insensitive matches Gecko.
Mmh..
Comment 22 Niels Leenheer (rakaz) 2006-10-17 10:17:42 UTC
The problem with grepping the specification is that the specification isn't intended to be machine-readable. Most of the results are accurate, but some are more or less arbitrary. I'd rather have a list based on a sound decision process than a list based on arbitrary undefined rules.

Amazingly, the results of grep are identical to the list I posted in #17. The grep list contains all case-sensitive attributes. My list contains all case-insensitive attributes. I chose to use the list of case-insensitive attributes because it is slightly smaller than the list of case-sensitive attributes. So I can live with the results, even though I do not like the way the results are reached. 

Dropping the separate list for XHTML is something I can live with. Even though it is not entirely according to the spec it will not cause problems in the real world.
Comment 23 Allan Sandfeld 2006-10-18 16:32:04 UTC
ggarand's patch looks good. 

The only thing I disagree with, is that I still believe XHTML attributes should always be treated case-sensitive. Especially because there is no difference between XHTML and generic XML attributes since XML namespaces are not applied to attributes. In other words the attributes can not be treated language specific in XML.
Comment 24 Germain Garand 2006-10-19 01:28:42 UTC
> In other words the attributes can not be treated language specific in XML. 

that's a good argument indeed
(and AFAICS the patch doesn't modify that, as the new cs value is just or'ed in the previous one, that already tested for XHTML).

@niels: thanks for your assessment, that's very much appreciated. It's great that the two lists match up in the end, so we can keep the rule simple, be it machine-extracted or not.
Comment 25 Niels Leenheer (rakaz) 2006-10-19 08:28:59 UTC
> The only thing I disagree with, is that I still believe XHTML attributes 
> should always be treated case-sensitive. Especially because there is no 
> difference between XHTML and generic XML attributes since XML namespaces are 
> not applied to attributes. 

It is true that an attribute needs to have its own namespace prefix. It will not inherit the namespace of the element it belongs to. So in an XHTML document it is quite common that attributes are in the global namespace.

> In other words the attributes can not be treated language specific in XML. 

That is where I believe you are wrong. If that were true, the most common notation of XHTML wouldn't work at all. This notation does not prefix the attributes at all, it only brings the elements into the XHTML namespace. 

Take for example: 

<html xmlns='http://www.w3.org/1999/xhtml'> 
...
<img src='something.png' alt='' />

If you were right, the img element would not show any image at all. The src attribute is not in the XHTML namespace, so we cannot treat it as such. XML does not define any language-transcending behavior, so we must ignore the src attribute.

In reality we use the XHTML language-specific behavior for attributes of elements in the XHTML namespace regardless of whether their attributes are in the XHTML namespace or in the global namespace. If we can do this for specific behavior like the src attribute, we can also do this with the case-sensitivity. 
Comment 26 Niels Leenheer (rakaz) 2006-10-19 17:10:13 UTC
Just one more thing about the case-sensitivity of unknown attributes. I believe that all unknown attributes should be treated in a case-sensitive way - for both HTML and XHTML. Case-insensitivity is the exception.

However if KHTML implements other standards we should take a good look at how that standard specifies the case-sensitivity of their attribute. Take for example the WHATWG WebApps spec. It defines the contenteditable attribute as case-insensitive.
Comment 27 Germain Garand 2006-10-19 18:21:23 UTC
> If you were right, the img element would not show any image at all.[...]
> XML does not define any language-transcending behavior, so we must ignore the
> src attribute.

this argument doesn't make sense to me. Attributes in global namespace have absolutely no intrinsic properties or behaviour... it is the XHTML *element* that gives a special purpose/meaning to its attributes, and modifies its own state accordingly.

This special meaning is completely confined within the element, it doesn't have any effect on the attribute itself.

Therefore, it is quite possible the XHTML element will ignore the case of its attributes for all internal purposes, but there is no way this knowledge can be transferred to the style matching process.
Comment 28 Niels Leenheer (rakaz) 2006-10-20 11:20:09 UTC
My point wasn't that attributes have intrinsic properties or behavior on their own. Attributes do not live in a vacuum - they are part of an element - and even if an attribute is in the global namespace we can still determine what property or behavior that attribute should give the element. I'm not a native English speaker, so I probably could have been clearer about that.

However, my point was that we already use *special knowledge* about elements and its attributes in KHTML. I see no reason why we should not use that knowledge in other areas. 

> This special meaning is completely confined within the element, it doesn't
> have any effect on the attribute itself.

True, the value of the attribute itself is not affected, but it does affect the way the element interacts with that particular attribute. 

> Therefore, it is quite possible the XHTML element will ignore the case of 
> its attributes for all internal purposes, but there is no way this knowledge 
> can be transfered to the style matching process.

This is the point where we disagree. I believe that that knowledge can be transferred to the style matching process. In fact, I believe that the CSS specification demands that we transfer that knowledge:

"The case-sensitivity of attribute names and values in selectors depends on the document language."

Comment 29 Germain Garand 2006-10-20 12:26:53 UTC
My point of view is that having to change the way CSS style matching is done on attributes case by case and according to the *HTML* specification is already a horrible hack that breaks CSS encapsulation, one that I only consider as a compatibility measure because it's implemented that way in 2 significant other engines and because the CSS specification is fuzzy on that topic.

But please, let's just restrict that to good old broken HTML for now, and if the CSS WG ever feels the need to be more specific about what they meant with "depends on the document language", then we will see. 

So, I'll commit what I have, modified to consider unknown attributes case-sensitive, and I'll let anyone interested go deeper in that question by discussing it on the WG lists.
 
Comment 30 Germain Garand 2006-10-20 22:26:57 UTC
SVN commit 597587 by ggarand:

second part of #135505 fix.

Some attribute values have to be treated case-insensitively during 
style selection in order to match HTML's caprices.

Attributes marked with [CS] and [CT] are treated case-sensitively,
along with unknown attributes.

BUG: 135505



 M  +7 -5      css/cssstyleselector.cpp  
 M  +199 -199  misc/htmlattrs.c  
 M  +152 -151  misc/htmlattrs.h  
 M  +53 -48    misc/htmlattrs.in  
 M  +6 -0      misc/makeattrs
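
The rule stated in the commit message above, combined with the earlier remark that XHTML stays case-sensitive, can be illustrated with the following sketch. It is not the committed code (the real table is generated into misc/htmlattrs.*, which is not shown here). The CaseRule enum, the kKnownAttrs excerpt and the function name are hypothetical, and attribute names are assumed to be lowercased.

```cpp
#include <map>
#include <string>

// Hypothetical classification of a known HTML attribute's value.
enum class CaseRule { Sensitive, Insensitive };

// Hypothetical excerpt: id and class stand in for [CS], href for [CT];
// align, dir and rel stand in for the case-insensitive attributes.
static const std::map<std::string, CaseRule> kKnownAttrs = {
    {"id", CaseRule::Sensitive},      {"class", CaseRule::Sensitive},
    {"href", CaseRule::Sensitive},    {"align", CaseRule::Insensitive},
    {"dir", CaseRule::Insensitive},   {"rel", CaseRule::Insensitive}};

// Returns true when a selector on this attribute should compare values case-sensitively.
bool compareCaseSensitively(const std::string& attr, bool documentIsXhtml) {
    if (documentIsXhtml) return true;            // XHTML: always case-sensitive
    auto it = kKnownAttrs.find(attr);
    if (it == kKnownAttrs.end()) return true;    // unknown attribute: case-sensitive
    return it->second == CaseRule::Sensitive;    // HTML: depends on the attribute
}
```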