Bug 112491

Summary: CDATA in feed is not handled correctly
Product: [Applications] akregator Reporter: Eckhart Wörner <ewoerner>
Component: feed parser Assignee: kdepim bugs <kdepim-bugs>
Status: RESOLVED WORKSFORME    
Severity: normal CC: bderidder, jkt, muczyjoe, osterfeld, roman.cheplyaka
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: unspecified   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:

Description Eckhart Wörner 2005-09-12 19:19:23 UTC
Version:           1.2 (using KDE 3.4.2, Kubuntu Package 4:3.4.2-0ubuntu0hoary2 )
Compiler:          gcc version 3.3.5 (Debian 1:3.3.5-8ubuntu2)
OS:                Linux (i686) release 2.6.10-5-386

In http://www.blogistan.co.uk/qt/atom.xml , <![CDATA[ ... ]]> is used to wrap the articles. The CDATA delimiters are part of the XML syntax, so the parser should consume them rather than pass them on to KHTML. At the moment they are passed through to KHTML, which produces garbled rendering.
Comment 1 Frank Osterfeld 2005-09-30 08:36:09 UTC
This example is not Atom-1.0 compliant.

In Atom, CDATA does not seem to be valid in <content type="html">, according to

http://www.atomenabled.org/developers/syndication/#text

    "If type="html", then this element contains entity escaped html.
    <title type="html">
      AT&amp;amp;T bought &lt;b&gt;by SBC&lt;/b&gt;!
    </title>"

So the feed should use escaped HTML instead of CDATA.
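
To make the quoted example concrete, here is a minimal sketch of the single level of entity decoding an XML parser applies to that title before the result is treated as HTML. The function name unescapeOnce is hypothetical and is not Akregator or Qt code. It handles only the five predefined XML entities, and a real parser also handles numeric character references.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not Akregator code): resolve the five predefined XML
// entities in one left-to-right pass, so "&amp;amp;" becomes "&amp;", not "&".
std::string unescapeOnce(const std::string& in)
{
    static const std::vector<std::pair<std::string, char>> entities = {
        {"&amp;", '&'}, {"&lt;", '<'}, {"&gt;", '>'}, {"&quot;", '"'}, {"&apos;", '\''}
    };
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        bool matched = false;
        if (in[i] == '&') {
            for (const auto& entity : entities) {
                // compare() clips the substring at the end of the input,
                // so a truncated entity simply fails to match.
                if (in.compare(i, entity.first.size(), entity.first) == 0) {
                    out += entity.second;
                    i += entity.first.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched)
            out += in[i++]; // ordinary character, copy as-is
    }
    return out;
}
```

Applied to the title from the quote, unescapeOnce yields the HTML string AT&amp;T bought <b>by SBC</b>!, which an HTML renderer would then display as "AT&T bought by SBC!" with the last two words in bold.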
Comment 2 Eckhart Wörner 2005-10-23 11:29:18 UTC
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-cdata-sect says:

"[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]"
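
As an illustration of that definition, the sketch below replaces each CDATA section with its literal contents, which is roughly what a conforming XML parser does internally before handing character data to the application. The function name unwrapCdata is hypothetical. It works on a bare string rather than a parsed document, so it is only meant to show the expected result, not to stand in for a parser.

```cpp
#include <string>

// Hypothetical illustration (a real XML parser does this internally):
// replace each <![CDATA[ ... ]]> section with its literal contents.
std::string unwrapCdata(const std::string& in)
{
    const std::string open = "<![CDATA[";
    const std::string close = "]]>";
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t start = in.find(open, pos);
        if (start == std::string::npos)
            break;
        const std::size_t end = in.find(close, start + open.size());
        if (end == std::string::npos)
            break; // unterminated section: leave the rest untouched
        out.append(in, pos, start - pos);                                  // text before the section
        out.append(in, start + open.size(), end - (start + open.size()));  // section contents, verbatim
        pos = end + close.size();
    }
    out.append(in, pos, std::string::npos); // remaining text after the last section
    return out;
}
```

With this, the input <![CDATA[<b>by SBC</b>]]> comes out as <b>by SBC</b>, so the delimiters never reach the HTML renderer.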
Comment 3 Eckhart Wörner 2005-11-10 17:13:15 UTC
*** Bug 116051 has been marked as a duplicate of this bug. ***
Comment 4 Frank Osterfeld 2006-01-16 00:16:41 UTC
SVN commit 498704 by osterfeld:

fix atom:content parsing: Don't show tags for Atom 1.0 feeds with escaped HTML in them

BUG: 112491, 117938


 M  +36 -15    tools_p.cpp  


--- branches/KDE/3.5/kdepim/akregator/src/librss/tools_p.cpp #498703:498704
@@ -47,21 +47,42 @@
 	QDomElement e = node.toElement();
 	QString result;
 
-	if (elemName == "content" && ((e.hasAttribute("mode") && e.attribute("mode") == "xml") || !e.hasAttribute("mode")))
-		result = childNodesAsXML(node);
-	else
-		result = e.text();
-
-	bool hasPre = result.contains("<pre>",false);
-	bool hasHtml = hasPre || result.contains("<");	// FIXME: test if we have html, should be more clever -> regexp
-	if(!isInlined && !hasHtml)						// perform nl2br if not a inline elt and it has no html elts
-		result = result = result.replace(QChar('\n'), "<br />");
-	if(!hasPre)										// strip white spaces if no <pre>
-		result = result.simplifyWhiteSpace();
-
-	if (result.isEmpty())
-		return QString::null;
-
+        bool doHTMLCheck = true;
+ 
+        if (elemName == "content") // we have Atom here
+        {
+            doHTMLCheck = false;
+            // the first line is always the Atom 0.3, the second Atom 1.0
+            if (( e.hasAttribute("mode") && e.attribute("mode") == "escaped" && e.attribute("type") == "text/html" )
+            || (!e.hasAttribute("mode") && e.attribute("type") == "html"))
+            {
+                result = KCharsets::resolveEntities(e.text().simplifyWhiteSpace()); // escaped html
+            }
+            else if (( e.hasAttribute("mode") && e.attribute("mode") == "escaped" && e.attribute("type") == "text/plain" )
+                       || (!e.hasAttribute("mode") && e.attribute("type") == "text"))
+            {
+                result = e.text().stripWhiteSpace(); // plain text
+            }
+            else if (( e.hasAttribute("mode") && e.attribute("mode") == "xml" )
+                       || (!e.hasAttribute("mode") && e.attribute("type") == "xhtml"))
+            {
+                result = childNodesAsXML(e); // embedded XHMTL
+            }
+            
+        }
+        
+        if (doHTMLCheck) // check for HTML; not necessary for Atom:content
+        {
+            bool hasPre = result.contains("<pre>",false);
+            bool hasHtml = hasPre || result.contains("<");	// FIXME: test if we have html, should be more clever -> regexp
+            if(!isInlined && !hasHtml)						// perform nl2br if not a inline elt and it has no html elts
+                    result = result = result.replace(QChar('\n'), "<br />");
+            if(!hasPre)										// strip white spaces if no <pre>
+                    result = result.simplifyWhiteSpace();
+        
+            if (result.isEmpty())
+                    return QString::null;
+        }
 	return result;
 }
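
For readers who want the branch logic of this patch without the Qt types, here is a minimal standalone sketch. The enum, the type alias and the function name are hypothetical; only the attribute names and values ("mode", "type", "escaped", "text/html", "html", and so on) are taken from the diff. A missing attribute is treated as an empty string, which is the assumption made here about how QDomElement::attribute() behaves in the patched code.

```cpp
#include <map>
#include <string>

// Hypothetical names; only the attribute values come from the patch.
enum class ContentKind { EscapedHtml, PlainText, Xml, Unknown };

using Attributes = std::map<std::string, std::string>;

ContentKind classifyAtomContent(const Attributes& attrs)
{
    const auto modeIt = attrs.find("mode");
    const auto typeIt = attrs.find("type");
    const bool hasMode = modeIt != attrs.end();
    const std::string mode = hasMode ? modeIt->second : std::string();
    const std::string type = typeIt != attrs.end() ? typeIt->second : std::string();

    // In each condition the first clause is Atom 0.3, the second Atom 1.0.
    if ((hasMode && mode == "escaped" && type == "text/html") || (!hasMode && type == "html"))
        return ContentKind::EscapedHtml; // patch: resolve entities in the text
    if ((hasMode && mode == "escaped" && type == "text/plain") || (!hasMode && type == "text"))
        return ContentKind::PlainText;   // patch: use the stripped text as-is
    if ((hasMode && mode == "xml") || (!hasMode && type == "xhtml"))
        return ContentKind::Xml;         // patch: serialize the child nodes
    return ContentKind::Unknown;         // patch: no branch assigns result
}
```

For example, classifyAtomContent(Attributes{{"type", "html"}}) returns ContentKind::EscapedHtml. As far as the diff shows, a content element with neither attribute set falls through every branch and leaves result empty.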
 
Comment 5 Eckhart Wörner 2006-02-28 21:52:29 UTC
This bug has only been fixed for Atom, not for RSS, so I am reopening it.
Comment 6 Eckhart Wörner 2006-02-28 21:53:58 UTC
*** Bug 122857 has been marked as a duplicate of this bug. ***
Comment 7 Peter Avramucz 2007-05-19 15:54:06 UTC
Same here.
Gentoo ~amd64 kde 3.5.6
Please fix this annoying bug! 
Comment 8 Frank Osterfeld 2008-10-15 21:43:24 UTC
Considered fixed in 4.x. Otherwise, reopen with a current test feed (XML file, not a link).