Bug 93400

Summary: filtering of double entries in feed lists
Product: [Applications] akregator Reporter: m.wege
Component: general Assignee: kdepim bugs <kdepim-bugs>
Status: RESOLVED FIXED    
Severity: wishlist CC: c.hamacher
Priority: NOR    
Version: unspecified   
Target Milestone: ---   
Platform: Debian testing   
OS: Linux   
Latest Commit: Version Fixed In:
Sentry Crash Report:

Description m.wege 2004-11-16 22:56:21 UTC
Version:            (using KDE KDE 3.3.1)
Installed from:    Debian testing/unstable Packages

It quite often happens with news feeds (e.g. at Spiegel.de or Netzeitung.de) that the same news item is linked more than once when it is updated, often just because the headline has changed. What these entries seem to have in common is the URL pointing to the article. It would be really good if akregator could filter these duplicate entries, at least when the article has not yet been read. If it has already been read, akregator could mark the article as updated.
Comment 1 Stephan Binner 2004-11-20 11:29:33 UTC
> What seems to be in common is the url which is pointing to the article.

Note that there are feeds (e.g. the distrowatch apps feed) where all articles point to the same URL (the homepage). :-)
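
To make the trade-off concrete, here is a minimal sketch of the URL-keyed filtering asked for in the description. It is not akregator code: the `Entry` struct and `collapseByLink` are hypothetical names, and entries are assumed to arrive oldest first. A repeated link replaces the stored headline, and the entry is flagged as updated only if it had already been read.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified article record; not akregator's Article class.
struct Entry {
    std::string title;
    std::string link;
    bool read;
    bool updated;
};

// Collapse entries that share a link. Entries are assumed to be ordered
// oldest first, so a later entry with the same link is treated as an update.
std::vector<Entry> collapseByLink(const std::vector<Entry> &entries)
{
    std::vector<Entry> result;
    std::unordered_map<std::string, std::size_t> indexByLink;

    for (const Entry &e : entries) {
        auto it = indexByLink.find(e.link);
        if (it == indexByLink.end()) {
            // First time this link is seen: keep the entry as-is.
            indexByLink.emplace(e.link, result.size());
            result.push_back(e);
            continue;
        }
        Entry &kept = result[it->second];
        // Already-read entries are flagged as updated; unread ones are
        // silently replaced by the newer headline.
        if (kept.read && kept.title != e.title)
            kept.updated = true;
        kept.title = e.title;
    }
    return result;
}
```

With a feed where every item links to the same homepage, this approach would collapse the whole feed into a single entry, which is exactly the pitfall noted above.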
Comment 2 Vincent P 2005-03-29 12:28:08 UTC
*** This bug has been confirmed by popular vote. ***
Comment 3 Frank Osterfeld 2005-04-20 22:05:49 UTC
CVS commit by osterfeld: 

use rdf:about as ID in RSS 1.0 (RDF) feeds. This should reduce the number of dupes significantly.
(Checking for "rdf:about" instead of resolving the namespace properly is a hack (as one could use another prefix 
for the RDF namespace), but attributeNS() didn't work)
CCBUG: 93400


  M +21 -8     article.cpp   1.23


--- kdepim/akregator/src/librss/article.cpp  #1.22:1.23
@@ -125,7 +125,20 @@ Article::Article(const QDomNode &node, F
     }
 
+    QDomElement element = QDomNode(node).toElement();
+
+    // in RSS 1.0, we use <item about> attribute as ID
+    // FIXME: pass format version instead of checking for attribute
+
+    if (!element.isNull() && element.hasAttribute(QString::fromLatin1("rdf:about")))
+    {
+        d->guid = element.attribute(QString::fromLatin1("rdf:about")); // HACK: using ns properly did not work
+        d->guidIsPermaLink = false;
+    }
+    else
+    {
     tagName=(format==AtomFeed)? QString::fromLatin1("id"): QString::fromLatin1("guid");
     QDomNode n = node.namedItem(tagName);
-        if (!n.isNull()) {
+            if (!n.isNull())
+        {
                 d->guidIsPermaLink = (format==AtomFeed)? false : true;
                 if (n.toElement().attribute(QString::fromLatin1("isPermaLink"), "true") == "false") d->guidIsPermaLink = false;
@@ -134,4 +146,5 @@ Article::Article(const QDomNode &node, F
                         d->guid = elemText;
         }
+    }    
 
         if(d->guid.isEmpty()) {
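
The commit message calls the literal "rdf:about" check a hack because a feed may bind the RDF namespace to a different prefix. The following is a minimal sketch of what a namespace-aware lookup would do instead. It uses plain standard C++ rather than Qt; the `Element` struct and `attributeByNamespace` are hypothetical stand-ins and not part of librss or QDom. The only non-hypothetical value is the RDF namespace URI.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a parsed XML element; not Qt's QDomElement.
struct Element {
    std::map<std::string, std::string> prefixToNamespace; // in-scope xmlns:prefix declarations
    std::map<std::string, std::string> attributes;        // qualified name -> value
};

// Namespace URI of RDF, as used by RSS 1.0 documents.
const std::string kRdfNamespace = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

// Return the value of the attribute identified by namespace URI and local
// name, regardless of which prefix the document uses. Returns an empty
// string if no such attribute exists.
std::string attributeByNamespace(const Element &el,
                                 const std::string &nsUri,
                                 const std::string &localName)
{
    for (const auto &attr : el.attributes) {
        const std::string &qname = attr.first;
        const std::string::size_type colon = qname.find(':');
        if (colon == std::string::npos)
            continue; // unprefixed attributes are in no namespace

        const std::string prefix = qname.substr(0, colon);
        const std::string local = qname.substr(colon + 1);

        auto ns = el.prefixToNamespace.find(prefix);
        if (ns != el.prefixToNamespace.end() && ns->second == nsUri && local == localName)
            return attr.second;
    }
    return std::string();
}
```

A feed that declares the RDF namespace under a prefix such as "r" has no attribute named "rdf:about", so the literal check in the patch would miss it, while a lookup by namespace URI and local name would still find the ID.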
Comment 4 Frank Osterfeld 2005-04-20 22:37:16 UTC
CVS commit by osterfeld: 

backport: use rdf:about in RSS 1.0 feeds as guid.
CCBUG: 93400


  M +21 -8     article.cpp   1.22.6.1


--- kdepim/akregator/src/librss/article.cpp  #1.22:1.22.6.1
@@ -125,7 +125,20 @@ Article::Article(const QDomNode &node, F
     }
 
+    QDomElement element = QDomNode(node).toElement();
+
+    // in RSS 1.0, we use <item about> attribute as ID
+    // FIXME: pass format version instead of checking for attribute
+
+    if (!element.isNull() && element.hasAttribute(QString::fromLatin1("rdf:about")))
+    {
+        d->guid = element.attribute(QString::fromLatin1("rdf:about")); // HACK: using ns properly did not work
+        d->guidIsPermaLink = false;
+    }
+    else
+    {
     tagName=(format==AtomFeed)? QString::fromLatin1("id"): QString::fromLatin1("guid");
     QDomNode n = node.namedItem(tagName);
-        if (!n.isNull()) {
+            if (!n.isNull())
+        {
                 d->guidIsPermaLink = (format==AtomFeed)? false : true;
                 if (n.toElement().attribute(QString::fromLatin1("isPermaLink"), "true") == "false") d->guidIsPermaLink = false;
@@ -134,4 +146,5 @@ Article::Article(const QDomNode &node, F
                         d->guid = elemText;
         }
+    }    
 
         if(d->guid.isEmpty()) {
Comment 5 Heinrich Wendel 2005-06-08 16:51:44 UTC
Different feeds may have the same articles as well, e.g. people might publish their things on planetgnome and planetfreedesktop. This should be filtered in the "All Feeds" list as well.
Comment 6 Heinrich Wendel 2005-06-15 19:37:21 UTC
The cleanest way to implement that would be to check in feed.h:appendArticles whether an article already exists (by hash/guid). If the article already exists, take the old one and append it to the list. An article must then be able to have more than one m_feed, which causes some incompatibilities that have to be considered.
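
A rough sketch of that idea, assuming GUIDs can be compared across feeds. This is not akregator's feed.h: `Article`, `GlobalIndex` and `intern` are hypothetical names, and the set of feed names stands in for "more than one m_feed".

```cpp
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <unordered_map>

// Hypothetical article type; not akregator's Article class.
struct Article {
    std::string guid;
    std::string title;
    std::set<std::string> feeds; // names of all feeds carrying this article
};

// Hypothetical global index keyed by guid.
class GlobalIndex {
public:
    // If an article with the same guid is already known, register the feed
    // on the existing article and return it. Otherwise store the incoming
    // article and return the stored copy.
    std::shared_ptr<Article> intern(const std::string &feedName, const Article &incoming)
    {
        auto it = m_byGuid.find(incoming.guid);
        if (it != m_byGuid.end()) {
            it->second->feeds.insert(feedName);
            return it->second;
        }
        std::shared_ptr<Article> stored = std::make_shared<Article>(incoming);
        stored->feeds.insert(feedName);
        m_byGuid.emplace(incoming.guid, stored);
        return stored;
    }

    std::size_t size() const { return m_byGuid.size(); }

private:
    std::unordered_map<std::string, std::shared_ptr<Article>> m_byGuid;
};
```

In this sketch each feed's list would hold the shared pointer returned by `intern`, so the same article object appears in every feed that carries it.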
Comment 7 Frank Osterfeld 2005-06-15 20:20:14 UTC
@Heinrich: That would need a global archive, or at least a global article index. The current implementation is based on the assumption that every article is part of exactly one feed and that it is the feed's business to manage its articles (GUIDs are considered unique only per feed, expiry, notification inside of akregator etc.). I won't introduce additional complexity just because of a few articles showing up in multiple aggregator feeds.

I'm closing this bug because the problems with per-feed dupes (the original report) are fixed, except in cases where we have no ID (RSS 0.9x) and can't fix it properly.
Comment 8 Heinrich Wendel 2005-06-16 00:47:54 UTC
Yes, you are right, currently every article can only have one feed, but the global archive could be the "All Feeds" feed. We could then add an attribute like "duplicates" to the article, in which its duplicates are saved. Actions like "mark as read" could then be performed on the article and its duplicates. In fact, I have a lot of duplicates here (at least 20%).
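
As a small illustration of the "duplicates" attribute, assuming each article keeps plain pointers to its copies in other feeds. The `Article` struct and `markAsRead` are hypothetical and not akregator's API.

```cpp
#include <string>
#include <vector>

// Hypothetical article type carrying a list of its duplicates.
struct Article {
    std::string guid;
    bool read;
    std::vector<Article *> duplicates; // non-owning pointers to copies in other feeds
};

// Mark an article as read and propagate the state to its duplicates.
void markAsRead(Article &article)
{
    article.read = true;
    for (Article *dup : article.duplicates)
        dup->read = true;
}
```

The propagation here is one level deep and one-directional; a real implementation would have to decide whether duplicates link back to each other.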
Comment 9 Eckhart Wörner 2005-06-16 21:19:00 UTC
Heinrich Wendel: Please use bug #100784 which deals with that problem for discussion.