Bug 408079 - Fetching feeds causes duplicated items
Summary: Fetching feeds causes duplicated items
Status: REPORTED
Alias: None
Product: akregator
Classification: Applications
Component: feed parser
Version: GIT (master)
Platform: Debian unstable Linux
Importance: NOR normal
Target Milestone: ---
Assignee: kdepim bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-05-29 18:20 UTC by Daniel Roschka
Modified: 2019-05-29 18:20 UTC
0 users

Description Daniel Roschka 2019-05-29 18:20:00 UTC
When a feed is fetched multiple times, Akregator duplicates an existing item whenever the content of the fetched item differs from the content of the same item already stored locally. I have been suffering from this bug for 10+ years and would like to see it finally gone.

Here is my theory of why it happens:

Instead of comparing two items for equality by guid alone, Akregator also builds a hash over title, description, content, link and author (https://github.com/KDE/akregator/blob/0d588dcbfb9cc93dec5b6bcbf3b01336ca1d09ce/src/feed/feed.cpp#L581-L585 and https://github.com/KDE/akregator/blob/0d588dcbfb9cc93dec5b6bcbf3b01336ca1d09ce/src/article.cpp#L189) and compares that as well, unless the guid starts with "hash:". I believe this is not in line with the specification, which states:

> guid stands for globally unique identifier. It's a string that uniquely identifies the item.
> When present, an aggregator may choose to use this string to determine if an item is new.
> 
> <guid>http://some.server.com/weblogItem3207</guid>
> 
> There are no rules for the syntax of a guid. Aggregators must view them as a string. It's up to
> the source of the feed to establish the uniqueness of the string.

http://www.rssboard.org/rss-specification#ltguidgtSubelementOfLtitemgt

The current behavior produces duplicate items when authors fix typos in their posts or when software inserts random bits into the data, e.g. in JavaScript included in the markup. Podlove Publisher is known for that (https://github.com/podlove/podlove-publisher/blob/192a2710b6ad3d0f5eff67f4daacb5d6dac6ab4a/lib/modules/subscribe_button/button.php#L88). The latter case is particularly annoying because it produces a new item every single time Akregator fetches the feed.
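
A minimal sketch of the guid-first matching described above, to make the expected behavior concrete. This is not Akregator's code: the `Item` struct and the `contentHash`, `sameEntry` and `mergeFetched` helpers are hypothetical names invented for illustration, and the hash covers only title and content to keep the example short. The assumption is that a content hash should act as a fallback identity only when the feed provides no guid at all.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a feed item; not Akregator's Article class.
struct Item {
    std::string guid;    // value of <guid>, empty if the feed provides none
    std::string title;
    std::string content;
};

// Content hash over mutable fields, used only as a fallback identity.
inline std::size_t contentHash(const Item &item)
{
    return std::hash<std::string>{}(item.title + '\n' + item.content);
}

// Two items are the same entry if their guids match. The content hash
// decides only when neither item carries a guid.
inline bool sameEntry(const Item &a, const Item &b)
{
    if (!a.guid.empty() || !b.guid.empty()) {
        return a.guid == b.guid;
    }
    return contentHash(a) == contentHash(b);
}

// Merge a freshly fetched item into the local store: update in place when
// the entry is already known, append otherwise. Returns true if appended.
inline bool mergeFetched(std::vector<Item> &store, const Item &fetched)
{
    for (Item &existing : store) {
        if (sameEntry(existing, fetched)) {
            existing = fetched; // pick up typo fixes, changed markup, etc.
            return false;
        }
    }
    store.push_back(fetched);
    return true;
}
```

With this logic, fetching an item with guid `http://some.server.com/weblogItem3207` a second time with a corrected title or different embedded script content updates the stored item instead of appending a duplicate. Items without a guid still fall back to the content hash.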

I'd be happy to provide additional information if necessary.