Bug 139302 - incorrect parsing of some RSS feeds
Status: RESOLVED WORKSFORME
Alias: None
Product: akregator
Classification: Applications
Component: general
Version: unspecified
Platform: Debian testing Linux
Importance: NOR normal
Target Milestone: ---
Assignee: kdepim bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-12-28 10:15 UTC by Chet Murthy
Modified: 2008-10-26 22:33 UTC

See Also:
Latest Commit:
Version Fixed In:


Description Chet Murthy 2006-12-28 10:15:09 UTC
Version:           (using KDE 3.5.5)
Installed from:    Debian testing/unstable Packages
OS:                Linux

In Akregator, some RSS feeds (e.g. http://feeds.feedburner.com/Hullabaloo) are not parsed properly: a large number of entries show up empty, and a similarly large number show up mangled, with bogus dates in the future.

In the case of http://economistsview.typepad.com/economistsview/index.rdf, I switched to the Atom format feed, and the problem no longer occurred.

So I suspect it's something wrong with the way that RSS is being handled.
Comment 1 deleted_email_KsJQa 2006-12-29 12:38:26 UTC
It might be something similar to bug #139043, though I cannot find date problems with http://feeds.feedburner.com/Hullabaloo (nor with http://blog.beryl-project.org/?feed=rss2 from the other bug report).

Chet, do you have the same problem with "http://www.tatanka.com.br/ies4linux/news/feed/"? (check comment 1, from the other bug report, for what problems I encounter).

I tried to find the problem myself, but I had some difficulty building a working development environment (well, not that I really searched for documentation), so I gave up.

Dates are extracted and converted in "kdepim-3.5.5/akregator/src/librss/article.cpp", lines 114 to 140.

In the same directory, "tools_p.cpp", lines 20 to 31, defines the "parseISO8601Date()" function (it comes from Atom, so it shouldn't be used for RSS 2.0 feeds).

"kdelibs-3.5.5/kdecore/krfcdate.cpp" defines the function "KRFCDate::parseDate()", which is probably where the problem lies, if it is not a global parsing problem.
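
To illustrate the two date formats involved: RSS 2.0 "pubDate" values follow RFC 822, while Atom timestamps follow RFC 3339 (an ISO 8601 profile). The sketch below uses Python's standard library rather than the KDE functions named above, and the sample strings are made up for illustration. It only shows that the same instant is written differently in each format, so each needs its own parser.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Hypothetical sample values, not taken from any of the feeds in this report.
rss_pub_date = "Thu, 28 Dec 2006 10:15:09 +0000"   # RFC 822 style (RSS 2.0)
atom_updated = "2006-12-28T10:15:09+00:00"         # RFC 3339 style (Atom)

# Each format goes through its own parser.
rss_dt = parsedate_to_datetime(rss_pub_date)
atom_dt = datetime.fromisoformat(atom_updated)

# Both strings describe the same instant.
same_instant = (rss_dt == atom_dt)
print(rss_dt.isoformat(), atom_dt.isoformat(), same_instant)
```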

A basic workaround (though with a working development environment the real problem would probably be easy to find) would be to detect obvious parsing problems, such as a date too far in the past or in the future, and simply not set that date, keeping the default fetch date instead. The same could be done when only a badly parsed time is detected, like " 00:00". Could this be implemented for KDE 3.5.6? Bugs can still be fixed, right?
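
A minimal sketch of that workaround, in Python rather than in Akregator's C++ and not based on its actual code. The function name "article_date" and both tolerance values are assumptions chosen for illustration: an unparsable or implausible date is dropped and the fetch date is kept.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical tolerances; these are not Akregator settings.
MAX_FUTURE = timedelta(days=1)
MAX_PAST = timedelta(days=365 * 30)

def article_date(raw_pub_date, fetch_date):
    """Return the parsed pubDate, or fetch_date if it is missing or implausible."""
    if not raw_pub_date:
        return fetch_date
    try:
        parsed = parsedate_to_datetime(raw_pub_date)
    except (TypeError, ValueError):
        # Unparsable input: keep the default fetch date.
        return fetch_date
    if parsed.tzinfo is None:
        # A "-0000" offset yields a naive datetime; treat it as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    if parsed > fetch_date + MAX_FUTURE or parsed < fetch_date - MAX_PAST:
        # Too far in the future or in the past: do not trust it.
        return fetch_date
    return parsed

fetch = datetime(2006, 12, 29, 12, 0, tzinfo=timezone.utc)
print(article_date("Thu, 28 Dec 2006 10:15:09 +0000", fetch))  # plausible, kept
print(article_date("Thu, 20 Nov 2031 22:47:00 +0000", fetch))  # future, replaced
```

This only hides the symptom; it would not explain why the date was misparsed in the first place.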
Comment 2 Chet Murthy 2006-12-29 22:02:59 UTC
Mathieu,

Thank you for your quick email.  Neither of the two URLs (beryl or
tatanka) caused a problem in my Akregator.

I do not think the problem is with date-parsing.  I believe it is with
the parsing of the RSS feed into separate entries.  I say this because
whenever this happens, it is associated with many, many entries.

The newest entries look somewhat like valid entries, but their
subjects contain partial URLs,

digbysblog_archive.html#115016467939408795
Date:Thursday 20 November 2031 10:47 pm

but the body of the entry is intact and correct.  Then, farther down
the list (earlier in timestamps) there are entries that are empty,
with no subject.

Is there a way I can enable some sort of akregator trace?

Thanks,
--chet--

Comment 3 deleted_email_KsJQa 2006-12-29 22:52:58 UTC
>
> Neither of the two URLs (beryl or
> tatanka) caused a problem in my Akregator.
>


Well, I just retried with "http://www.tatanka.com.br/ies4linux/news/feed/", after stopping Akregator and removing the "http___www.tatanka.com.br_ies4linux_news_feed_.mk4" archive file (in ~/kde/share/apps/akregator/Archive), and the date problem is gone, though the feed does not seem to have changed... (I'll ask the website owner about it, just in case).

However, the entries still disappear when I unselect the feed and select it again... Is this intentional? When Akregator is set to keep only x days of archive, does it remove older items on unselect/select even when the news was just fetched? (That is, does it take into account only the article's specified date, when there is one, instead of the fetch date?)

There is still a bug, though... When I select the feed, it says "IEs 4 Linux News (4294967286 unread articles)", though there are no more articles in the article list, and there weren't even 10 articles ;). When I restart Akregator and select the feed, it properly says "no unread articles"...
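
One possible reading of that number, offered as a guess and not as a diagnosis: 4294967286 equals 2^32 - 10, which is what a count of -10 looks like when it is stored or displayed as an unsigned 32-bit integer. The sketch below only demonstrates the arithmetic; the helper "as_uint32" is hypothetical, and nothing here is taken from Akregator's code.

```python
def as_uint32(value):
    """Reinterpret a Python int as an unsigned 32-bit value (wraps modulo 2**32)."""
    return value & 0xFFFFFFFF

# Assumed scenario: the unread counter is decremented ten times
# for articles that were never counted as unread to begin with.
unread = 0
for _ in range(10):
    unread -= 1

print(unread)             # -10
print(as_uint32(unread))  # 4294967286
```

If that is what happens, the counter would be going below zero when expired articles are removed, which would fit the list being empty at the same time.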

This might warrant a separate bug report, but it all seems pretty much related to a parsing and probably archive problem...


>
> The newest entries look somewhat like valid entries, but their
> subjects contain partial URLs,
> 
> digbysblog_archive.html#115016467939408795
> Date:Thursday 20 November 2031 10:47 pm 
>


I don't remember anything like this... In my case, as explained in bug #139043 comment 1, the dates were like "2935093-02-24 23:59", or " 00:00", and the order was random. I didn't check the content of every article, but I don't remember anything strange... (though the date problem caught my attention, and I might have missed other problems...).
Comment 4 deleted_email_KsJQa 2007-01-01 00:10:40 UTC
I contacted the owner of "http://www.tatanka.com.br/ies4linux/news/feed/", and nothing has changed in the feed since the first time I fetched it and encountered the problem mentioned above.

I didn't update anything either, on my computer, so it really seems there is something wrong with parsing and archiving...

I hope some KDE/Akregator developer can reproduce it, and/or find where it comes from, though it does not seem strictly reproducible...
Comment 5 Frank Osterfeld 2007-01-12 16:06:16 UTC
I can't confirm parsing problems with any of the reported feeds. Mathieu, could you tell me your expiration settings for the feed where switching feeds empties the list?
Comment 6 deleted_email_KsJQa 2007-01-12 16:50:06 UTC
>
> I can't confirm parsing problems with any of the reported feeds.
>


Well, there was a new article, since then, in the "http://www.tatanka.com.br/ies4linux/news/feed/" feed, and it was fetched and displayed normally (no date problem, and the article is kept).

If someone has some time to check the functions related to RSS 2 feed parsing and to date parsing/conversion, for any problem which could create dates like the ones I (and the two other reporters) had, maybe something could be found...

Maybe the workaround in comment #1 could be applied temporarily, together with debug information written to stderr or to a temporary log file: feed address, fetch date, entry title, original date data (with non-ASCII characters in hexadecimal, to make any encoding problem easy to spot), parsed date, and the final conversion as it appears in Akregator. A warning could ask users to report the context in which the problem happened, and some general information, notably the versions in use, could be added to the log.
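
A rough sketch of what such a debug line could look like, in Python for brevity. The function names "hex_escape" and "date_debug_line", the field names, and the sample values (including the non-breaking space in the raw date) are all made up for illustration; none of this exists in Akregator.

```python
import sys

def hex_escape(text):
    """Keep printable ASCII; show any other character as \\xNN per UTF-8 byte."""
    out = []
    for ch in text:
        if 32 <= ord(ch) < 127:
            out.append(ch)
        else:
            out.extend("\\x%02x" % byte for byte in ch.encode("utf-8", "surrogatepass"))
    return "".join(out)

def date_debug_line(feed_url, fetch_date, title, raw_date, parsed_date):
    """Build one log line with the fields suggested above."""
    fields = [
        ("feed", feed_url),
        ("fetched", fetch_date),
        ("title", title),
        ("raw_date", raw_date),
        ("parsed_date", parsed_date),
    ]
    return " ".join('%s="%s"' % (name, hex_escape(str(value))) for name, value in fields)

# Hypothetical entry whose date contains a non-breaking space (U+00A0).
line = date_debug_line(
    "http://www.tatanka.com.br/ies4linux/news/feed/",
    "2006-12-29 12:38",
    "Example entry",
    "Fri, 29\u00a0Dec 2006",
    " 00:00",
)
print(line, file=sys.stderr)
```

With the raw bytes visible, a stray non-ASCII character in the date string would show up directly in a user's report.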

Well, it seems pretty rare, and I don't have much time to test things myself, so I won't insist.


>
> could you tell me your expiration settings for the feed
> where switching feeds empties the list? 
>


Seven days, and I guess this is simply how it works, for now, with Akregator... (only taking into account the article dates, instead of the fetch date, to remove archived articles, as I suggested in comment #3). It's a usability problem; maybe someone already opened an enhancement request (I don't have the time to check and open a new one, though).
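
To make the difference concrete, here is a small sketch of the two expiry policies, assuming a seven-day limit. The helper "is_expired" and the dates are hypothetical and do not reflect Akregator's real implementation. An old article that was fetched a few minutes ago expires at once when its own date is used, but survives when the fetch date is used.

```python
from datetime import datetime, timedelta

def is_expired(reference_date, now, max_age_days):
    """True if reference_date is more than max_age_days before now."""
    return now - reference_date > timedelta(days=max_age_days)

now = datetime(2007, 1, 12, 16, 50)
pub_date = datetime(2006, 12, 20, 9, 0)     # hypothetical article date, over 7 days old
fetch_date = datetime(2007, 1, 12, 16, 45)  # fetched five minutes before "now"

by_article_date = is_expired(pub_date, now, 7)   # policy guessed at in this comment
by_fetch_date = is_expired(fetch_date, now, 7)   # suggested alternative

print(by_article_date, by_fetch_date)  # True False
```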
Comment 7 Frank Osterfeld 2008-10-26 22:33:26 UTC
Works for me in 4.1; please create a new report with an example feed if you can reproduce this in >= 4.1.