85624 – idea: "web scraping" support (support script output as feed source)

Status:	REPORTED

Alias:	None

Product:	akregator
Classification:	Applications
Component:	general
Version:	unspecified
Platform:	unspecified Linux

Importance:	NOR wishlist
Target Milestone:	---
Assignee:	kdepim bugs

URL:
Keywords:

Duplicates (3):	177268 187738 451188
Depends on:
Blocks:

Reported:	2004-07-21 15:07 UTC by Charles Phoenix
Modified:	2022-03-06 06:16 UTC
CC List:	6 users

See Also:
Latest Commit:
Version Fixed In:
Sentry Crash Report:

Description Charles Phoenix 2004-07-21 15:07:55 UTC

Version:           1.0-beta5 "Pierre" (using KDE 3.2.3, Gentoo)
Compiler:          gcc version 3.3.3 20040412 (Gentoo Linux 3.3.3-r6, ssp-3.3.2-2, pie-8.7.6)
OS:                Linux (i686) release 2.6.6-win4lin-r3

I would call this a future development idea.

FYI - "Web scraping is the practice of getting information from a web page and reformatting it."

The idea is to have scripts, hopefully community-created, that would convert a non-RSS site into an RSS-formatted file. I could easily see the scripts becoming standardized and shared freely. One naming method would be {site}-{date}
(e.g., www-cnn-com-20040721.py)
I like python. :)
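
As a rough illustration of what such a conversion script might look like, here is a minimal sketch using only the Python standard library. The page layout (headlines as links inside <h2> elements), the names HeadlineParser and html_to_rss, and the sample markup are all assumptions made for this example; a real site would need its own parsing rules.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HeadlineParser(HTMLParser):
    """Collect (title, link) pairs from <h2><a href="...">title</a></h2>."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_h2 = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            # Only links inside a headline are of interest (assumed layout).
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False


def html_to_rss(html, site_title, site_url):
    """Return an RSS 2.0 document (as a string) built from the scraped headlines."""
    parser = HeadlineParser()
    parser.feed(html)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Scraped from " + site_url
    for title, link in parser.items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")


# Hypothetical sample page standing in for a downloaded site.
SAMPLE = (
    '<html><body>'
    '<h2><a href="http://example.com/1">First story</a></h2>'
    '<h2><a href="http://example.com/2">Second story</a></h2>'
    '</body></html>'
)

if __name__ == "__main__":
    print(html_to_rss(SAMPLE, "Example site", "http://example.com/"))
```

A real script would fetch the page first and print the feed to standard output, so that the reader only ever sees the RSS.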

The method would be simple: akregator would have a script associated with a feed. The script outputs a valid XML file, so instead of getting the feed from the Internet, akregator gets it from the script. If the output is invalid, akregator would treat it just like an invalid source, so no security issues would be involved in using the scripts. Everything else about akregator remains the same.
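
The reader side of this could be sketched roughly as follows, in Python for illustration only (akregator itself is not written in Python). The function name load_feed_from_script and the timeout value are assumptions for the example. The sketch checks only the script's output: anything that fails to run, times out, or does not parse as an RSS document is reported as an unusable source. It does not sandbox the script.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def load_feed_from_script(command, timeout=30):
    """Run a feed script and return the parsed <rss> root, or None if unusable."""
    try:
        result = subprocess.run(
            command, capture_output=True, timeout=timeout, check=True
        )
        root = ET.fromstring(result.stdout)
    except (OSError, subprocess.SubprocessError, ET.ParseError):
        # Missing script, non-zero exit, timeout, or malformed XML:
        # treat all of them like an invalid feed source.
        return None
    if root.tag != "rss" or root.find("channel") is None:
        return None
    return root


# Two stand-in "scripts" for demonstration: one prints a feed, one prints junk.
GOOD = [sys.executable, "-c",
        "print('<rss version=\"2.0\"><channel><title>t</title></channel></rss>')"]
BAD = [sys.executable, "-c", "print('not xml at all')"]

if __name__ == "__main__":
    print(load_feed_from_script(GOOD) is not None)
    print(load_feed_from_script(BAD) is None)
```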

Local URLs are supported, but unless you set up cron jobs there is no automation or central repository.

Comment 1 kiza 2004-11-27 11:30:33 UTC

Hi, just stumbled upon this bug report.

If anyone is interested, I have created such a script repository that is currently used by Liferea (a GTK/Gnome reader) and my own console reader. The webpage is at http://kiza.kcore.de/software/snownews/snowscripts/ and is more or less exactly what Charles suggested, I think. :)

Comment 2 Charles Phoenix 2004-11-27 15:51:46 UTC

Yes, it is, and this is what I expected... the scripts already exist. Hopefully the idea will catch on and maybe KDE could standardize on a particular script language. The end result would be a repository that *any* RSS reader could use.

Comment 3 kiza 2004-11-28 11:53:42 UTC

They can already be used by any RSS reader that can either

1) execute and load external scripts as a feed source
or
2) pipe a downloaded resource through a script that converts it to an RSS feed on-the-fly.

#2 has the advantage that you don't need to download the source with an external application and can instead take advantage of the reader's built-in downloader, which should support things like conditional GET, compression, etc.
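
Option 2 could look roughly like this on the reader side: the reader downloads the resource itself and hands the bytes to a filter script on standard input, then reads the converted feed from standard output. This is only a sketch; the function name apply_filter and the demo filter are made up for the example, and the download step is left out.

```python
import subprocess
import sys


def apply_filter(filter_command, downloaded, timeout=30):
    """Pipe already-downloaded bytes through a filter script; return its stdout."""
    result = subprocess.run(
        filter_command,
        input=downloaded,        # the resource the reader fetched itself
        capture_output=True,
        timeout=timeout,
        check=True,              # a failing filter raises CalledProcessError
    )
    return result.stdout


# Stand-in filter for demonstration: wraps whatever it reads in a tiny feed.
DEMO_FILTER = [
    sys.executable, "-c",
    "import sys; title = sys.stdin.read().strip(); "
    "print('<rss version=\"2.0\"><channel><title>' + title + '</title></channel></rss>')",
]

if __name__ == "__main__":
    print(apply_filter(DEMO_FILTER, b"Hello").decode())
```

Because the filter never touches the network, it stays small and the reader keeps control over caching and transfer details.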

I don't think a particular language needs to be standardized, as long as it's commonly used/installed, like Perl, Python or bash+GNU textutils.

Anyway, you're free to use and link to this page and help create more scripts. Of course, complaining to the sites the scripts work for, so that they provide a decent RSS feed in the first place, is the ultimate goal. ;)

Comment 4 Frank Osterfeld 2008-12-09 00:22:22 UTC

*** Bug 177268 has been marked as a duplicate of this bug. ***

Comment 5 Frank Osterfeld 2009-04-10 19:08:01 UTC

*** Bug 187738 has been marked as a duplicate of this bug. ***

Comment 6 Justin Zobel 2021-03-09 04:11:38 UTC

Thank you for the bug report.

As this report hasn't seen any changes in 5 years or more, we ask if you can please confirm that the issue still persists.

If this bug is no longer persisting or relevant please change the status to resolved.

Comment 7 Forest 2022-01-24 03:36:10 UTC

Yes, the desire for this feature still exists. (I'm another ex-Liferea user.)

Comment 8 Alberto Cavalin 2022-01-24 08:42:33 UTC

Thank you, but in the meantime I managed to write my own feed aggregator with full scraping support :-D
If someone is interested here it is: https://github.com/acavalin/rrss

Comment 9 genghiskhan 2022-03-06 06:10:37 UTC

*** Bug 451188 has been marked as a duplicate of this bug. ***

Comment 10 genghiskhan 2022-03-06 06:16:17 UTC

This is a must-have feature.

There are several web scraping scripts that let users select a set of rules (CSS selectors or XPath), and the script does the rest of the work.

I'm using it extensively with Liferea.

No RegEx is needed.
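
To illustrate the rule-based approach, here is a minimal sketch in Python. It is not taken from any of the scripts mentioned above: the rule names, the extract_items function and the sample page are assumptions for the example. It relies on the limited XPath subset in the standard library's ElementTree, which also requires well-formed markup; real pages would usually need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: one expression to find each entry,
# plus expressions relative to the entry for its title and link.
RULES = {
    "item": ".//div[@class='post']",
    "title": "./h2",
    "link": "./a",
}


def extract_items(xhtml, rules):
    """Apply the rules to a well-formed page and return title/link dicts."""
    root = ET.fromstring(xhtml)
    items = []
    for node in root.findall(rules["item"]):
        title = node.find(rules["title"])
        link = node.find(rules["link"])
        if title is None or link is None:
            continue  # skip entries that do not match the rules
        items.append({
            "title": (title.text or "").strip(),
            "link": link.get("href"),
        })
    return items


# Sample page standing in for a real site (assumed structure).
PAGE = """<html><body>
<div class="post"><h2>First</h2><a href="http://example.com/1">read</a></div>
<div class="ad"><h2>Buy now</h2><a href="http://example.com/ad">ad</a></div>
<div class="post"><h2>Second</h2><a href="http://example.com/2">read</a></div>
</body></html>"""

if __name__ == "__main__":
    for entry in extract_items(PAGE, RULES):
        print(entry["title"], entry["link"])
```

The user only edits the rules; the extraction code stays the same for every site.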