RSS Saver

When I was reviewing OSINT tools from the Intel Techniques course, I wondered if there was a better way to keep track of what's going on with a specific website. You could screenshot that website multiple times per day, every day, but that quickly becomes tedious, and you could get rate-limited.

It occurred to me that websites often announce updates via an RSS feed. Any feed reader will display the feed's entries, but could I use the feed to download each linked page in full, without needing to visit the site directly? The answer, of course, is yes, but I couldn't find an existing tool that did it. That's why I wrote RSS Saver. Here's how it works:

usage: rss-saver.py [-h] [--url URL] [--output OUTPUT] [--type TYPE]

An RSS feed article downloader.

options:
  -h, --help                    show this help message and exit
  --url URL, -u URL             What is the URL of the RSS Feed?
  --output OUTPUT, -o OUTPUT    Which directory should the articles be saved into?
  --type TYPE, -t TYPE          Do you want "full" or "simple" articles?

Run the script, tell it the URL of the RSS feed, the directory to save the files into, and whether you want the articles saved as full or simple versions.
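The actual implementation lives on GitHub, but the general approach can be sketched with nothing but the Python standard library: fetch the feed, pull the title and link out of each `<item>`, download the linked page, and save it under a filename derived from the title. This is a minimal sketch of that idea, not the real script; the function names and the exact filename rule here are assumptions.

```python
import urllib.request
import xml.etree.ElementTree as ET
from pathlib import Path

def parse_feed(xml_text):
    """Extract (title, link) pairs from the <item> elements of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items

def save_articles(feed_url, out_dir):
    """Fetch the feed, then download each linked article and save it to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(feed_url) as resp:
        xml_text = resp.read()
    for title, link in parse_feed(xml_text):
        # Filename rule assumed from the example output below: letters only.
        name = "".join(c for c in title if c.isalpha()) + ".html"
        with urllib.request.urlopen(link) as page:
            (out / name).write_bytes(page.read())
        print(f"Article '{title}' saved to '{out / name}'")
```

A "simple" mode would presumably strip the downloaded HTML down to the article text before saving, rather than writing the raw page.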

Here's example output using the RSS feed from CNN:

$ ./rss-saver.py -u http://rss.cnn.com/rss/cnn_topstories.rss -o output/ -t full
Article 'Some on-air claims about Dominion Voting Systems were false, Fox News acknowledges in statement after deal is announced' saved to 'output/SomeonairclaimsaboutDominionVotingSystemswerefalseFoxNewsacknowledgesinstatementafterdealisannounced.html'
Article 'Dominion still has pending lawsuits against election deniers such as Rudy Giuliani and Sidney Powell' saved to 'output/DominionstillhaspendinglawsuitsagainstelectiondenierssuchasRudyGiulianiandSidneyPowell.html'
Article 'Here are the 20 specific Fox broadcasts and tweets Dominion says were defamatory' saved to 'output/HerearethespecificFoxbroadcastsandtweetsDominionsaysweredefamatory.html'
Article 'Judge in Fox News-Dominion defamation trial: 'The parties have resolved their case'' saved to 'output/JudgeinFoxNewsDominiondefamationtrialThepartieshaveresolvedtheircase.html'
Article ''Difficult to say with a straight face': Tapper reacts to Fox News' statement on settlement' saved to 'output/DifficulttosaywithastraightfaceTapperreactstoFoxNewsstatementonsettlement.html'
Article 'Millions in the US could face massive consequences unless McCarthy can navigate out of a debt trap he set for Biden' saved to 'output/MillionsintheUScouldfacemassiveconsequencesunlessMcCarthycannavigateoutofadebttraphesetforBiden.html'
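Judging by the filenames in the output above, each title is reduced to its alphabetic characters before the .html extension is added (note that the "20" disappears from the Fox broadcasts headline, along with all spaces and punctuation). Assuming that is the rule, the mapping looks like:

```python
def title_to_filename(title):
    """Keep only alphabetic characters from the title, then append .html.
    (Assumed rule, inferred from the example output.)"""
    return "".join(c for c in title if c.isalpha()) + ".html"
```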

The HTML files are saved into the new "output" directory, as I specified. Here's an example. If you click on that link, you might think it looks just like CNN, images included. In fact, the images are still being pulled in from cnn.com, and eventually they will be deleted; only the text is actually saved. If there is an article you need to keep indefinitely, images and all, take a direct screenshot or save a PDF as soon as possible. If the text is all you need, though, this is a great way to capture it. Articles often quietly disappear or get re-edited before the Wayback Machine has a chance to save them. By keeping your own copy of the text, you can get ahead of the curve.

Some final notes:

  1. Read the GitHub page for the required Python dependencies. I give the install commands for Ubuntu/Debian, Arch, and openSUSE. They can also be installed with pip if you prefer.
  2. RSS Saver works well with Torsocks, and it works in Tails OS. If you're monitoring a blog or other RSS-enabled website run by people you don't want to know you're watching, this is a good option.
  3. For many sites, RSS feeds are a relic of the past, and the feeds only link to paywalled pages. In those cases, RSS Saver is only good for saving headlines and URLs: it won't magically get past a paywall, and it can be blocked by a CAPTCHA, which limits its usefulness on many websites. The Wayback Machine has the same problem; it can't always get past those obstacles either.