 

How do large RSS readers work (Netvibes, Google Reader...)?


I wonder how web applications like Google Reader, Bloglines, and Technorati work, and what techniques they use to parse millions of RSS feeds. Do they fetch them all at once with a cron job?

aniss.bouraba asked Oct 16 '10 16:10





1 Answer

There are many different techniques; the "worst" one is the one you describe (time-based polling).

The first thing to consider is that they may not all do the parsing on the server side. For example, I know that Netvibes was doing the parsing on the client side (but cached the content on the server), which saved them a lot of resources. This way they would poll feeds only when users asked for them, so they did not need to run some kind of timed loop.
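To make the "poll only when a user asks" idea concrete, here is a minimal sketch of the server-side cache half of that design. It is not Netvibes' actual code: the class name `OnDemandFeedCache`, the injected `fetch` function, and the five-minute freshness window are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class OnDemandFeedCache:
    """Fetch a feed only when a user asks for it, reusing recent copies."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # hypothetical fetcher: URL -> raw feed text
        self._ttl = ttl_seconds    # assumed freshness window, not from the answer
        self._clock = clock        # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]       # fresh enough: no request to the publisher
        body = self._fetch(url)    # missing or stale: fetch on demand
        self._entries[url] = (now, body)
        return body
```

With this shape, a feed that nobody opens is never fetched, and a popular feed is fetched at most once per freshness window no matter how many users request it.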

Time-based polling is still, unfortunately, the most common solution. There are many techniques for deciding when to poll: the frequency of past updates, the number of users who subscribed, etc. The old XML-RPC ping servers can also be used by these services.
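As a rough illustration of choosing a poll time from past update frequency and subscriber count, the sketch below computes the delay before the next poll. The function name, the "poll about twice per typical gap" rule, the subscriber scaling, and the five-minute and one-day bounds are all assumptions picked for the example, not a description of what any particular reader does.

```python
from statistics import median
from typing import Sequence


def next_poll_interval(update_times: Sequence[float], subscribers: int,
                       min_interval: float = 300.0,
                       max_interval: float = 86400.0) -> float:
    """Return the number of seconds to wait before polling a feed again."""
    if len(update_times) < 2:
        base = max_interval  # no history yet: poll rarely until we learn more
    else:
        ordered = sorted(update_times)
        gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
        base = median(gaps) / 2  # aim to poll about twice per typical gap
    # Popular feeds are polled more often: halve the interval for every
    # extra decimal digit in the subscriber count (1-9, 10-99, 100-999, ...).
    digits = len(str(max(subscribers, 1))) - 1
    interval = base / (2 ** digits)
    # Clamp so quiet feeds are still checked and busy feeds are not hammered.
    return min(max(interval, min_interval), max_interval)
```

For a feed that has updated hourly and has a single subscriber, this waits 30 minutes; with a few hundred subscribers the same feed would be polled every 7.5 minutes.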

The most efficient technique is to use PubSubHubbub, an open protocol used by Google Reader, Netvibes, and a few thousand other apps (like Digg.com, Twitterfeed, FriendFeed...). It allows the feed publisher to push the content of the feed directly to subscribing applications. It's very efficient, but it requires the publisher to implement it. Fortunately, all the big blogging platforms (Tumblr, Posterous, WordPress, Blogger, SixApart, etc.) have implemented it, as have other feed-publishing apps (like FeedBurner, Gowalla...). If you publish feeds, I would encourage you to join this crowd, and if you plan on consuming some, please implement the subscriber side as well.
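For a feel of what the subscriber side involves, here is a minimal sketch using only the Python standard library. The subscriber POSTs a form to the hub, and the hub then calls the subscriber's callback URL with a challenge that must be echoed back. The `hub.*` parameter names come from the PubSubHubbub specification; the function names, the one-day lease, and the example URLs are assumptions, and hubs that implement older drafts of the protocol may expect additional parameters.

```python
from typing import Iterable, Mapping, Tuple
from urllib.request import Request
from urllib.parse import urlencode


def build_subscription_request(hub_url: str, topic_url: str, callback_url: str,
                               lease_seconds: int = 86400) -> Request:
    """Build (but do not send) the POST that asks a hub to push a feed to us."""
    form = {
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed we want pushed
        "hub.callback": callback_url,  # our endpoint that will receive pushes
        "hub.lease_seconds": str(lease_seconds),  # assumed one-day lease
    }
    return Request(
        hub_url,
        data=urlencode(form).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def answer_verification(query: Mapping[str, str],
                        wanted_topics: Iterable[str]) -> Tuple[int, str]:
    """Answer the hub's GET that checks we really asked for this subscription."""
    if (query.get("hub.mode") == "subscribe"
            and query.get("hub.topic") in set(wanted_topics)
            and "hub.challenge" in query):
        return 200, query["hub.challenge"]  # echo the challenge to confirm
    return 404, ""                          # not something we requested
```

Sending the request is then a matter of passing it to `urllib.request.urlopen`, and `answer_verification` can be called from whatever web framework serves the callback URL. After verification, the hub POSTs new feed content to the callback, so no polling loop is needed for that feed.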

The last solution is to have a third-party application do this data gathering (using all the techniques above) and ping you when the feeds actually have new content. I created one, Superfeedr, and I believe we do a good job with this. We also normalize the content and do a few other things to help you consume feed data in the simplest and cheapest way (polling can be crazy expensive). We use the exact same PubSubHubbub protocol to push content from any feed, which makes it very simple for our users to use our service in addition to subscribing to available hubs.
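When content is pushed to you over PubSubHubbub, your callback receives a POST whose body is the new feed content. The protocol lets a subscriber register a shared secret (`hub.secret`), in which case the hub signs each push with an `X-Hub-Signature: sha1=<hex digest>` header. The sketch below checks that header; the function name is hypothetical, and whether a given hub or service signs its pushes is something to confirm in its own documentation.

```python
import hashlib
import hmac


def is_authentic_push(body: bytes, signature_header: str, secret: bytes) -> bool:
    """Check an X-Hub-Signature header of the form 'sha1=<hex digest>'."""
    method, _, received = signature_header.partition("=")
    if method != "sha1" or not received:
        return False  # missing header or an algorithm this sketch does not handle
    expected = hmac.new(secret, body, hashlib.sha1).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("ascii"),
                               received.lower().encode("utf-8"))
```

A callback handler would call this on the raw request body before parsing it, and ignore the content if the check fails.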

Also, I should add that I was able to reply quickly to your question because I use an app that pushes me the content of the feed for questions tagged RSS :)

Julien Genestoux answered Oct 14 '22 00:10
