
How to detect changed and new items in an RSS feed?

Tags: python, rss, feed

When using feedparser or some other Python library to download and parse RSS feeds, how can I reliably detect new items and modified items?

So far I have seen new items appear in feeds with publication dates earlier than that of the latest item. I have also seen feed readers display the same item, republished with slightly different content, as separate items. I am not implementing a feed reader application; I just want a sane strategy for archiving feed data.

muhuk asked Mar 31 '09 08:03



2 Answers

It depends on how much you trust the feed source. feedparser provides an .id attribute for feed items; this attribute should be unique for both RSS and Atom sources. For an example, see feedparser's Atom documentation. Although .id will cover most cases, it's conceivable that a source might publish multiple items with the same id. In that case, you don't have much choice but to hash the item's content.
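As a minimal sketch of that strategy, the functions below key each item by its id and keep a content hash per key, so that an item seen before with different content is reported as modified. Publication dates are deliberately ignored, since the question notes they are not reliable. The names `CONTENT_FIELDS`, `entry_key`, `entry_fingerprint` and `classify_entries` are hypothetical, as are the choice of hashed fields and the fallback to the link when an item has no id; entries are assumed to be dict-like.

```python
import hashlib
import json

# Fields hashed to detect edits. This selection is an assumption;
# adjust it to whatever you archive.
CONTENT_FIELDS = ("title", "link", "summary")


def entry_fingerprint(entry):
    """Hash the content fields so that edits to an item can be detected."""
    payload = json.dumps(
        [entry.get(field, "") for field in CONTENT_FIELDS],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def entry_key(entry):
    """Identify an entry by its id, then its link, then its content hash."""
    return entry.get("id") or entry.get("link") or entry_fingerprint(entry)


def classify_entries(entries, archive):
    """Split entries into (new, modified) and update archive in place.

    archive maps entry key -> fingerprint recorded on earlier runs.
    """
    new, modified = [], []
    for entry in entries:
        key = entry_key(entry)
        fingerprint = entry_fingerprint(entry)
        if key not in archive:
            new.append(entry)
        elif archive[key] != fingerprint:
            modified.append(entry)
        archive[key] = fingerprint
    return new, modified
```

Because feedparser entries are dict-like, something like `classify_entries(feedparser.parse(url).entries, archive)` should work, with `archive` saved between runs (for example as JSON). If a source reuses one id for several distinct items, this sketch reports the later ones as modifications of the first; keying on the fingerprint alone avoids that, at the cost of treating every edit as a new item.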

Kimi answered Oct 02 '22 22:10


The feedparser documentation describes two HTTP features that tell you whether a feed as a whole has changed since you last requested it:

1. Using ETags to reduce bandwidth

The basic concept is that a feed publisher may provide a special HTTP header, called an ETag, when it publishes a feed. You should send this ETag back to the server on subsequent requests. If the feed has not changed since the last time you requested it, the server will return a special HTTP status code (304) and no feed data.

    >>> import feedparser
    >>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
    >>> d.etag
    '"6c132-941-ad7e3080"'
    >>> d2 = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', etag=d.etag)
    >>> d2.status
    304
    >>> d2.feed
    {}
    >>> d2.entries
    []
    >>> d2.debug_message
    'The feed has not changed since you last checked, so
    the server sent no data.  This is a feature, not a bug!'

2. Using Last-Modified headers to reduce bandwidth

In this case, the server publishes the last-modified date of the feed in the HTTP header. You can send this back to the server on subsequent requests, and if the feed has not changed, the server will return HTTP status code 304 and no feed data.

    >>> import feedparser
    >>> d = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml')
    >>> d.modified
    (2004, 6, 11, 23, 0, 34, 4, 163, 0)
    >>> d2 = feedparser.parse('http://feedparser.org/docs/examples/atom10.xml', modified=d.modified)
    >>> d2.status
    304
    >>> d2.feed
    {}
    >>> d2.entries
    []
    >>> d2.debug_message
    'The feed has not changed since you last checked, so
    the server sent no data.  This is a feature, not a bug!'
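
The same conditional-request exchange can be written with only the standard library, which may help show what goes over the wire. This is a rough sketch rather than feedparser's own implementation: `conditional_headers` and `fetch_if_changed` are hypothetical names, and the 30-second timeout is an arbitrary choice. The stored ETag is sent back as `If-None-Match` and the stored date as `If-Modified-Since`; a 304 reply means there is nothing new to archive.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified); body is None if the feed is unchanged."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        # urllib reports 304 Not Modified as an HTTPError.
        if error.code == 304:
            return None, etag, last_modified
        raise
```

Store the returned ETag and Last-Modified values alongside the archive and pass them in on the next run. A changed feed still has to be compared item by item, because these headers describe the feed document as a whole, not individual entries.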

Ron Hudson answered Oct 02 '22 22:10