Using feedparser or some other Python library to download and parse RSS feeds; how can I reliably detect new
items and modified
items?
So far I have seen new items in feeds with publication dates earlier than the latest item. Also I have seen feed readers displaying the same item published with slightly different content as seperate items. I am not implementing a feed reader application, I just want a sane strategy for archiving feed data.
The RSS aggregator checks websites for new content automatically. It immediately pulls that content over to your feed reader so you don't have to go and check each website individually to find new content.
To check an RSS feed's validity, you can use an RSS validator, such as the one at http://feedvalidator.org/. To validate your RSS feed, all you have to do is enter the URL of your feed into the text field (Figure 3.35) and click the Validate button.
As all good tutorials on the subject will tell you, metadata is data about data. In the case of RSS 2.0, this includes the name of the author of the feed, the date the channel was last updated, and so on. In Example 5-1, the bold code is the metadata.
Push Content – Providing your visitors with an RSS feed of your dynamic content (such as news) allows them an easy way to receive and consume the content without having to remember to visit your website. In short, your content is pushed out to them instead of waiting on them to come to you.
It depends on how much you trust the feed source. feedparser provides an .id attribute for feed items -- this attribute should be unique for both RSS and ATOM sources. For an example, see eg feedparser's ATOM docs. Though .id will cover most cases, it's conceivable that a source might publish multiple items with the same id. In that case, you don't have much choice but to hash the item's content.
There are two HTTP Features in the documentation for feedparser that can accomplish this:
The basic concept is that a feed publisher may provide a special HTTP header, called an ETag, when it publishes a feed. You should send this ETag back to the server on subsequent requests. If the feed has not changed since the last time you requested it, the server will return a special HTTP status code (304) and no feed data.
import feedparser
d = feedparser.parse('` <http://feedparser.org/docs/examples/atom10.xml>`_')
d.etag``'"6c132-941-ad7e3080"'``
d2 = feedparser.parse('` <http://feedparser.org/docs/examples/atom10.xml>`_', etag=d.etag)
d2.status``304``
d2.feed``{}``
d2.entries``[]``
d2.debug_message``'The feed has not changed since you last checked, so
the server sent no data. This is a feature, not a bug!'
In this case, the server publishes the last-modified date of the feed in the HTTP header. You can send this back to the server on subsequent requests, and if the feed has not changed, the server will return HTTP status code 304 and no feed data.
import feedparser
d = feedparser.parse('` <http://feedparser.org/docs/examples/atom10.xml>`_')
d.modified``(2004, 6, 11, 23, 0, 34, 4, 163, 0)``
d2 = feedparser.parse('` <http://feedparser.org/docs/examples/atom10.xml>`_', modified=d.modified)
d2.status``304``
d2.feed``{}``
d2.entries``[]``
d2.debug_message``'The feed has not changed since you last checked, so
the server sent no data. This is a feature, not a bug!'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With