
Extracting new items from an RSS feed

I'm writing an application which takes data input from a series of arbitrary RSS feeds. The feeds are polled asynchronously in the background and a method is called every time a new item is added to the feed.

My problem is identifying the new items in the feed. What's the best way to do it? I have come up with a few ideas, but they're all flawed.

Suggestion: Every time you poll, keep all items newer than the pubDate of the last item in the last poll.
Problem: pubDate is not a required field.

Suggestion: Keep a hash of the content of every item you return, and do not return content with the same hash.
Problem: Memory usage rapidly grows out of control.

Martin asked Dec 17 '10 10:12



1 Answer

How about both?

Use pubDate for the feeds that provide it, and keep a hash of the items for the others. If most of the feeds provide a pubDate, and the number of feeds does not run into the millions, you should be OK in terms of both performance and memory.
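
A rough sketch of what combining the two could look like in Python, in case it helps. Everything here is an assumption for illustration: the `FeedTracker` class, the `new_items` method, and the idea that each item has already been parsed into a dict with optional `pubDate`, `title`, `link` and `description` keys are all hypothetical, not part of any particular feed library.

```python
import hashlib
from email.utils import mktime_tz, parsedate_tz


class FeedTracker:
    """Remembers what has already been seen for ONE feed (hypothetical helper)."""

    def __init__(self):
        self.last_pub_date = None  # newest pubDate seen so far, as a UTC timestamp
        self.seen_hashes = set()   # hashes of items that had no usable pubDate

    @staticmethod
    def _content_hash(item):
        # Hash the fields that identify an item; adjust to taste.
        parts = (item.get("title", ""), item.get("link", ""), item.get("description", ""))
        return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()

    def new_items(self, items):
        """Return the items of this poll that were not seen in earlier polls."""
        fresh = []
        newest = self.last_pub_date
        for item in items:
            # RSS dates are RFC 822 style; parsedate_tz returns None if unparseable.
            parsed = parsedate_tz(item.get("pubDate") or "")
            if parsed is not None:
                published = mktime_tz(parsed)
                if self.last_pub_date is None or published > self.last_pub_date:
                    fresh.append(item)
                    if newest is None or published > newest:
                        newest = published
            else:
                # No usable pubDate: fall back to remembering a content hash.
                digest = self._content_hash(item)
                if digest not in self.seen_hashes:
                    self.seen_hashes.add(digest)
                    fresh.append(item)
        self.last_pub_date = newest
        return fresh
```

You would keep one tracker per feed and call `new_items` with the parsed items after every poll. Only the items without a usable pubDate end up in the hash set, which is what should keep memory in check if most feeds do provide dates. Note that the comparison is strictly "newer than", so an item published in the same second as the newest item of the previous poll would be skipped.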

SWeko answered Sep 28 '22 08:09
