I'm writing an application which takes data input from a series of arbitrary RSS feeds. The feeds are polled asynchronously in the background and a method is called every time a new item is added to the feed.
My problem is identifying the new items in the feed. What's the best way to do it? I have come up with a few ideas, but they're all flawed.
Suggestion: Every time you poll, keep all items newer than the pubDate of the last item in the last poll.
Problem: pubDate is not a required field.
Suggestion: Keep a hash of the content of every item you return, and do not return content with the same hash.
Problem: Memory usage rapidly grows out of control.
How about both?
Use pubDate on those feeds that do return it, and keep a hash of the items for the others. If most of the feeds return a pubDate, and the number of feeds does not run into the millions, you should be OK in terms of both performance and memory.
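As a rough sketch of that combined approach (not a definitive implementation), something like the following could work. It assumes each feed item has already been parsed into a plain dict with optional "title", "link", "description" and "pubDate" keys, and that you keep one tracker per feed. The names FeedTracker, new_items, _parse_pub_date and _content_hash are made up for illustration.

```python
import hashlib
from datetime import timezone
from email.utils import parsedate_to_datetime


def _parse_pub_date(value):
    """Return an aware datetime for an RFC 822 style date string, or None."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable date: treat the item as having no pubDate
    if parsed.tzinfo is None:
        # Assumption: dates without a usable timezone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def _content_hash(item):
    """Hash the fields that identify an item (an assumed choice of fields)."""
    parts = [item.get(key) or "" for key in ("title", "link", "description")]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


class FeedTracker:
    """Hypothetical per-feed state: newest pubDate seen plus fallback hashes."""

    def __init__(self):
        self.latest_pub_date = None
        self.seen_hashes = set()

    def new_items(self, items):
        """Return the items from one poll that were not seen in earlier polls."""
        cutoff = self.latest_pub_date  # newest pubDate from previous polls
        fresh = []
        for item in items:
            published = _parse_pub_date(item.get("pubDate"))
            if published is not None:
                # Dated item: new if it is newer than the previous cutoff.
                if cutoff is None or published > cutoff:
                    fresh.append(item)
                if self.latest_pub_date is None or published > self.latest_pub_date:
                    self.latest_pub_date = published
            else:
                # Undated item: fall back to remembering a content hash.
                digest = _content_hash(item)
                if digest not in self.seen_hashes:
                    self.seen_hashes.add(digest)
                    fresh.append(item)
        return fresh
```

Only items without a usable pubDate add to the hash set, so memory grows with the undated items alone. If that still gets too large, you could drop hashes for items that no longer appear in the feed.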