I need to parse thousands of feeds, and performance is an essential requirement. Do you have any suggestions?
Thanks in advance!
I haven't tried it, but I read about Feedzirra recently; it claims to be built for performance:
Feedzirra is a feed library designed to fetch and update many feeds as quickly as possible. This includes using libcurl-multi (through the taf2-curb gem) for faster HTTP fetches, and libxml (through nokogiri and sax-machine) for faster parsing.
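Again, I haven't run this myself, but based on Feedzirra's documented fetch_and_parse API, fetching a batch of feeds concurrently might look roughly like this (a sketch; return values can differ between versions):

require 'rubygems'
require 'feedzirra'

# Passing an array of URLs lets Feedzirra fetch them concurrently
# via libcurl-multi; it returns a hash of url => parsed feed
# (or an HTTP status code for a failed fetch).
urls = ['http://feeds.feedburner.com/engadget',
        'http://feeds.feedburner.com/TechCrunch']

feeds = Feedzirra::Feed.fetch_and_parse(urls)

feeds.each do |url, feed|
  next unless feed.respond_to?(:entries) # skip failed fetches
  puts "#{url}: #{feed.entries.size} entries"
end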
You can use RFeedParser, a Ruby port of the well-known Python Universal Feed Parser. It's based on Hpricot, and it's fast and easy to use.
http://rfeedparser.rubyforge.org/
An example:
require 'rubygems'
require 'rfeedparser'
require 'open-uri'

# Fetch the feed over HTTP and parse it
feed = FeedParser::parse(open('http://feeds.feedburner.com/engadget'))

# Print the title of each entry
feed.entries.each do |entry|
  puts entry.title
end
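That handles a single feed; since the question involves thousands, here's a minimal sketch of running the same call across a small thread pool (assuming a plain array of URLs, and that the HTTP fetches rather than the parsing dominate the runtime):

require 'rubygems'
require 'rfeedparser'
require 'open-uri'
require 'thread'

urls = ['http://feeds.feedburner.com/engadget'] # ...plus thousands more

queue = Queue.new
urls.each { |u| queue << u }

# A few worker threads drain the queue; fetches are I/O-bound,
# so threads overlap the network waits even under MRI's GIL
workers = Array.new(4) do
  Thread.new do
    loop do
      url = begin
              queue.pop(true) # non-blocking pop
            rescue ThreadError
              break           # queue is empty, we're done
            end
      feed = FeedParser::parse(open(url))
      feed.entries.each { |entry| puts entry.title }
    end
  end
end
workers.each(&:join)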
When all you have is a hammer, everything looks like a nail. Consider a solution other than Ruby for this. Though I love Ruby and Rails and would not part with them for web development, or perhaps for a domain-specific language, I prefer that heavy data lifting of the kind you describe be done in Java, or perhaps Python or even C++.
Given that the destination of this parsed data is likely a database, the database can act as the common point between the Rails portion of your solution and the portion written in another language. You'd then be using the best tool for each problem, and the result is likely to be easier to work on and to truly meet your requirements.
If speed is truly of the essence, why add the extra constraint of saying, "Oh, it's only of the essence as long as I get to use Ruby"?