I know that Hpricot is still a standard, but I remember hearing about a faster, more expressive HTML parser for Ruby.
Does anybody know what it's called, and whether it is worth switching to from Hpricot?
Thanks in advance
You are probably thinking about Nokogiri. I have not used it myself, but "everyone" is talking about it and the benchmarks do look interesting:
hpricot:html:doc    48.930000   3.640000  52.570000 ( 52.900035)
hpricot2:html:doc    4.500000   0.020000   4.520000 (  4.518984)
nokogiri:html:doc    3.640000   0.130000   3.770000 (  3.770642)
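The timings above have the shape of output from Ruby's standard Benchmark module: user CPU time, system CPU time, their total, and elapsed real time in parentheses. The script that produced them is not shown here, so the following is only a sketch of how such a comparison could be set up. The labels, the sample input, and the run_benchmarks helper are hypothetical, and the workloads are plain string operations standing in for real parser calls such as Nokogiri::HTML(html).

```ruby
require 'benchmark'

# Hypothetical sample input; a real comparison would load an actual HTML page.
sample_html = "<p>Some text</p>" * 1_000

# Stand-in workloads. In a real benchmark each lambda would call the parser
# under test instead of these string operations.
workloads = {
  "scan:html:doc"  => -> { sample_html.scan(/<p>/).size },
  "split:html:doc" => -> { sample_html.split("<p>").size }
}

# Runs every workload `iterations` times and prints one timing row per label.
# Returns the array of Benchmark::Tms results collected by Benchmark.bm.
def run_benchmarks(workloads, iterations)
  Benchmark.bm(18) do |bm|
    workloads.each do |label, work|
      bm.report(label) { iterations.times { work.call } }
    end
  end
end

results = run_benchmarks(workloads, 100)
```

Swapping the lambdas for calls into each parser, with the same input document, would give rows comparable to the ones quoted above.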
There are multiple tools available. I use Nokogiri.
Demo:
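require 'rubygems' # only needed on Ruby 1.8; later versions load RubyGems automatically
require 'nokogiri'

# Parse an HTML fragment into a document.
doc = Nokogiri::HTML(%{
<h1 class="title">Hello, World</h1>
<p>Some text</p>
<a href="http://www.google.com/">Some link</a>
})

# at_css returns the first node matching a CSS selector.
title = doc.at_css("h1.title").text   # "Hello, World"
content = doc.at_css("p").text        # "Some text"
url = doc.at_css("a")[:href]          # "http://www.google.com/"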
Ryan Bates made an excellent screencast about using it: #190: Screen Scraping with Nokogiri.
Documentation: http://nokogiri.org/
Tutorials: http://nokogiri.org/tutorials
There is also Rubyful Soup, which sells itself as a lightweight, quick-and-dirty parser. I found the interface very intuitive and 'Ruby-ish' when using it for a project in the past, which is perhaps a little surprising given that it is a Python port.
Edit: unfortunately it looks like it's no longer maintained, so it's probably not the one you were looking for. Nokogiri is most likely the one you've been hearing about.
Don't use regular expressions -- Ruby's regex stuff is way too slow. Hpricot is awesome and Nokogiri looks promising, though I've not used it directly yet.