
Best Rails HTML Parser [closed]

Tags: html, parsing, ruby

I know that Hpricot is still the standard, but I remember hearing about a faster, more expressive HTML parser for Ruby.

Does anybody know what it's called, and whether it is worth switching to from Hpricot?

Thanks in advance

ewakened asked Dec 27 '08 18:12




4 Answers

You are probably thinking about Nokogiri. I have not used it myself, but "everyone" is talking about it and the benchmarks do look interesting:

                       user   system     total        real
hpricot:html:doc  48.930000 3.640000 52.570000 ( 52.900035)
hpricot2:html:doc  4.500000 0.020000  4.520000 (  4.518984)
nokogiri:html:doc  3.640000 0.130000  3.770000 (  3.770642)
Wes Oldenbeuving answered Oct 17 '22 14:10


There are multiple tools available. I use Nokogiri.

Demo:

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::HTML(%{
  <h1 class="title">Hello, World</h1>
  <p>Some text</p>
  <a href="http://www.google.com/">Some link</a>
})

title   = doc.at_css("h1.title").text
content = doc.at_css("p").text
url     = doc.at_css("a")[:href]

Ryan Bates made an excellent screencast about using it: #190: Screen Scraping with Nokogiri.

Documentation: http://nokogiri.org/

Tutorials: http://nokogiri.org/tutorials

iblue answered Oct 17 '22 13:10


There is also Rubyful Soup, which sells itself as a lightweight, quick-and-dirty parser. I found the interface very intuitive and 'Ruby-ish' when using it for a project in the past, which is perhaps a little surprising given that it is a Python port.

Edit: unfortunately, it looks like it's no longer maintained, so it's probably not the one you were looking for. Nokogiri is most likely the one you've been hearing about.

maxaposteriori answered Oct 17 '22 13:10


Don't use regular expressions -- Ruby's regex handling is way too slow for this. Hpricot is awesome and Nokogiri looks promising, though I've not used it directly yet.

sammich answered Oct 17 '22 12:10