What is the best way to parse a web page in Ruby?

I have been looking at XML and HTML libraries on rubyforge for a simple way to pull data out of a web page. For example if I want to parse a user page on stackoverflow how can I get the data into a usable format?

Say I want to parse my own user page for my current reputation score and badge listing. I tried to convert the source retrieved from my user page into xml but the conversion failed due to a missing div. I know I could do a string compare and find the text I'm looking for, but there has to be a much better way of doing this.

I want to incorporate this into a simple script that spits out my user data at the command line, and possibly expand it into a GUI application.

asked Sep 26 '08 by Jeremy Mack

People also ask

Is Ruby good for web scraping?

Web scraping with Ruby is all about finding and choosing the right gem. A considerable number of gems have been developed to cover every step of the web scraping process, from sending HTTP requests to creating CSV files. Ruby's gems, such as HTTParty and Nokogiri, are perfectly suited to static web pages with constant URLs.
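
As a rough sketch of how those two gems fit together (the URL below is a placeholder, not taken from the original question):

require 'httparty'
require 'nokogiri'

# Fetch the raw HTML (example.com is a placeholder URL)
response = HTTParty.get("https://example.com")

# Parse the response body into a searchable document tree
doc = Nokogiri::HTML(response.body)

# Print the page title as a quick sanity check
puts doc.title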

What is Web scraping in Ruby?

Web scraping is used to extract useful data from websites; the extracted data can then be used in many applications. Web scraping is mainly useful for gathering data when there is no other means to collect it, e.g. an API or feeds. Creating a web scraping application using Ruby on Rails is pretty easy.

Is it possible to parse HTML in Ruby?

This task can be a bit difficult if you don’t have the right tools. But today you’re in luck! Because Ruby has this wonderful library called Nokogiri, which makes HTML parsing a walk in the park. Let’s see some examples.

How do I parse data directly from a URL?

parsed_data = Nokogiri::HTML.parse(html)
puts parsed_data.title
# => "test"

If you want to parse data directly from a URL instead of an HTML string… this will download the HTML and get you the title. Getting the title is nice, but you probably want to see more advanced examples, right? Let's take a look at how to extract links from a website, as sketched below.
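
A minimal sketch of that link extraction with Nokogiri and open-uri (the URL is a placeholder):

require 'nokogiri'
require 'open-uri'

# Download and parse the page (example.com is a placeholder URL;
# URI.open is the modern open-uri entry point)
doc = Nokogiri::HTML(URI.open("https://example.com"))

# Every anchor tag's href attribute is a link on the page
doc.css("a").each do |link|
  puts link["href"]
end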

How to send a request to a website in Ruby?

In order to send a request to any website or web app, you need an HTTP client. Let's take a look at our three main options: net/http, open-uri, and HTTParty. You can use whichever of the clients below you like the most, and it will work with step 2. Ruby's standard library comes with an HTTP client of its own, namely net/http.
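
For instance, a bare-bones GET request with net/http might look like this (the URL is a placeholder):

require 'net/http'

# Build a URI and issue a simple GET request (example.com is a placeholder)
uri = URI("https://example.com")
response = Net::HTTP.get_response(uri)

puts response.code  # HTTP status code, e.g. "200"
puts response.body  # the raw HTML, ready for a parser like Nokogiri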

Can I use Ruby for web scraping without getting blocked?

While this whole article tackles the main aspects of web scraping with Ruby, it does not talk about web scraping without getting blocked. If you want to learn how to do that, we have written a complete guide, and if you don't want to deal with it yourself, you can always use our web scraping API.


3 Answers

Unfortunately, stackoverflow claims to be XML but actually isn't. Hpricot, however, can parse this tag soup into a tree of elements for you.

require 'hpricot'
require 'open-uri'

# Download the user page and parse the tag soup into an element tree
doc = Hpricot(open("http://stackoverflow.com/users/19990/armin-ronacher"))

# Find the reputation cell, strip everything but digits, and convert to an integer
reputation = (doc / "td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i

And so forth.

answered Oct 21 '22 by Armin Ronacher


Hpricot is no longer maintained!

Use Nokogiri now.
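
A rough Nokogiri equivalent of the accepted answer's snippet (the CSS selector reflects Stack Overflow's 2008 markup and may no longer match anything):

require 'nokogiri'
require 'open-uri'

# Download and parse the user page
doc = Nokogiri::HTML(URI.open("http://stackoverflow.com/users/19990/armin-ronacher"))

# Same selector as the Hpricot version: strip non-digits, convert to an integer
reputation = doc.css("td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i
puts reputation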

answered Oct 21 '22 by AnkitG


Try Hpricot; it's, well... awesome.

I've used it several times for screen scraping.

answered Oct 21 '22 by ethyreal