I have been looking at XML and HTML libraries on RubyForge for a simple way to pull data out of a web page. For example, if I want to parse a user page on Stack Overflow, how can I get the data into a usable format?
Say I want to parse my own user page for my current reputation score and badge listing. I tried to convert the source retrieved from my user page into XML, but the conversion failed due to a missing div. I know I could do a string compare and find the text I'm looking for, but there has to be a much better way of doing this.
I want to incorporate this into a simple script that spits out my user data at the command line, and possibly expand it into a GUI application.
Web scraping with Ruby is all about finding and choosing the right gem. Many gems cover every step of the web scraping process, from sending HTTP requests to creating CSV files. Gems such as HTTParty and Nokogiri are perfectly suitable for static web pages with constant URLs.
Web scraping is used to extract useful data from websites, and that extracted data can feed many applications. It is mainly useful for gathering data when there is no other means to collect it, e.g. an API or feeds. Creating a web scraping application using Ruby on Rails is pretty easy.
This task can be a bit difficult if you don’t have the right tools. But today you’re in luck, because Ruby has a wonderful library called Nokogiri, which makes HTML parsing a walk in the park. Let’s see some examples.
parsed_data = Nokogiri::HTML.parse(html)
puts parsed_data.title
# => "test"

If you want to parse data directly from a URL instead of an HTML string, combine Nokogiri with an HTTP client; that will download the HTML and get you the title. Getting the title is nice, but you probably want to see more advanced examples, right? Let’s take a look at how to extract links from a website.
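Here is a minimal sketch of both ideas, fetching a page over HTTP and pulling out every link. The URL is only a placeholder, and URI.open assumes a modern Ruby with open-uri available:

require 'nokogiri'
require 'open-uri'

# Download the page and parse it into a Nokogiri document
doc = Nokogiri::HTML(URI.open("https://example.com"))

puts doc.title

# Select every <a> element and print its text and href attribute
doc.css("a").each do |link|
  puts "#{link.text.strip} -> #{link['href']}"
end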
In order to send a request to any website or web app, you need an HTTP client. Let's take a look at the three main options: net/http, open-uri and HTTParty. You can use whichever of these clients you like the most, and it will work with the parsing step that follows. Ruby's standard library comes with an HTTP client of its own: net/http.
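As a sketch, here are all three side by side; each fetches the page HTML as a string, and the user URL is just an example:

# net/http -- ships with Ruby, nothing to install
require 'net/http'
html = Net::HTTP.get(URI("https://stackoverflow.com/users/19990/armin-ronacher"))

# open-uri -- also in the standard library, a thin wrapper around net/http
require 'open-uri'
html = URI.open("https://stackoverflow.com/users/19990/armin-ronacher").read

# HTTParty -- a third-party gem (gem install httparty)
require 'httparty'
html = HTTParty.get("https://stackoverflow.com/users/19990/armin-ronacher").body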
While this article tackles the main aspects of web scraping with Ruby, it does not cover scraping without getting blocked. If you want to learn how to do that, we have written a complete guide, and if you don't want to take care of it yourself, you can always use a web scraping API.
Unfortunately, Stack Overflow claims to be XML but actually isn't. Hpricot, however, can parse this tag soup into a tree of elements for you.
require 'hpricot'
require 'open-uri'

# Fetch the user page and let Hpricot build an element tree from the tag soup
doc = Hpricot(open("http://stackoverflow.com/users/19990/armin-ronacher"))

# Select the reputation cell, then strip everything that isn't a digit
reputation = (doc / "td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i
And so forth.
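The badge listing works the same way; here is a rough sketch that continues from the doc above, with the caveat that the CSS selector is a guess at the page's markup, not something confirmed:

# Hypothetical selector -- check the page source for the real class names
badges = doc / "div.badges span.badgecount"
badges.each do |badge|
  puts badge.inner_text.strip
end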
Hpricot is over! Use Nokogiri now.
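Nokogiri is almost a drop-in replacement here; a minimal sketch of the same scrape, keeping the original (now dated) URL and selector:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("http://stackoverflow.com/users/19990/armin-ronacher"))

# Same CSS selector as the Hpricot version, same digit-stripping trick
reputation = doc.css("td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i
puts reputation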
Try Hpricot, it's... well, awesome. I've used it several times for screen scraping.