I have been looking at XML and HTML libraries on RubyForge for a simple way to pull data out of a web page. For example, if I want to parse a user page on Stack Overflow, how can I get the data into a usable format?
Say I want to parse my own user page for my current reputation score and badge listing. I tried to convert the source retrieved from my user page into XML, but the conversion failed due to a missing div. I know I could do a string compare and find the text I'm looking for, but there has to be a much better way of doing this.
I want to incorporate this into a simple script that spits out my user data at the command line, and possibly expand it into a GUI application.
Web scraping with Ruby is all about finding and choosing the right gem. Many gems cover every step of the web scraping process, from sending HTTP requests to creating CSV files. Gems such as HTTParty and Nokogiri are perfectly suitable for static web pages with constant URLs.
Web scraping is used to extract useful data from websites, and that extracted data can feed many applications. It is mainly useful for gathering data when there is no other means to collect it, e.g. an API or feeds. Creating a web scraping application using Ruby on Rails is pretty easy.
This task can be a bit difficult if you don’t have the right tools. But today you’re in luck, because Ruby has a wonderful library called Nokogiri, which makes HTML parsing a walk in the park. Let’s see some examples.
parsed_data = Nokogiri::HTML.parse(html)
puts parsed_data.title
# => "test"

If you want to parse data directly from a URL instead of an HTML string, combine Nokogiri with an HTTP client; that will download the HTML and get you the title. Getting the title is nice, but you probably want to see more advanced examples, right? Let’s take a look at how to extract links from a website.
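Here is a minimal sketch of both ideas, fetching a page over HTTP and pulling out every link. The URL is only a placeholder, and URI.open assumes a modern Ruby with open-uri available:

require 'nokogiri'
require 'open-uri'

# Download the page and parse it into a Nokogiri document
doc = Nokogiri::HTML(URI.open("https://example.com"))

puts doc.title

# Select every <a> element and print its text and href attribute
doc.css("a").each do |link|
  puts "#{link.text.strip} -> #{link['href']}"
end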
In order to send a request to any website or web app, you need an HTTP client. Let's take a look at the three main options: net/http, open-uri and HTTParty. You can use whichever of these clients you like the most, and it will work with the parsing step that follows. Ruby's standard library comes with an HTTP client of its own: net/http.
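As a sketch, here are all three side by side; each fetches the page HTML as a string, and the user URL is just an example:

# net/http -- ships with Ruby, nothing to install
require 'net/http'
html = Net::HTTP.get(URI("https://stackoverflow.com/users/19990/armin-ronacher"))

# open-uri -- also in the standard library, a thin wrapper around net/http
require 'open-uri'
html = URI.open("https://stackoverflow.com/users/19990/armin-ronacher").read

# HTTParty -- a third-party gem (gem install httparty)
require 'httparty'
html = HTTParty.get("https://stackoverflow.com/users/19990/armin-ronacher").body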
While this article tackles the main aspects of web scraping with Ruby, it does not cover scraping without getting blocked. If you want to learn how to do that, we have written a complete guide, and if you don't want to take care of it yourself, you can always use a web scraping API.
Unfortunately, Stack Overflow claims to be XML but actually isn't. Hpricot, however, can parse this tag soup into a tree of elements for you.
require 'hpricot'
require 'open-uri'

# Fetch the user page and let Hpricot build an element tree from the tag soup
doc = Hpricot(open("http://stackoverflow.com/users/19990/armin-ronacher"))

# Select the reputation cell, then strip everything that isn't a digit
reputation = (doc / "td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i
And so forth.
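The badge listing works the same way; here is a rough sketch that continues from the doc above, with the caveat that the CSS selector is a guess at the page's markup, not something confirmed:

# Hypothetical selector -- check the page source for the real class names
badges = doc / "div.badges span.badgecount"
badges.each do |badge|
  puts badge.inner_text.strip
end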
Hpricot is over! Use Nokogiri now.
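Nokogiri is almost a drop-in replacement here; a minimal sketch of the same scrape, keeping the original (now dated) URL and selector:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("http://stackoverflow.com/users/19990/armin-ronacher"))

# Same CSS selector as the Hpricot version, same digit-stripping trick
reputation = doc.css("td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i
puts reputation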
Try Hpricot, it's... well, awesome. I've used it several times for screen scraping.