
How do I use Nokogiri and Ruby to scrape values from HTML with nested tables?

I am trying to extract the name, ID, Phone, Email, Gender, Ethnicity, DOB, Class, Major, School and GPA from a page I am parsing with Nokogiri.

I tried several different XPath expressions, but everything I try grabs much more than I want:

<span class="subTitle"><b>Recruit Profile</b></span>
<br><table border="0" width="100%"><tr>
<td>
      <table bgcolor="#afafaf" border="0" cellpadding="0" width="100%">
<tr>
<td>
      <table bgcolor="#cccccc" border="0" cellpadding="2" cellspacing="2" width="100%">
<tr>
<td bgcolor="#dddddd"><b>Name</b></td>
          <td bgcolor="#dddddd">Some Person</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>EDU ID</b></td>
          <td bgcolor="#dddddd">A12345678</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Phone</b></td>
          <td bgcolor="#dddddd">123-456-7890</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Address</b></td>
          <td bgcolor="#dddddd">1234 Somewhere Dr.<br>City ST, 12345</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Email</b></td>
          <td bgcolor="#dddddd">[email protected]</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Gender</b></td>
          <td bgcolor="#dddddd">Female</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Ethnicity</b></td>
          <td bgcolor="#dddddd">Unknown</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Date of Birth</b></td>
          <td bgcolor="#dddddd">Jan 1st, 1901</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Class</b></td>
          <td bgcolor="#dddddd">Sophomore</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Major</b></td>
          <td bgcolor="#dddddd">Biology</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>School</b></td>
          <td bgcolor="#dddddd">University of Somewhere</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>GPA</b></td>
          <td bgcolor="#dddddd">0.00</td>
        </tr>
<tr>
<td bgcolor="#dddddd" valign="top"><b>Availability</b></td>
          <td bgcolor="#dddddd">
      <table border="0" cellspacing="0" cellpadding="0">
<tr>

Sean asked May 13 '11 20:05



1 Answer

I assume that there will be many "Recruit Profile" spans that are followed by tables that wrap up all the details. The following method takes your entire HTML page, finds just those spans, and for each of them it finds the following table and then finds the fields you want anywhere below that table:

require 'nokogiri'

# Pass in the array of labels you want to extract (the defaults are shown below).
# Returns an array of hashes, one per recruit, mapping each label to its value.
def recruits_details(html, fields = ['Name', 'EDU ID', 'Phone', 'Email', 'Gender'])
  doc = Nokogiri::HTML(html)
  # Each profile starts with a <span> whose <b> child reads "Recruit Profile"
  recruit_labels = doc.xpath('//span[b[text()="Recruit Profile"]]')
  recruit_labels.map do |recruit_label|
    # The details live in the first table that follows the span
    recruit_table = recruit_label.at_xpath('following-sibling::table')
    Hash[ fields.map do |field_label|
      # Find the label cell, then read the text of the cell right after it
      label_td = recruit_table.at_xpath(".//td[b[text()='#{field_label}']]")
      [field_label, label_td.at_xpath('following-sibling::td/text()').text ]
    end ]
  end
end

require 'pp'
pp recruits_details(html_string)  # html_string: the page's HTML source as a String
#=> [{"Name"=>"Some Person",
#=>   "EDU ID"=>"A12345678",
#=>   "Phone"=>"123-456-7890",
#=>   "Email"=>"[email protected]",
#=>   "Gender"=>"Female"}]

An XPath expression like .//foo[bar[text()="jim"]] means:

  • Find a 'foo' element anywhere under the current node
  • ...but only if it has a 'bar' element as a child
  • ...but only if that 'bar' element has the text "jim" as its content

An XPath expression like following-sibling::... means "find any elements that are siblings after the current node and that match the expression ...".

The XPath expression .../text() selects Text nodes rather than elements; Nokogiri's text method is then used to extract the value (the actual string) of such a node.

Nokogiri's xpath method returns a NodeSet (an array-like collection) of all nodes matching the expression, while the at_xpath method returns only the first matching node, or nil if nothing matches.

Phrogz answered Nov 15 '22 08:11