How do I parse a plain HTML table with Nokogiri?

I'd like to parse an HTML page with Nokogiri. Part of the page contains a table that has no ID of its own. Is it possible to extract something like:

Today,3,455,34
Today,1,1300,3664
Today,10,100000,3444
Yesterday,3454,5656,3
Yesterday,3545,1000,10
Yesterday,3411,36223,15

From this HTML:

<div id="__DailyStat__">
  <table>
    <tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
    <tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
    <tr class="blr">
      <td>3</td>
      <td>455</td>
      <td>34</td>
      <td class="r">3454</td>
      <td class="r">5656</td>
      <td class="r">3</td>
    </tr>

    <tr class="bla">
      <td>1</td>
      <td>1300</td>
      <td>3664</td>
      <td class="r">3545</td>
      <td class="r">1000</td>
      <td class="r">10</td>
    </tr>

    <tr class="blr">
      <td>10</td>
      <td>100000</td>
      <td>3444</td>
      <td class="r">3411</td>
      <td class="r">36223</td>
      <td class="r">15</td>
    </tr>
  </table>
</div>
Asked by JraNil, Jun 04 '11 15:06


1 Answer

As a quick and dirty first pass I'd do:

html = <<EOT
<div id="__DailyStat__">
  <table>
    <tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
    <tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
    <tr class="blr">
      <td>3</td>
      <td>455</td>
      <td>34</td>
      <td class="r">3454</td>
      <td class="r">5656</td>
      <td class="r">3</td>
    </tr>

    <tr class="bla">
      <td>1</td>
      <td>1300</td>
      <td>3664</td>
      <td class="r">3545</td>
      <td class="r">1000</td>
      <td class="r">10</td>
    </tr>

    <tr class="blr">
      <td>10</td>
      <td>100000</td>
      <td>3444</td>
      <td class="r">3411</td>
      <td class="r">36223</td>
      <td class="r">15</td>
    </tr>
  </table>
</div>
EOT

#    Today              Yesterday
#    Qnty Size   Length Length Size  Qnty
#    3    455    34     3454   5656  3
#    1    1300   3664   3545   1000  10
#    10   100000 3444   3411   36223 15


require 'nokogiri'

doc = Nokogiri::HTML(html)

Use CSS to find the start of the table, and define some places to hold the data we're capturing:

table = doc.at('div#__DailyStat__ table')

today_data     = []
yesterday_data = []

Loop over the rows in the table, rejecting the headers:

table.search('tr').each do |tr|

  next if (tr['class'] == 'blh')

Initialize arrays to capture the pertinent data from each row, then selectively push each cell's value into the appropriate array:

  today_td_data     = [ 'Today'     ]
  yesterday_td_data = [ 'Yesterday' ]

  tr.search('td').each do |td|
    if (td['class'] == 'r')
      yesterday_td_data << td.text.to_i
    else
      today_td_data << td.text.to_i
    end
  end

  today_data     << today_td_data
  yesterday_data << yesterday_td_data

end

And output the data:

puts today_data.map{ |a| a.join(',') }
puts yesterday_data.map{ |a| a.join(',') }

> Today,3,455,34
> Today,1,1300,3664
> Today,10,100000,3444
> Yesterday,3454,5656,3
> Yesterday,3545,1000,10
> Yesterday,3411,36223,15

Just to help you visualize what's going on: when the "tr" loop exits, today_data and yesterday_data are arrays of arrays. For example, today_data looks like:

[["Today", 3, 455, 34], ["Today", 1, 1300, 3664], ["Today", 10, 100000, 3444]]

Alternatively, instead of looping over the "td" tags and checking each tag's class, I could have grabbed the text of the "tr", used scan to extract the numbers, and sliced the resulting array into "today" and "yesterday" arrays:

  tr_data = tr.text.scan(/\d+/).map{ |i| i.to_i }

  today_td_data     = [ 'Today',     *tr_data[0, 3] ]
  yesterday_td_data = [ 'Yesterday', *tr_data[3, 3] ]
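
To see the scan-and-slice step in isolation, the sketch below runs it on a plain string. The method name `split_row_text` is hypothetical, and `row_text` is only a stand-in for what `tr.text` might return for the first data row (the exact whitespace is an assumption).

```ruby
# Hypothetical helper: pull every run of digits out of one row's text,
# then slice the first three values into "Today" and the next three
# into "Yesterday".
def split_row_text(row_text)
  numbers = row_text.scan(/\d+/).map(&:to_i)
  [['Today', *numbers[0, 3]], ['Yesterday', *numbers[3, 3]]]
end

# Stand-in for tr.text on the first data row.
row_text = "\n      3\n      455\n      34\n      3454\n      5656\n      3\n    "

today_row, yesterday_row = split_row_text(row_text)
puts today_row.join(',')     # prints "Today,3,455,34"
puts yesterday_row.join(',') # prints "Yesterday,3454,5656,3"

# Caveat: /\d+/ matches digits only, so a formatted cell gets split.
p '1,000'.scan(/\d+/)        # prints ["1", "000"]
```

This approach relies on every cell holding a plain non-negative integer and on the column order staying fixed. Thousands separators, decimals, or minus signs would need a different pattern.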

In real-world development, like at work, I'd use that instead of what I first wrote because it's more succinct.

Notice that I didn't use XPath. It's entirely possible to accomplish this with XPath in Nokogiri, but for simplicity I prefer CSS selectors. XPath would have allowed accessing the contents of individual "td" tags, but it would also begin to look like line noise, which is something we want to avoid when writing code because it hurts maintainability. I could also have used CSS to drill down to the correct "td" tags with something like 'tr td.r', but I don't think it would improve the code; it would just be an alternate way of doing it.

Answered by the Tin Man, Sep 21 '22 12:09