How do I parse a plain HTML table with Nokogiri?

Tags:

html-parsing

ruby

xpath

nokogiri

I'd like to parse an HTML page with Nokogiri. Part of the page contains a table that has no ID of its own. Is it possible to extract something like:

Today,3,455,34
Today,1,1300,3664
Today,10,100000,3444
Yesterday,3454,5656,3
Yesterday,3545,1000,10
Yesterday,3411,36223,15

From this HTML:

<div id="__DailyStat__">
  <table>
    <tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
    <tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
    <tr class="blr">
      <td>3</td>
      <td>455</td>
      <td>34</td>
      <td class="r">3454</td>
      <td class="r">5656</td>
      <td class="r">3</td>
    </tr>

    <tr class="bla">
      <td>1</td>
      <td>1300</td>
      <td>3664</td>
      <td class="r">3545</td>
      <td class="r">1000</td>
      <td class="r">10</td>
    </tr>

    <tr class="blr">
      <td>10</td>
      <td>100000</td>
      <td>3444</td>
      <td class="r">3411</td>
      <td class="r">36223</td>
      <td class="r">15</td>
    </tr>
  </table>
</div>

878

asked Jun 04 '11 15:06

JraNil

1 Answer

As a quick and dirty first pass I'd do:

html = <<EOT
<div id="__DailyStat__">
  <table>
    <tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
    <tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
    <tr class="blr">
      <td>3</td>
      <td>455</td>
      <td>34</td>
      <td class="r">3454</td>
      <td class="r">5656</td>
      <td class="r">3</td>
    </tr>

    <tr class="bla">
      <td>1</td>
      <td>1300</td>
      <td>3664</td>
      <td class="r">3545</td>
      <td class="r">1000</td>
      <td class="r">10</td>
    </tr>

    <tr class="blr">
      <td>10</td>
      <td>100000</td>
      <td>3444</td>
      <td class="r">3411</td>
      <td class="r">36223</td>
      <td class="r">15</td>
    </tr>
  </table>
</div>
EOT

#    Today              Yesterday
#    Qnty Size   Length Length Size  Qnty
#    3    455    34     3454   5656  3
#    1    1300   3664   3545   1000  10
#    10   100000 3444   3411   36223 15


require 'nokogiri'

doc = Nokogiri::HTML(html)

Use CSS to find the start of the table, and define some places to hold the data we're capturing:

table = doc.at('div#__DailyStat__ table')

today_data     = []
yesterday_data = []

Loop over the rows in the table, skipping the header rows:

table.search('tr').each do |tr|

  next if (tr['class'] == 'blh')

Initialize arrays to capture the pertinent data from each row, then push each value into the appropriate array:

  today_td_data     = [ 'Today'     ]
  yesterday_td_data = [ 'Yesterday' ]

  tr.search('td').each do |td|
    if (td['class'] == 'r')
      yesterday_td_data << td.text.to_i
    else
      today_td_data << td.text.to_i
    end
  end

  today_data     << today_td_data
  yesterday_data << yesterday_td_data

end

And output the data:

puts today_data.map{ |a| a.join(',') }
puts yesterday_data.map{ |a| a.join(',') }

> Today,3,455,34
> Today,1,1300,3664
> Today,10,100000,3444
> Yesterday,3454,5656,3
> Yesterday,3545,1000,10
> Yesterday,3411,36223,15

Just to help you visualize what's going on, at the exit from the "tr" loop, the today_data and yesterday_data arrays are arrays-of-arrays looking like:

[["Today", 3, 455, 34], ["Today", 1, 1300, 3664], ["Today", 10, 100000, 3444]]
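
If the column names matter later on, one option is to pair each row with the header names. The sketch below assumes the column order shown in the table's second header row; note that the "Yesterday" half lists its columns in reverse (Length, Size, Qnty). The `label_row` helper and the "Day" key are hypothetical names chosen for this example.

```ruby
# Turns ["Today", 3, 455, 34] into
# { "Day" => "Today", "Qnty" => 3, "Size" => 455, "Length" => 34 }.
# columns must list the header names in the same order as the values.
def label_row(row, columns)
  day, *values = row
  { 'Day' => day }.merge(columns.zip(values).to_h)
end

today_data     = [['Today', 3, 455, 34], ['Today', 1, 1300, 3664]]
yesterday_data = [['Yesterday', 3454, 5656, 3]]

# The two halves of the table use different column orders.
p today_data.map     { |row| label_row(row, %w[Qnty Size Length]) }
p yesterday_data.map { |row| label_row(row, %w[Length Size Qnty]) }
```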

Alternatively, instead of looping over the "td" tags and checking each tag's class, I could have grabbed the text of the "tr", used scan to pull out the numbers, and sliced the resulting array into "today" and "yesterday" arrays:

  tr_data = tr.text.scan(/\d+/).map{ |i| i.to_i }

  today_td_data     = [ 'Today',     *tr_data[0, 3] ]
  yesterday_td_data = [ 'Yesterday', *tr_data[3, 3] ]
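
To see what that does in isolation, here is a small sketch that runs the same scan-and-slice on a plain string. The `row_text` value stands in for what `tr.text` returns for a data row (the cell values separated by whitespace), and `split_row` is a hypothetical helper name. It assumes every row has exactly three values per side and that the values are plain non-negative integers.

```ruby
# Splits the text of one data row into "today" and "yesterday" halves.
def split_row(row_text, per_side = 3)
  numbers = row_text.scan(/\d+/).map(&:to_i)
  [['Today', *numbers[0, per_side]], ['Yesterday', *numbers[per_side, per_side]]]
end

row_text = "\n      3\n      455\n      34\n      3454\n      5656\n      3\n    "
today, yesterday = split_row(row_text)
p today      # => ["Today", 3, 455, 34]
p yesterday  # => ["Yesterday", 3454, 5656, 3]

# Caveat: /\d+/ breaks on any non-digit, so a formatted value such as
# "1,300" would be read as two separate numbers.
p '1,300'.scan(/\d+/)  # => ["1", "300"]
```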

In real-world development, like at work, I'd use that instead of what I first wrote because it's more succinct. It does rely on every row having exactly three values per side.

And notice that I didn't use XPath. It's very doable to accomplish this with XPath in Nokogiri, but for simplicity I prefer CSS selectors. XPath would have allowed accessing individual "td" tag contents, but it also starts to look like line noise, which is something we want to avoid when writing code because it hurts maintenance. I could also have used CSS to drill down to the correct "td" tags, like 'tr td.r', but I don't think it would improve the code; it would just be an alternate way of doing it.

124

answered Sep 21 '22 12:09

the Tin Man
