How do I parse an HTML table with Nokogiri?

I installed Ruby and Mechanize. It seems to me that Nokogiri can do what I want, but I do not know how to do it.

Take this table, for example. It is part of the HTML of a vBulletin forum site. I kept the HTML structure but deleted some text and tag attributes. I want to get some details per thread: Title, Author, Date, Time, Replies, and Views.

Please note that there are a few tables in the HTML document. I am after one particular table, the one containing <tbody id="threadbits_forum_251">. The id will always be the same (I hope). Can I use the tbody and its id in the code?

<table >
  <tbody>
    <tr>  <!-- table header --> </tr>
  </tbody>
  <!-- show threads -->
  <tbody id="threadbits_forum_251">
    <tr>
      <td></td>
      <td></td>
      <td>
        <div>
          <a href="showthread.php?t=230708" >Vb4 Gold Released</a>
        </div>
        <div>
          <span><a>Paul M</a></span>
        </div>
      </td>
      <td>
          06 Jan 2010 <span class="time">23:35</span><br />
          by <a href="member.php?find=lastposter&amp;t=230708">shane943</a>
        </div>
      </td>
      <td><a href="#">24</a></td>
      <td>1,320</td>
    </tr>
  </tbody>
</table>

asked Jan 14 '10 03:01

Radek

1 Answer

#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
  (The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details

# => [{:time=>"23:35",
# =>   :title=>"Vb4 Gold Released",
# =>   :number=>"24",
# =>   :date=>"06 Jan 2010",
# =>   :views=>"1,320",
# =>   :name=>"Paul M"}]

answered Oct 03 '22 20:10

Wayne Conrad

Tags: html, html-table, ruby, nokogiri, mechanize
