
How do I parse an HTML table with Nokogiri?

I installed Ruby and Mechanize. It seems that what I want to do is possible with Nokogiri, but I do not know how to do it.

Take the table below. It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but deleted some text and tag attributes. I want to get some details per thread: Title, Author, Date, Time, Replies, and Views.

Please note that there are a few tables in the HTML document. I am after one particular table, the one containing <tbody id="threadbits_forum_251">. The id will always be the same (I hope). Can I use the tbody and its id in the code?

<table >
  <tbody>
    <tr>  <!-- table header --> </tr>
  </tbody>
  <!-- show threads -->
  <tbody id="threadbits_forum_251">
    <tr>
      <td></td>
      <td></td>
      <td>
        <div>
          <a href="showthread.php?t=230708" >Vb4 Gold Released</a>
        </div>
        <div>
          <span><a>Paul M</a></span>
        </div>
      </td>
      <td>
          06 Jan 2010 <span class="time">23:35</span><br />
          by <a href="member.php?find=lastposter&amp;t=230708">shane943</a>
        </div>
      </td>
      <td><a href="#">24</a></td>
      <td>1,320</td>
    </tr>
  </tbody>
</table>
Radek asked Jan 14 '10 03:01




1 Answer

#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
  (The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details

# => [{:time=>"23:35",
# =>   :title=>"Vb4 Gold Released",
# =>   :number=>"24",
# =>   :date=>"06 Jan 2010",
# =>   :views=>"1,320",
# =>   :name=>"Paul M"}]
Wayne Conrad answered Oct 03 '22 20:10