I installed Ruby and Mechanize. It seems to me that it is possible to do what I want with Nokogiri, but I do not know how to do it.
What about this table? It is just part of the HTML of a vBulletin forum site. I tried to keep the HTML structure but deleted some text and tag attributes. I want to get some details per thread, such as Title, Author, Date, Time, Replies, and Views.
Please note that there are a few tables in the HTML document. I am after one particular table with its tbody, <tbody id="threadbits_forum_251">. The name will always be the same (I hope). Can I use the tbody and the name in the code?
<table >
  <tbody>
    <tr> <!-- table header --> </tr>
  </tbody>
  <!-- show threads -->
  <tbody id="threadbits_forum_251">
    <tr>
      <td></td>
      <td></td>
      <td>
        <div>
          <a href="showthread.php?t=230708" >Vb4 Gold Released</a>
        </div>
        <div>
          <span><a>Paul M</a></span>
        </div>
      </td>
      <td>
        06 Jan 2010 <span class="time">23:35</span><br />
        by <a href="member.php?find=lastposter&t=230708">shane943</a>
        </div>
      </td>
      <td><a href="#">24</a></td>
      <td>1,320</td>
    </tr>
  </tbody>
</table>
Nokogiri (http://nokogiri.org/) is the most popular open source Ruby gem for HTML and XML parsing. It parses HTML and XML documents into node sets and allows searching them with CSS3 and XPath selectors. It can also be used to construct new HTML and XML objects.
To parse XML documents, I recommend the nokogiri gem.
Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writing, modifying, and querying documents. It is fast and standards-compliant by relying on native parsers like libxml2 (CRuby) and xerces (JRuby).
#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
  (The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)
rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
  detail = {}
  [
    [:title, 'td[3]/div[1]/a/text()'],
    [:name, 'td[3]/div[2]/span/a/text()'],
    [:date, 'td[4]/text()'],
    [:time, 'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views, 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp details
# => [{:time=>"23:35",
# =>   :title=>"Vb4 Gold Released",
# =>   :number=>"24",
# =>   :date=>"06 Jan 2010",
# =>   :views=>"1,320",
# =>   :name=>"Paul M"}]
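The script above returns every field as a string. If numbers and a real timestamp are more useful, a follow-up step along these lines might work. This is only a sketch: the to_count helper and the keys of the typed hash are hypothetical names, and it assumes the forum always prints dates as "06 Jan 2010" with a 24-hour time, which may not hold if the forum shows relative dates.

```ruby
require 'date'

# One row in the shape the script above produces (values copied from its output).
detail = {
  :title  => "Vb4 Gold Released",
  :name   => "Paul M",
  :date   => "06 Jan 2010",
  :time   => "23:35",
  :number => "24",
  :views  => "1,320",
}

# Hypothetical helper: drop thousands separators, then convert to an Integer.
def to_count(text)
  text.delete(',').to_i
end

typed = {
  :title     => detail[:title],
  :author    => detail[:name],
  # Assumes "DD Mon YYYY" dates and "HH:MM" times, as in the sample row.
  :posted_at => DateTime.strptime("#{detail[:date]} #{detail[:time]}", '%d %b %Y %H:%M'),
  :replies   => to_count(detail[:number]),
  :views     => to_count(detail[:views]),
}

puts typed[:views]
puts typed[:posted_at].strftime('%Y-%m-%d %H:%M')
```

DateTime.strptime raises an ArgumentError on text that does not match the format, so rows with a different date layout would need their own handling.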