How do I use XPath in Nokogiri?

I have not found any documentation or tutorials for that. Does anything like that exist?


doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr') 

The code above will get me any table, anywhere, that has a tbody child with the attribute id equal to "threadbits_forum_251". But why does it start with a double slash (//)? Why is there /tr at the end? See "Ruby Nokogiri Parsing HTML table II" for more details.


Can anybody tell me how to extract href, id, alt, src, etc., using Nokogiri?

'td[3]/div[1]/a/text()' <--- extracts text

How can I extract other things?

Radek asked Jan 17 '10 11:01




2 Answers

It seems you need to read an XPath tutorial.

Your //table/tbody[@id="threadbits_forum_251"]/tr expression means:

  • // - Anywhere in your XML document
  • table/tbody - take a table element with a tbody child
  • [@id="threadbits_forum_251"] - where the tbody's id attribute equals "threadbits_forum_251"
  • tr - and take its tr elements

So, basically, you need to know:

  • attribute names begin with @
  • conditions go inside [] brackets

If I understood that API correctly, you can go with doc.at_xpath("td[3]/div[1]/a")["href"] (xpath returns a set of nodes, so pick a single node before reading an attribute), or td[3]/div[1]/a/@href if there is just one <a> element.

Rubens Farias answered Oct 12 '22 09:10


Your XPath is correct and you seem to have (almost) answered the first part of your own question:

doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')

"the code above will get me any table's tr, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251"


// means the following element can appear anywhere in the document.

/tr at the end means: get the tr child nodes of the matching element.

You don't need to extract each attribute one by one. Just get the entire node containing all four attributes in Nokogiri, and get the attributes using:

theNode['href']
theNode['src']

Where theNode is your Nokogiri Node object.


Edit:

Sorry I haven't used these libraries, but I think the XPath evaluation and parsing is being done by Mechanize. So here's how you would get the entire element and its attributes in one go.

doc.xpath("td[3]/div[1]/a").each do |anchor|
  puts anchor['href']
  puts anchor['src']
  ...
end
Anurag answered Oct 12 '22 09:10