
Ruby Mechanize get elements with specified text

I am trying to parse the contents of a website using Mechanize and I am stuck at one point. The content that I want to parse is inside li tags, which are not always in the same order.

Let's suppose that we have the following, where the order of the li tags is not always the same and sometimes some of them may not be there at all.

<div class="details">
  <ul>
    <li><span>title 1</span> ": here are the details"</li>
    <li><span>title 2</span> ": here are the details"</li>
    <li><span>title 3</span> ": here are the details"</li>
    <li><span>title 4</span> ": here are the details"</li>
  </ul>
</div>

What I want is to get only the details of the li whose span text is, for example, title 3. What I have done is the following, which gives me the details from the first li:

puts page.at('.details').at('span', :text => "title 3").at("+ *").text

Is there a way to do what I want using mechanize or should I also use other means?

George Karanikas asked Sep 27 '13 10:09



3 Answers

page.search(".details").at("span:contains('title 3')").parent.text

Explanation: at accepts either a CSS or an XPath selector. To keep this readable and close to your approach, this answer uses a CSS selector. Standard CSS cannot select elements by their text, but Nokogiri supports the jQuery-style :contains pseudo-class, so it can be used here.

The selector returns the span element, so to get the enclosing li, call parent on it and then read its text.

Rodri_gore answered Sep 22 '22 20:09



Since you're looking to do this using Mechanize (and I see one of the comments recommends using Nokogiri instead), you should be aware that Mechanize is built on Nokogiri, so you can use all of Nokogiri's functionality through Mechanize.

As the docs at http://mechanize.rubyforge.org/Mechanize.html show, the parser Mechanize uses is a Nokogiri class:

Mechanize.html_parser = Nokogiri::XML

So you can accomplish this using XPath and the Mechanize page.search method:

page.search("//div[@class='details']/ul/li[span='title 3']").text

This should give you the text of the li element that you're looking for. (I have not verified the .text call, but the XPath itself works.)

You can test the XPath here: http://www.xpathtester.com/saved/51c5142c-dbef-4206-8fbc-1ba567373fb2

Jeff LaJoie answered Sep 21 '22 20:09



A cleaner CSS approach:

page.at('.details li:has(span[text()="title 3"])')
pguardiario answered Sep 19 '22 20:09
