I am trying to parse the contents of a website using Mechanize and I am stuck. The content that I want to parse is inside an li tag, and the tags are not always in the same order.
Let's suppose that we have the following, where the order of the li tags is not always the same and sometimes some of them may be missing altogether.
<div class="details">
<ul>
<li><span>title 1</span> ": here are the details"</li>
<li><span>title 2</span> ": here are the details"</li>
<li><span>title 3</span> ": here are the details"</li>
<li><span>title 4</span> ": here are the details"</li>
</ul>
</div>
What I want is to get only the details from the li whose span text is, for example, title 3. What I have done is the following, which gives me the details from the first li:
puts page.at('.details').at('span', :text => "title 3").at("+ *").text
Is there a way to do what I want using mechanize or should I also use other means?
page.search(".details").at("span:contains('title 3')").parent.text
Explanation: at accepts either a CSS or an XPath selector. To keep this readable and close to your approach, this answer uses a CSS selector. Standard CSS cannot select elements by their text, but Nokogiri supports the jQuery-style :contains pseudo-class, so it can be used here.
The selector returns the span element, so to get its parent li element, call parent and then read the text.
Since you're looking to do this using Mechanize (and I see one of the comments recommends using Nokogiri instead), you should be aware that Mechanize is built on Nokogiri, so you can use any Nokogiri functionality through Mechanize.
The docs at http://mechanize.rubyforge.org/Mechanize.html show that the parser is a Nokogiri one:
Mechanize.html_parser = Nokogiri::XML
So you can accomplish this using XPath and the mechanize page.search method.
page.search("//div[@class='details']/ul/li[span='title 3']").text
This should give you the text of the li element that you're looking for. (The .text call is unverified, but the XPath does work.)
You can test the XPath here: http://www.xpathtester.com/saved/51c5142c-dbef-4206-8fbc-1ba567373fb2
A cleaner CSS approach:
page.at('.details li:has(span[text()="title 3"])')