I have a webpage that I need to scrape some data from. The problem is, each page may or may not have specific data, or it may have extra data above or below it in the DOM, and there are no CSS ids to speak of.
Typically I could use either CSS ids or XPath to get to the node I'm looking for. I don't have that option in this case. What I'm trying to do is search for the "label" text, then grab the data in the next <td> node:
<tr>
<td><b>Name:</b></td>
<td>Joe Smith <small><a href="/Joe"><img src="/joe.png"></a></small></td>
</tr>
In the above HTML, I would search for:
doc.search("[text()*='Name:']")
to get the node just before the data I need, but I'm not sure how to navigate from there.
next_element is probably the method you're looking for.
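require 'nokogiri'
# Read the saved page and parse it as HTML.
data = File.read("html.htm")
doc = Nokogiri::HTML(data)
# Find the elements whose text contains the label and take the first match.
els = doc.search("[text()*='Name:']")
el = els.first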
puts "Found element:"
puts el
puts
puts "Parent element:"
puts el.parent
puts
puts "Parent's next_element():"
puts el.parent.next_element
# Output:
#
# Found element:
# <b>Name:</b>
#
# Parent element:
# <td>
# <b>Name:</b>
# </td>
#
# Parent's next_element():
# <td>Joe Smith <small><a href="/Joe"><img src="/joe.png"></a></small>
# </td>
Note that since the text is inside <b></b> tags, you have to go up a level (to the found element's parent <td>) before you can get to the next sibling. If the HTML structure isn't stable, you'd have to find the first ancestor that is a <td> and go from there.