
How do I search for "text" then traverse the DOM from the found node?

Tags: ruby, nokogiri

I have a webpage that I need to scrape some data from. The problem is that each page may or may not have the specific data, or it may have extra data above or below it in the DOM, and there are no CSS ids to speak of.

Typically I could use either CSS ids or XPath to get to the node I'm looking for. I don't have that option in this case. What I'm trying to do is search for the "label" text then grab the data in the next <TD> node:

<tr> 
    <td><b>Name:</b></td> 
    <td>Joe Smith <small><a href="/Joe"><img src="/joe.png"></a></small></td> 
</tr>

In the above HTML, I would search for:

doc.search("[text()*='Name:']")

to get the node just before the data I need, but I'm not sure how to navigate from there.

Nick Faraday asked Apr 25 '11 03:04

1 Answer

next_element is probably the method you're looking for.

require 'nokogiri'

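# Read the saved page (html.htm is assumed to hold the HTML from the question)
data = File.read "html.htm"

doc  = Nokogiri::HTML data

# Non-standard text() selector: matches elements whose text contains 'Name:'
els  = doc.search "[text()*='Name:']"
el   = els.first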

puts "Found element:"
puts el
puts

puts "Parent element:"
puts el.parent
puts

puts "Parent's next_element():"
puts el.parent.next_element

# Output:
#
# Found element:
# <b>Name:</b>
#
# Parent element:
# <td> 
#     <b>Name:</b>
# </td>
#
# Parent's next_element():
# <td>Joe Smith <small><a href="/Joe"><img src="/joe.png"></a></small>
# </td>

Note that since the text is inside <b></b> tags, you have to go up a level (to the found element's parent <td>) before you can get to the next sibling. If the HTML structure isn't stable, you'd have to find the first parent that is a <td> and go from there.

Michelle Tilley answered Oct 12 '22 22:10