How to navigate the DOM using Nokogiri

I'm trying to fill the variables parent_element_h1 and parent_element_h2. Can anyone help me use Nokogiri to get the information I need into those variables?

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='para-1'>A</p>
      <div class='block' id='X1'>
        <h1>Foo</h1>
        <p id='para-2'>B</p>
      </div>
      <p id='para-3'>C</p>
      <h2>Bar</h2>
      <p id='para-4'>D</p>
      <p id='para-5'>E</p>
      <div class='block' id='X2'>
        <p id='para-6'>F</p>
      </div>
    </body>
  </html>
HTML_END

parent = value.css('body').first

# start_here is given: a Nokogiri::XML::Element for the <div> with the id 'X2'
start_here = parent.at('div.block#X2')

# this should be a Nokogiri::XML::Element of the nearest, previous h1.
# in this example it's the one with the value 'Foo'
parent_element_h1 = 

# this should be a Nokogiri::XML::Element of the nearest, previous h2. 
# in this example it's the one with the value 'Bar'
parent_element_h2 =

Please note: the start_here element could be anywhere inside the document; the HTML data is just an example. The <h1> and <h2> headers could each be either a sibling of start_here or a child of a sibling of start_here.

The following recursive method is a good starting point, but it doesn't work on <h1> because it's a child of a sibling of start_here:

# Walks backwards through the previous siblings of block only;
# it never looks inside those siblings.
def search_element(block, style)
  return false if block.nil?
  return block if block.name == style

  search_element(block.previous, style)
end

parent_element_h1 = search_element(start_here,'h1')
parent_element_h2 = search_element(start_here,'h2')

After accepting an answer, I came up with my own solution. It works like a charm and I think it's pretty cool.

The approach I would take (if I am understanding your problem) is to use XPath or CSS to search for your "start_here" element and the parent element that you want to search under. Then, recursively walk the tree starting at parent, stopping when you hit the "start_here" element, and holding onto the last element that matches your style along the way.

Something like:

parent = value.search("//body").first
div = value.search("//div[@id = 'X2']").first

find = FindPriorTo.new(div)

assert_equal('Foo', find.find_from(parent, 'h1').text)
assert_equal('Bar', find.find_from(parent, 'h2').text)

Where FindPriorTo is a simple class to handle the recursion:

class FindPriorTo
  def initialize(stop_element)
    @stop_element = stop_element
  end

  # Returns the last element named style that was seen before the
  # stop element, or nil if there was none.
  def find_from(parent, style)
    @should_stop = nil
    @last_style  = nil

    recursive_search(parent, style)
  end

  def recursive_search(parent, style)
    parent.children.each do |ch|
      # Descend first, then unwind once the stop element has been seen.
      recursive_search(ch, style)
      return @last_style if @should_stop

      @should_stop = (ch == @stop_element)
      @last_style = ch if ch.name == style
    end

    @last_style
  end
end

If this approach isn't scalable enough, you might be able to optimize it by rewriting recursive_search without recursion, and by passing in both of the styles you are looking for and keeping track of the last match for each, so you don't have to traverse the tree a second time.

I'd also say try monkey patching Node to hook on when the document is getting parsed, but it looks like all of that is written in C. Perhaps you might be better served using something other than Nokogiri that has a native Ruby SAX parser (maybe REXML), or if speed is your real concern, do the search portion in C/C++ using Xerces or similar. I don't know how well these will deal with parsing HTML though.

I came across this a few years too late I suppose, but felt compelled to post because all the other solutions are way too complicated.

It's a single statement with XPath:

start = doc.at('div.block#X2')

start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
#=> <h1>Foo</h1>

start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
#=> <h2>Bar</h2>

This accommodates either direct previous siblings or children of previous siblings. Regardless of which one matches, the last() predicate ensures that you get the closest previous match.

Maybe this will do it. I'm not sure about the performance, or whether there are cases that I haven't thought of.

def find(root, start, tag)
  ps, res = start, nil
  until res || ps == root
    # Step to the previous sibling, or up to the parent when there is none.
    ps  = ps.previous || ps.parent
    res = ps.css(tag).last
    res ||= (ps.name == tag ? ps : nil)
  end
  res || "Not found!"
end

parent_element_h1 = find(parent, start_here, 'h1')

Tags: dom, ruby, ruby-on-rails, xpath, nokogiri

Javier

3 Answers: Aaron Hinni, Mark Thomas, sris
