
How to navigate the DOM using Nokogiri

I'm trying to fill the variables parent_element_h1 and parent_element_h2. Can anyone help me use Nokogiri to get the information I need into those variables?

require 'rubygems'
require 'nokogiri'

value = Nokogiri::HTML.parse(<<-HTML_END)
  <html>
    <body>
      <p id='para-1'>A</p>
      <div class='block' id='X1'>
        <h1>Foo</h1>
        <p id='para-2'>B</p>
      </div>
      <p id='para-3'>C</p>
      <h2>Bar</h2>
      <p id='para-4'>D</p>
      <p id='para-5'>E</p>
      <div class='block' id='X2'>
        <p id='para-6'>F</p>
      </div>
    </body>
  </html>
HTML_END

parent = value.css('body').first

# start_here is given: a Nokogiri::XML::Element for the <div> with the id 'X2'
start_here = parent.at('div.block#X2')

# This should be a Nokogiri::XML::Element for the nearest previous h1.
# In this example it's the one with the value 'Foo'.
parent_element_h1 = nil # to be filled in

# This should be a Nokogiri::XML::Element for the nearest previous h2.
# In this example it's the one with the value 'Bar'.
parent_element_h2 = nil # to be filled in

Please note: the start_here element could be anywhere inside the document. The HTML data is just an example. That said, the <h1> and <h2> headers are always either siblings of start_here or children of a sibling of start_here.

The following recursive method is a good starting point, but it only walks back through previous siblings, so it fails to find the <h1>, which is a child of a sibling of start_here:

def search_element(_block,_style)
  unless _block.nil?
    if _block.name == _style
      return _block
    else
      search_element(_block.previous,_style)
    end
  else
    return false
  end
end

parent_element_h1 = search_element(start_here,'h1')
parent_element_h2 = search_element(start_here,'h2')

After accepting an answer, I came up with my own solution. It works like a charm and I think it's pretty cool.

Javier asked Mar 18 '09 09:03



3 Answers

The approach I would take (if I am understanding your problem) is to use XPath or CSS to search for your "start_here" element and the parent element that you want to search under. Then, recursively walk the tree starting at parent, stopping when you hit the "start_here" element, and holding onto the last element that matches your style along the way.

Something like:

parent = value.search("//body").first
div = value.search("//div[@id = 'X2']").first

find = FindPriorTo.new(div)

assert_equal('Foo', find.find_from(parent, 'h1').text)
assert_equal('Bar', find.find_from(parent, 'h2').text) 

Where FindPriorTo is a simple class to handle the recursion:

class FindPriorTo
  def initialize(stop_element)
    @stop_element = stop_element
  end

  def find_from(parent, style)
    @should_stop = nil
    @last_style  = nil

    recursive_search(parent, style)
  end

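  def recursive_search(parent, style)
    parent.children.each do |ch|
      # Descend first; unwind once the stop element has been seen.
      recursive_search(ch, style)
      return @last_style if @should_stop

      # Flag the stop element, and remember the latest node named `style`.
      @should_stop = (ch == @stop_element)
      @last_style = ch if ch.name == style
    end

    @last_style
  end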

end

If this approach isn't scalable enough, you might be able to optimize it by rewriting recursive_search without recursion, and by passing in both of the styles you are looking for and tracking the last match for each, so you don't have to traverse the tree a second time.

I'd also say try monkey patching Node to hook on when the document is getting parsed, but it looks like all of that is written in C. Perhaps you might be better served using something other than Nokogiri that has a native Ruby SAX parser (maybe REXML), or if speed is your real concern, do the search portion in C/C++ using Xerces or similar. I don't know how well these will deal with parsing HTML though.

Aaron Hinni answered Oct 09 '22 06:10


I came across this a few years too late I suppose, but felt compelled to post because all the other solutions are way too complicated.

It's a single statement with XPath:

start = doc.at('div.block#X2')

start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
#=> <h1>Foo</h1>

start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
#=> <h2>Bar</h2>

This accommodates either direct previous siblings or children of previous siblings. Regardless of which one matches, the last() predicate ensures that you get the closest previous match.

Mark Thomas answered Oct 09 '22 06:10


Maybe this will do it. I'm not sure about the performance, or whether there are cases I haven't thought of.

def find(root, start, tag)
  ps, res = start, nil
  until res or (ps == root)
    ps  = ps.previous || ps.parent
    res = ps.css(tag).last
    res ||= ps.name == tag ? ps : nil
  end
  res || "Not found!"
end

parent_element_h1 =  find(parent, start_here, 'h1')

sris answered Oct 09 '22 04:10