I'm trying to fill the variables parent_element_h1 and parent_element_h2. Can anyone help me use Nokogiri to get the information I need into those variables?
require 'rubygems'
require 'nokogiri'
value = Nokogiri::HTML.parse(<<-HTML_END)
<html>
  <body>
    <p id='para-1'>A</p>
    <div class='block' id='X1'>
      <h1>Foo</h1>
      <p id='para-2'>B</p>
    </div>
    <p id='para-3'>C</p>
    <h2>Bar</h2>
    <p id='para-4'>D</p>
    <p id='para-5'>E</p>
    <div class='block' id='X2'>
      <p id='para-6'>F</p>
    </div>
  </body>
</html>
HTML_END
parent = value.css('body').first
# start_here is given: a Nokogiri::XML::Element of the <div> with the id 'X2'
start_here = parent.at('div.block#X2')
# this should be a Nokogiri::XML::Element of the nearest, previous h1.
# in this example it's the one with the value 'Foo'
parent_element_h1 =
# this should be a Nokogiri::XML::Element of the nearest, previous h2.
# in this example it's the one with the value 'Bar'
parent_element_h2 =
Please note: The start_here element could be anywhere inside the document. The HTML data is just an example. That said, the headers <h1> and <h2> could be a sibling of start_here or a child of a sibling of start_here.
The following recursive method is a good starting point, but it doesn't work on <h1> because it's a child of a sibling of start_here:
def search_element(_block, _style)
  unless _block.nil?
    if _block.name == _style
      return _block
    else
      search_element(_block.previous, _style)
    end
  else
    return false
  end
end
parent_element_h1 = search_element(start_here,'h1')
parent_element_h2 = search_element(start_here,'h2')
After accepting an answer, I came up with my own solution. It works like a charm and I think it's pretty cool.
Nokogiri tries to determine whether a CSS or XPath selector is being passed in. It's possible to write a selector that fools at or search, so occasionally Nokogiri will misinterpret it, which is why the more specific versions of the methods exist.
Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writing, modifying, and querying documents. It is fast and standards-compliant by relying on native parsers like libxml2 (C) and xerces (Java).
Nokogiri is a dependency of rails-dom-testing, which is required by Rails. As far as I can see, rails-dom-testing is used to verify certain parts of a rendered HTML/CSS page. Nokogiri can be a great tool, but it's also an 800-pound gorilla. It's unpleasant that it's a Rails dependency IMHO.
The approach I would take (if I am understanding your problem) is to use XPath or CSS to search for your "start_here" element and the parent element that you want to search under. Then, recursively walk the tree starting at parent, stopping when you hit the "start_here" element, and holding onto the last element that matches your style along the way.
Something like:
parent = value.search("//body").first
div = value.search("//div[@id = 'X2']").first
find = FindPriorTo.new(div)
assert_equal('Foo', find.find_from(parent, 'h1').text)
assert_equal('Bar', find.find_from(parent, 'h2').text)
Where FindPriorTo is a simple class to handle the recursion:
class FindPriorTo
  def initialize(stop_element)
    @stop_element = stop_element
  end

  def find_from(parent, style)
    @should_stop = nil
    @last_style = nil
    recursive_search(parent, style)
  end

  def recursive_search(parent, style)
    parent.children.each do |ch|
      recursive_search(ch, style)
      return @last_style if @should_stop
      @should_stop = (ch == @stop_element)
      @last_style = ch if ch.name == style
    end
    @last_style
  end
end
If this approach isn't scalable enough, then you might be able to optimize things by rewriting recursive_search to not use recursion, and by passing in both of the styles you are looking for and keeping track of the last one found for each, so you don't have to traverse the tree an extra time.
I'd also say try monkey patching Node to hook on when the document is getting parsed, but it looks like all of that is written in C. Perhaps you might be better served using something other than Nokogiri that has a native Ruby SAX parser (maybe REXML), or if speed is your real concern, do the search portion in C/C++ using Xerces or similar. I don't know how well these will deal with parsing HTML though.
I came across this a few years too late I suppose, but felt compelled to post because all the other solutions are way too complicated.
It's a single statement with XPath:
start = doc.at('div.block#X2')
start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
#=> <h1>Foo</h1>
start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
#=> <h2>Bar</h2>
This accommodates either direct previous siblings or children of previous siblings. Regardless of which one matches, the last() predicate ensures that you get the closest previous match.
Maybe this will do it. I'm not sure about the performance and if there might be some cases that I haven't thought of.
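def find(root, start, tag)
  ps, res = start, nil
  until res or (ps == root)
    if ps.previous
      ps = ps.previous
      # a previous sibling and everything inside it come before start
      res = ps.css(tag).last
      res ||= ps if ps.name == tag
    else
      ps = ps.parent
      # don't search the parent's descendants: some may come after start
      res = ps if ps.name == tag
    end
  end
  res || "Not found!"
end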
parent_element_h1 = find(parent, start_here, 'h1')