How to select only leaf nodes with Nokogiri?

I am looking for some advice on how this could be done. I'm trying to find a solution that uses only XPath.

An HTML example:

<div>
  <div>
    <div>text div (leaf)</div>
    <p>text paragraph (leaf)</p>
  </div>
</div>
<p>text paragraph 2 (leaf)</p>

Code:

doc = Nokogiri::HTML.fragment("- the html above -")
result = doc.xpath("*[not(child::*)]")


[#<Nokogiri::XML::Element:0x3febf50f9328 name="p" children=[#<Nokogiri::XML::Text:0x3febf519b718 "text paragraph 2 (leaf)">]>] 

But this XPath only gives me the last "p". What I want is a flatten-like behavior that returns only the leaf nodes, at any depth.

Here are some related answers on Stack Overflow:

How to select all leaf nodes using XPath expression?

XPath - Get node with no child of specific type

Thanks

Luccas asked Dec 16 '22 08:12


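1 Answer

Your expression *[not(child::*)] is evaluated relative to the fragment, so it only tests the fragment's direct children, and the last "p" is the only one of those without child elements. To search at every depth, you can find all element nodes that have no child elements using: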

//*[not(*)]

Example:

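require 'nokogiri'

# HTML.parse wraps the markup in <html><body>, but neither is a leaf
# because both contain child elements.
doc = Nokogiri::HTML.parse <<~HTML
  <div>
    <div>
      <div>text div (leaf)</div>
      <p>text paragraph (leaf)</p>
    </div>
  </div>
  <p>text paragraph 2 (leaf)</p>
HTML

# Count the elements that have no child elements.
puts doc.xpath('//*[not(*)]').length
#=> 3

# Print the text of each leaf element.
doc.xpath('//*[not(*)]').each do |e|
  puts e.text
end
#=> text div (leaf)
#=> text paragraph (leaf)
#=> text paragraph 2 (leaf)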
Justin Ko answered Dec 31 '22 00:12