Removing <p> elements with no text with Nokogiri

Tags: ruby, nokogiri

Given an HTML document in Nokogiri, I want to remove all <p> nodes with no actual text. This includes <p> elements that contain only whitespace and/or <br/> tags. What's the most elegant way to do this?

dan asked Dec 10 '22 06:12


2 Answers

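This is a simpler fix: remove every <p> whose text content is empty or whitespace only. That covers both the whitespace-only paragraphs and the ones that contain nothing but <br> tags.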

Given the HTML:

"<p>  </p><p>Foo</p><p><br/> <br>   </p>"

Solution:

document.css('p').each do |p|
  # Plain Ruby solution, as pointed out by Michael Hartl:
  p.remove if p.content.strip.empty?

  # Ruby on Rails solution (needs ActiveSupport); use this line instead:
  # p.remove if p.content.blank?
end
# document => <p>Foo</p>
davegson answered Dec 31 '22 14:12


I would start with a helper method like this one (feel free to monkeypatch Nokogiri::XML::Node instead if you want to):

# A node is blank if it is a whitespace-only text node or a <br> element.
def is_blank?(node)
  (node.text? && node.content.strip == '') || (node.element? && node.name == 'br')
end

Then continue with another method that checks that all children are blank:

def all_children_are_blank?(node)
  node.children.all? { |child| is_blank?(child) }
  # Here you see the convenience of monkeypatching... sometimes.
end

And finally, get the document and remove every <p> whose children are all blank:

document.css('p').find_all{|p| all_children_are_blank?(p) }.each do |p|
  p.remove
end
Serabe answered Dec 31 '22 13:12