
Remove comments from inner_html

Tags: ruby, nokogiri

I have some code that uses Nokogiri and I am trying to get the inner_html without getting the comments.

html = Nokogiri::HTML(open(@sql_scripts_url[1])) #using first value of the array
html.css('td[class="ms-formbody"]').each do |node|
  puts node.inner_html # prints comments
end
Maverick asked Oct 24 '11 17:10



1 Answer

Since you have not provided any sample HTML or desired output, here's a general solution:

You can select comment nodes in XPath with the comment() node test, and strip them from the document by calling .remove on the resulting node set. For example:

require 'nokogiri'
doc  = Nokogiri.XML('<r><b>hello</b> <!-- foo --> world</r>')
p doc.inner_html                        #=> "<r><b>hello</b> <!-- foo --> world</r>"
doc.xpath('//comment()').remove
p doc.inner_html                        #=> "<r><b>hello</b>  world</r>"

Note that this removes the comments from the document itself. If you wish to keep the original document unmodified, you can strip them from a copy instead:

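class Nokogiri::XML::Node
  # Returns inner_html with every node matching xpath removed,
  # working on a copy so the receiver is left untouched.
  def inner_html_reject(xpath='.//comment()')
    dup.tap{ |shadow| shadow.xpath(xpath).remove }.inner_html
  end
end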

doc = Nokogiri.XML('<r><b>hello</b> <!-- foo --> world</r>')
p doc.inner_html_reject #=> "<r><b>hello</b>  world</r>"
p doc.inner_html        #=> "<r><b>hello</b> <!-- foo --> world</r>"

Finally, note that if you don't need the markup, asking for the text alone already leaves out the comments:

p doc.text              #=> "hello  world"
Phrogz answered Oct 19 '22 18:10
