
Grabbing text between all tags in Nokogiri?

Tags: ruby, nokogiri

What would be the most efficient way of grabbing all the text between HTML tags?

<div>
<a> hi </a>
....

That is, a bunch of text surrounded by HTML tags.

KJW asked Oct 03 '09 05:10


2 Answers

require "nokogiri"
doc = Nokogiri::HTML(your_html)
# Select every text node in the document and join them into one string
doc.xpath("//text()").to_s
khelll answered Oct 17 '22 19:10


Use a SAX parser, which streams through the document instead of building a tree first. It is much faster than the XPath option.

require "nokogiri"

some_html = <<-HTML
<html>
  <head>
    <title>Title!</title>
  </head>
  <body>
    This is the body!
  </body>
</html>
HTML

class TextHandler < Nokogiri::XML::SAX::Document
  def initialize
    @chunks = []
  end

  attr_reader :chunks

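  # Treat CDATA sections like ordinary text
  def cdata_block(string)
    characters(string)
  end

  # Called for each run of text; keep it unless it is only whitespace
  def characters(string)
    text = string.strip
    @chunks << text unless text.empty?
  end
end

th = TextHandler.new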
parser = Nokogiri::HTML::SAX::Parser.new(th)
parser.parse(some_html)
puts th.chunks.inspect
Bob Aman answered Oct 17 '22 20:10