what would be the most efficient way of grabbing all texts between html tags ?
<div>
<a> hi </a>
....
bunch of texts surrounded by html tags.
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").to_s
Use a Sax parser. Much faster than the XPath option.
require "nokogiri"
some_html = <<-HTML
<html>
<head>
<title>Title!</title>
</head>
<body>
This is the body!
</body>
</html>
HTML
class TextHandler < Nokogiri::XML::SAX::Document
def initialize
@chunks = []
end
attr_reader :chunks
def cdata_block(string)
characters(string)
end
def characters(string)
@chunks << string.strip if string.strip != ""
end
end
th = TextHandler.new
parser = Nokogiri::HTML::SAX::Parser.new(th)
parser.parse(some_html)
puts th.chunks.inspect
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With