
Strip text from HTML document using Ruby

There are lots of examples of how to strip HTML tags from a document using Ruby. Hpricot and Nokogiri both have inner_text methods that remove all the HTML for you easily and quickly.

What I am trying to do is the opposite: remove all the text from an HTML document, leaving just the tags and their attributes.

I considered looping through the document and setting inner_html to nil on each element, but that would have to be done in reverse. The first element (the root) has an inner_html containing the entire rest of the document, so ideally I'd start at the innermost elements and set inner_html to nil while moving up through the ancestors.

Does anyone know a neat little trick for doing this efficiently? I was thinking regexes might do it, but probably not as efficiently or reliably as an HTML tokenizer/parser would.
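For reference, the regex idea could be sketched roughly as below. This is only an illustration, not a robust solution: the helper name tags_only and the pattern are hypothetical, and it assumes no ">" appears inside attribute values. It also does nothing special for comments or for the contents of script and style elements.

```ruby
# Collect every <...> run in the string and glue the pieces back together,
# which drops all text found between tags.
# Assumption: no ">" characters occur inside attribute values.
TAG_PATTERN = /<[^>]+>/

# Hypothetical helper name, introduced only for this sketch.
def tags_only(html)
  html.scan(TAG_PATTERN).join
end

result = tags_only('<div class="a"><p>Hello <b>world</b></p></div>')
puts result
# prints: <div class="a"><p><b></b></p></div>
```

Because the pattern knows nothing about HTML structure, a parser-based approach is likely to be the safer choice for real documents.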

asked Sep 30 '09 at 11:09 by davidsmalley



1 Answer

This works too:

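require 'nokogiri'

doc = Nokogiri::HTML(your_html)  # your_html: the HTML source as a String
doc.xpath("//text()").remove     # removes every text node; tags and attributes stay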
answered Sep 23 '22 at 12:09 by andre-r