 

HTML to Plain Text with Ruby?

Tags:

ruby

Is there anything out there to convert HTML to plain text (maybe a Nokogiri script)? Something that would keep the line breaks, but that's about it.

If I write something in Google Docs, like this, and run that command on it, the output (after removing the CSS and JavaScript) is:

\n\n\n\n\nh1. Test h2. HELLO THEREI am some teexton the next line!!!OKAY!#*!)$! 

So the formatting's all messed up. I'm sure someone out there has already solved details like these.

asked Mar 24 '10 03:03 by Lance




1 Answer

Actually, this is much simpler:

    require 'rubygems'
    require 'nokogiri'

    # my_html is a String holding the HTML source
    puts Nokogiri::HTML(my_html).text

You still have line break issues, though, so you're going to have to figure out how you want to handle those yourself.

answered Sep 28 '22 01:09 by Matchu