Is there anything out there to convert html to plain text (maybe a nokogiri script)? Something that would keep the line breaks, but that's about it.
If I write something on googledocs, like this, and run that command, it outputs (removing the css and javascript), this:
\n\n\n\n\nh1. Test h2. HELLO THEREI am some teexton the next line!!!OKAY!#*!)$!
So the formatting's all messed up. I'm sure someone has solved the details like these somewhere out there.
You can now write HTML documents that contain embedded Ruby to generate forms and content dynamically.
The <plaintext> HTML element renders everything following the start tag as raw text, ignoring any following HTML. There is no closing tag, since everything after it is considered raw text. Warning: Do not use this element. <plaintext> is deprecated since HTML 2, and not all browsers implemented it.
var text = html. replace(/<\/?[^>]+>/gi, ' '); The problem with the above approach is that it may fail for malformed HTML or when the HTML content contains entities like dashes, ampersands and other punctuation codes.
Actually, this is much simpler:
require 'rubygems' require 'nokogiri' puts Nokogiri::HTML(my_html).text
You still have line break issues, though, so you're going to have to figure out how you want to handle those yourself.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With