I have just started learning Ruby. Very cool language, liking it a lot.
I am using the very handy Hpricot HTML parser.
What I am looking to do is grab all the text from the page, excluding the HTML tags.
Example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Data Protection Checks</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
This is what I want to grab.
</div>
<p>
I also want to grab this text
</p>
</body>
</html>
I basically want to grab only the text, so that I end up with a string like this:
"This is what I want to grab. I also want to grab this text"
What would be the best method of doing this?
Cheers
Eef
You can do this using the XPath text() selector.
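require 'hpricot'
require 'open-uri'
# Fetch the page and parse it. On Ruby 3.0+ a bare open(url) no longer
# fetches URLs, so URI.open (provided by open-uri) is used here.
doc = URI.open("http://stackoverflow.com/") { |f| Hpricot(f) }
# Select every text node in the document.
text = (doc/"//*/text()") # array of text values
puts text.join("\n")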
However, this is a fairly expensive operation. A better solution might be available.
You might want to try inner_text.
Like this:
require 'hpricot'
h = Hpricot("<html><body><a href='http://yoursite.com?utm=trackmeplease'>http://yoursite.com</a> is <strong>awesome</strong>")
puts h.inner_text
# Output:
# http://yoursite.com is awesome
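The extracted text may still contain the page's original line breaks and indentation, whereas the question asks for a single space-separated string. Here is a minimal sketch of that last step in plain Ruby, with no Hpricot calls involved. The helper name squish_text is hypothetical, and the raw string is only an assumed stand-in for what the extraction step might return for the body of the example page.

```ruby
# Hypothetical helper: collapse every run of whitespace (spaces, tabs,
# newlines) into a single space and trim both ends.
def squish_text(raw)
  raw.gsub(/\s+/, " ").strip
end

# Assumed stand-in for the text extracted from the example page's <body>.
raw = "\n\nThis is what I want to grab.\n\n\nI also want to grab this text\n\n"

puts squish_text(raw)
# prints: This is what I want to grab. I also want to grab this text
```

The same helper could be applied to the joined result of the text() approach, since it only operates on a plain String.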