I'd like to open a web page with Nokogiri, extract all the words a user sees when visiting the page in a browser, and analyze the word frequency.
What is the easiest way to get all readable words out of an HTML document with Nokogiri? The ideal code snippet would take an HTML page (as a file, say) and return an array of the individual words from all types of readable elements.
(No need to worry about JavaScript or CSS hiding elements and thus hiding words; all words designed for display are fine.)
You want the Nokogiri::XML::Node#inner_text method:
require 'nokogiri'
require 'open-uri'
# URI.open comes from open-uri; on Ruby 3.0+ a bare Kernel#open no longer fetches URLs
html = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/6129357'))
# Alternatively
html = Nokogiri::HTML(IO.read 'myfile.html')
text = html.at('body').inner_text
# Pretend that all words we care about contain only ASCII letters, digits, or underscores
words = text.scan(/\w+/)
p words.length, words.uniq.length, words.uniq.sort[0..8]
#=> 907
#=> 428
#=> ["0", "1", "100", "15px", "2", "20", "2011", "220px", "24158nokogiri"]
# How about words that are only letters?
words = text.scan(/[a-z]+/i)
p words.length, words.uniq.length, words.uniq.sort[0..5]
#=> 872
#=> 406
#=> ["Answer", "Ask", "Badges", "Browse", "DocumentFragment", "Email"]
# Find the most frequent words
require 'pp'
def frequencies(words)
  Hash[
    words.group_by(&:downcase).map{ |word,instances|
      [word,instances.length]
    }.sort_by(&:last).reverse
  ]
end
pp frequencies(words)
#=> {"nokogiri"=>34,
#=> "a"=>27,
#=> "html"=>18,
#=> "function"=>17,
#=> "s"=>13,
#=> "var"=>13,
#=> "b"=>12,
#=> "c"=>11,
#=> ...
# Hrm...let's drop the javascript code out of our words
html.css('script').remove
words = html.at('body').inner_text.scan(/\w+/)
pp frequencies(words)
#=> {"nokogiri"=>36,
#=> "words"=>18,
#=> "html"=>17,
#=> "text"=>13,
#=> "with"=>12,
#=> "a"=>12,
#=> "the"=>11,
#=> "and"=>11,
#=> ...
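As a rough variation on the `frequencies` helper above, here is a minimal sketch that works on text already extracted with `inner_text`. It assumes Ruby 2.7 or newer (for `Enumerable#tally`) and that a "word" is any run of Unicode letters, which is a different definition from the `\w+` used above. The name `word_frequencies` is made up for this illustration and is not part of Nokogiri.

```ruby
# Hypothetical helper (not from the answer above): counts words in text that
# has already been extracted, for example with html.at('body').inner_text.
def word_frequencies(text)
  text.scan(/\p{L}+/)                            # runs of letters in any script; digits and "_" are skipped
      .map(&:downcase)                           # fold case so "Nokogiri" and "nokogiri" count together
      .tally                                     # word => count (assumes Ruby 2.7+)
      .sort_by { |word, count| [-count, word] }  # most frequent first, ties in alphabetical order
      .to_h
end

sample = "Nokogiri parses HTML. Café, café and nokogiri!"
p word_frequencies(sample)
```

In Ruby, `\w` only matches ASCII letters, digits, and underscores, so a word such as "café" would be split at the accented letter; `\p{L}` avoids that at the cost of dropping numbers entirely. Sorting on `[-count, word]` also makes the order of tied words deterministic, which `sort_by(&:last).reverse` does not guarantee.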
If you really want to do this with Nokogiri (and you can otherwise just use regex to strip tags), then you should: