Let's say my HTML document is like:
<div class="headline">News</div>
<p>Some interesting news here</p>
<div class="headline">Sports</div>
<p>Baseball is fun!</p>
I can get the headline
divs with the following code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "mypage.html"
doc = Nokogiri::HTML(open(url))
doc.css(".headline").each do |item|
puts item.text
end
But how do I access the content in the following p
tag so that News
is related to Some interesting news here
, etc?
Nokogiri makes an attempt to determine whether a CSS or XPath selector is being passed in. It's possible to create a selector that fools at or search so occasionally it will misunderstand, which is why we have the more specific versions of the methods.
One of the best gems for Ruby on Rails is Nokogiri which is a library to deal with XML and HTML documents. The most common use for a parser like Nokogiri is to extract data from structured documents.
Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby. It provides a sensible, easy-to-understand API for reading, writing, modifying, and querying documents. It is fast and standards-compliant by relying on native parsers like libxml2 (CRuby) and xerces (JRuby).
You want Node#next_element:
doc.css(".headline").each do |item|
puts item.text
puts item.next_element.text
end
There is also item.next
, but that will also return text nodes, where item.next_element
will return only element nodes (like p
).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With