I am trying to understand Nokogiri. Does anyone have a link to a basic example of a Nokogiri parse/scrape showing the resultant tree? I think it would really help my understanding.
Nokogiri (http://nokogiri.org/) is the most popular open source Ruby gem for HTML and XML parsing. It parses HTML and XML documents into node sets and allows for searching with CSS3 and XPath selectors. It may also be used to construct new HTML and XML objects.
One of the best gems for Ruby on Rails is Nokogiri, a library for working with XML and HTML documents. The most common use for a parser like Nokogiri is to extract data from structured documents. Example: a list of prices from a price comparison website.
To scrape a website, you need the URL of the page you want to scrape. Pass the URL to the URI.open method (provided by open-uri) to get the HTML. Then pass the HTML to the Nokogiri::HTML method to get a parsed document whose nodes you can search using Nokogiri.
To parse XML documents, I recommend the Nokogiri gem.
Using IRB and Ruby 1.9.2:
Load Nokogiri:
> require 'nokogiri'
#=> true
Parse a document:
> doc = Nokogiri::HTML('<html><body><p>foobar</p></body></html>')
#=> #<Nokogiri::HTML::Document:0x1012821a0
@node_cache = [],
attr_accessor :errors = [],
attr_reader :decorators = nil
Nokogiri likes well-formed docs. Note that it added the DOCTYPE because I parsed as a document. It's possible to parse as a document fragment too, but that is pretty specialized.
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foobar</p></body></html>\n"
Search the document to find the first <p> node using CSS and grab its content:
> doc.at('p').text
#=> "foobar"
Use a different method name to do the same thing:
> doc.at('p').content
#=> "foobar"
Search the document for all <p> nodes inside the <body> tag, and grab the content of the first one. search returns a NodeSet, which is like an array of nodes.
> doc.search('body p').first.text
#=> "foobar"
This is an important point, and one that trips up almost everyone when first using Nokogiri. search and its css and xpath variants return a NodeSet. NodeSet.text or content concatenates the text of all the returned nodes into a single String, which can make it very difficult to take apart again.
Using slightly different HTML helps illustrate this:
> doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
> puts doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>foo</p>
<p>bar</p>
</body></html>
> doc.search('p').text
#=> "foobar"
> doc.search('p').map(&:text)
#=> ["foo", "bar"]
Returning to the original HTML, with doc parsed again from the first example...
Change the content of the node:
> doc.at('p').content = 'bar'
#=> "bar"
Emit a parsed document as HTML:
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>bar</p></body></html>\n"
Remove a node:
> doc.at('p').remove
#=> #<Nokogiri::XML::Element:0x80939178 name="p" children=[#<Nokogiri::XML::Text:0x8091a624 "bar">]>
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body></body></html>\n"
As for scraping, there are a lot of questions on Stack Overflow about using Nokogiri for tearing apart HTML from sites. Searching Stack Overflow for "nokogiri and open-uri" should help.