I am trying to understand Nokogiri. Does anyone have a link to a basic example of a Nokogiri parse/scrape that shows the resulting tree? I think it would really help my understanding.


What are some examples of using Nokogiri?

1 Answer

Using IRB and Ruby 1.9.2:

Load Nokogiri:

> require 'nokogiri'
#=> true

Parse a document:

> doc = Nokogiri::HTML('<html><body><p>foobar</p></body></html>')
#=> #<Nokogiri::HTML::Document:0x1012821a0
      @node_cache = [],
      attr_accessor :errors = [],
      attr_reader :decorators = nil

Nokogiri likes well-formed docs. Note that it added the DOCTYPE because I parsed the HTML as a full document. It's also possible to parse as a document fragment, but that is a fairly specialized use.

> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foobar</p></body></html>\n"

Search the document to find the first <p> node using CSS and grab its content:

> doc.at('p').text
#=> "foobar"

Use a different method name to do the same thing:

> doc.at('p').content
#=> "foobar"

Search the document for all <p> nodes inside the <body> tag, and grab the content of the first one. search returns a NodeSet, which is like an array of nodes.

> doc.search('body p').first.text
#=> "foobar"

This is an important point, and one that trips up almost everyone when first using Nokogiri: search and its css and xpath variants return a NodeSet. Calling text or content on a NodeSet concatenates the text of all the returned nodes into a single String, which can make it very difficult to take apart again.

Slightly different HTML helps illustrate this:

> doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
> puts doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>foo</p>
<p>bar</p>
</body></html>

> doc.search('p').text
#=> "foobar"

> doc.search('p').map(&:text)
#=> ["foo", "bar"]

Returning to the original HTML (re-parse it into doc first, since doc was reassigned above)...

Change the content of the node:

> doc.at('p').content = 'bar'
#=> "bar"

Emit a parsed document as HTML:

> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>bar</p></body></html>\n"

Remove a node:

> doc.at('p').remove
#=> #<Nokogiri::XML::Element:0x80939178 name="p" children=[#<Nokogiri::XML::Text:0x8091a624 "bar">]>
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body></body></html>\n"

As for scraping, there are a lot of questions on Stack Overflow about using Nokogiri to tear apart HTML from sites. Searching Stack Overflow for "nokogiri and open-uri" should help.


answered Sep 30 '22 04:09

the Tin Man


Tags:

ruby

nokogiri

user1094747
