
How do I get the HTML body from a URL using Nokogiri in Rails?

I want to parse the body attributes from a URL.

For example:

url = 'http://rca.yandex.com/?key=rca.1.1.20140120T051507Z.3db118ab435efdff.6c84331313b6b7d66abd191410f72e0e1c3c8795&url=http://endtimeheadlines.wordpress.com/2014/01/17/think-tank-extraordinary-crisis-needed-to-preserve-new-world-order/#comment-36708?utm_source=twitterfeed&utm_medium=facebook[&callback=http://64.191.99.245:3023/posts][&full=1]'

When I try:

page = Nokogiri::HTML(html)

I get:

#<Nokogiri::HTML::Document:0x52fd6d6 name="document" children=[#<Nokogiri::XML::DTD:0x52fd1f4 name="html">, #<Nokogiri::XML::Element:0x52fc6aa name="html" children=[#<Nokogiri::XML::Element:0x5301f56 name="body" children=[#<Nokogiri::XML::Element:0x53018d0 name="p" children=[#<Nokogiri::XML::Text:0x53015f6 "http://rca.yandex.com/?key=rca.1.1.20140120T051507Z.3db118ab435efdff.6c84331313b6b7d66abd191410f72e0e1c3c8795&url=http://endtimeheadlines.wordpress.com/2014/01/17/think-tank-extraordinary-crisis-needed-to-preserve-new-world-order/#comment-36708?utm_source=twitterfeed&utm_medium=facebook[&callback=http://64.191.99.245:3023/posts][&full=1]">]>]>]>]>

How do I get the attributes inside this URL?

For example, I want to call page.css("div") and get values from the HTML body.

asked Jan 20 '14 06:01 by tardjo


2 Answers

It's not exactly clear what you're trying to do, but this might help:

require 'nokogiri'

html = '<html><head><title>foo</title><body><p>bar</p></body></html>'

doc = Nokogiri::HTML(html)

Using at, you'll find the first occurrence of the tag, which is sensible in an HTML document since you should only have a single <body> tag.

doc.at('body') # => #<Nokogiri::XML::Element:0x3ff194d24cd4 name="body" children=[#<Nokogiri::XML::Element:0x3ff194d24acc name="p" children=[#<Nokogiri::XML::Text:0x3ff194d248c4 "bar">]>]>

If you want the children of the tag, use children to retrieve them:

doc.at('body').children # => [#<Nokogiri::XML::Element:0x3ff194d24acc name="p" children=[#<Nokogiri::XML::Text:0x3ff194d248c4 "bar">]>]

If you want to get the child nodes as HTML:

doc.at('body').children.to_html # => "<p>bar</p>"
doc.at('body').inner_html # => "<p>bar</p>"

If you want the text content of the body tag:

doc.at('body').content # => "bar"
doc.at('body').text # => "bar"

If, by "attributes", you really mean the attributes of the <body> tag itself:

require 'nokogiri'

html = '<html><head><title>foo</title><body on_load="do_something()"><p>bar</p></body></html>'

doc = Nokogiri::HTML(html)
doc.at('body').attributes # => {"on_load"=>#<Nokogiri::XML::Attr:0x3fdc3d923ca0 name="on_load" value="do_something()">}
doc.at('body')['on_load'] # => "do_something()"

attributes returns a hash that maps each attribute name to a Nokogiri::XML::Attr object, so you can look up any attribute by name. As a shortcut, a Nokogiri::XML::Node also understands [], which gives typical Hash-style access straight to the attribute's value.

answered Sep 21 '22 14:09 by the Tin Man


page.css('body') should work. If not, try calling to_s on the result.

answered Sep 20 '22 14:09 by skozz