
Ruby parsing HTTPresponse with Nokogiri

Tags:

ruby

nokogiri


Hi, I am having trouble parsing Net::HTTPResponse objects with Nokogiri.

I use this function to fetch a page:

def fetch(uri_str, limit = 10)
   
  
  # You should choose better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0
  
  url = URI.parse(URI.encode(uri_str.strip))
  puts url
  
  #get path
  req = Net::HTTP::Get.new(url.path,headers)
  #start TCP/IP
  response = Net::HTTP.start(url.host,url.port) { |http|
        http.request(req)
  }
  case response
  when Net::HTTPSuccess
    then #print final redirect to a file
    puts "this is location" + uri_str
    puts "this is the host #{url.host}"
    puts "this is the path #{url.path}"
    
    return response
    # if you get a 302 response
  when Net::HTTPRedirection 
    then 
    puts "this is redirect" + response['location']
    return fetch(response['location'],aFile, limit - 1)
  else
    response.error!
  end
end




            html = fetch("http://www.somewebsite.com/hahaha/")
            puts html
            noko = Nokogiri::HTML(html)
            

When I do this, puts html prints a whole bunch of gibberish, and Nokogiri complains that "node_set must be a Nokogiri::XML::NodeSet".

If anyone could offer help, it would be much appreciated.

Max Pie asked Jul 05 '12 12:07




1 Answer

First, your fetch method returns a Net::HTTPResponse object, not just the body. Nokogiri expects the HTML itself, so you should pass it the body:

response = fetch("http://www.somewebsite.com/hahaha/")
puts response.body
noko = Nokogiri::HTML(response.body)
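
To make the response-versus-body distinction concrete, here is a minimal offline sketch. FakeResponse and html_source are hypothetical names introduced only for illustration (they are not part of Net::HTTP or Nokogiri); the Struct simply stands in for a real response so the snippet runs without a network connection.

```ruby
# Hypothetical stand-in for Net::HTTPResponse so the sketch runs offline.
FakeResponse = Struct.new(:code, :body)

# Hypothetical helper: return the HTML string a parser expects,
# whether the caller passes a response-like object or a plain string.
def html_source(input)
  return input if input.is_a?(String)
  return input.body.to_s if input.respond_to?(:body)

  raise ArgumentError, "expected a String or an object with #body, got #{input.class}"
end

response = FakeResponse.new('200', '<html><body><p>hello</p></body></html>')
html = html_source(response)
puts html.class
# The string can then be handed to the parser, e.g. Nokogiri::HTML(html).
```

A guard like this is optional; calling response.body directly, as above, is enough when you know you always hold a response object.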

I've updated your script so it's runnable (below). A couple of things were undefined: the headers variable, and the aFile argument passed to the recursive fetch call.

require 'nokogiri'
require 'net/http'

def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  # URI.encode was removed in Ruby 3.0, so the string is parsed as given
  # and must already be a valid, escaped URL.
  url = URI.parse(uri_str.strip)
  puts url

  # Build the GET request; request_uri includes the query string
  headers = {}
  req = Net::HTTP::Get.new(url.request_uri, headers)
  # Open the connection (TLS for https URLs) and send the request
  response = Net::HTTP.start(url.host, url.port, use_ssl: url.scheme == 'https') { |http|
    http.request(req)
  }

  case response
  when Net::HTTPSuccess
    puts "this is location " + uri_str
    puts "this is the host #{url.host}"
    puts "this is the path #{url.path}"
    return response
  when Net::HTTPRedirection
    # 3xx response: follow the Location header
    puts "this is redirect " + response['location']
    return fetch(response['location'], limit - 1)
  else
    response.error!
  end
end

response = fetch("http://www.google.com/")
puts response.body
noko = Nokogiri::HTML(response.body)
puts noko

The script runs without errors and prints the content. You may be getting the Nokogiri error because of the content you're receiving. One common problem I've encountered with Nokogiri is character encoding. Without the exact error, it's impossible to tell what's going on.

I'd recommend looking at the following Stack Overflow questions:

ruby 1.9: invalid byte sequence in UTF-8 (specifically this answer)

How to convert a Net::HTTP response to a certain encoding in Ruby 1.9.1?
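
If encoding does turn out to be the cause, one possible approach is to normalise the body to UTF-8 before printing or parsing it. The following is only a sketch using the standard library: to_utf8 is a hypothetical helper (not part of Net::HTTP or Nokogiri), and the charset argument is assumed to come from the response's Content-Type header.

```ruby
# Hypothetical helper (not part of Net::HTTP or Nokogiri): turn a raw
# response body into valid UTF-8 so it can be printed and parsed safely.
def to_utf8(body, charset = nil)
  # Fall back to UTF-8 when the charset is missing or unknown to Ruby.
  source = begin
    charset ? Encoding.find(charset) : Encoding::UTF_8
  rescue ArgumentError
    Encoding::UTF_8
  end

  text = body.dup.force_encoding(source)
  if source == Encoding::UTF_8
    # Already labelled UTF-8: just replace any invalid byte sequences.
    text.scrub('?')
  else
    # Transcode, replacing bytes that are invalid or have no UTF-8 mapping.
    text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '?')
  end
end

raw = "caf\xE9".b # bytes as they might arrive for a Latin-1 page
puts to_utf8(raw, 'ISO-8859-1')
```

With a real response, the charset could be read from response.type_params['charset'] and the result passed on as Nokogiri::HTML(to_utf8(response.body, charset)). Replacing bad bytes with '?' loses information, so treat it as a diagnostic aid rather than a fix for a mislabelled page.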

Pierre-Luc Simard answered Oct 17 '22 08:10