I tried many URLs on this and they seem to be fine until I came across this particular one:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html"))
puts doc
This is the result:
/Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:353:in `open_http': 404 Not Found (OpenURI::HTTPError)
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:709:in `buffer_open'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:210:in `block in open_loop'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `catch'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `open_loop'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:149:in `open_uri'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:689:in `open'
from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:34:in `open'
from test.rb:5:in `<main>'
I can access this from a web browser, I just don't get it at all.
What is going on, and how can I deal with this kind of error? Can I ignore it and let the rest do their work?
You're getting 404 Not Found (OpenURI::HTTPError)
, so, if you want to allow your code to continue, rescue for that exception. Something like this should work:
require 'nokogiri'
require 'open-uri'
URLS = %w[
http://www.moxyst.com/fashion/men-clothing/underwear.html
]
URLs.each do |url|
begin
doc = Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError => e
puts "Can't access #{ url }"
puts e.message
puts
next
end
puts doc.to_html
end
You can use more generic exceptions, but then you run into problems getting weird output or might handle an unrelated problem in a way that causes more problems, so you'll need to figure out the granularity you need.
You could even sniff either the HTTPd headers, the status of the response, or look at the exception message if you want even more control and want to do something different for a 401 or a 404.
I can access this from a web browser, I just don't get it at all.
Well, that could be something happening on the server side: Perhaps they don't like the UserAgent string you're sending? The OpenURI documentation shows how to change that header:
Additional header fields can be specified by an optional hash argument.
open("http://www.ruby-lang.org/en/", "User-Agent" => "Ruby/#{RUBY_VERSION}", "From" => "[email protected]", "Referer" => "http://www.ruby-lang.org/") {|f| # ... }
You might need to pass 'User-Agent' as parameter to open method. Some sites require a valid User-Agent otherwise they simply don't respond or show a 404 not found error.
doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html", "User-Agent" => "MyCrawlerName (http://mycrawler-url.com)"))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With