404 not found, but can access normally from web browser

I tried this script on many URLs and they all worked until I came across this particular one:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html"))
puts doc

This is the result:

/Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:353:in `open_http': 404 Not Found (OpenURI::HTTPError)
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:709:in `buffer_open'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:210:in `block in open_loop'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `catch'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:208:in `open_loop'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:149:in `open_uri'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:689:in `open'
    from /Users/macbookair/.rvm/rubies/ruby-2.0.0-p481/lib/ruby/2.0.0/open-uri.rb:34:in `open'
    from test.rb:5:in `<main>'  

I can access this from a web browser, I just don't get it at all.

What is going on, and how can I deal with this kind of error? Can I ignore it and let the rest do their work?

iboss asked Sep 05 '14 18:09


2 Answers

You're getting 404 Not Found (OpenURI::HTTPError), so, if you want to allow your code to continue, rescue for that exception. Something like this should work:

require 'nokogiri'
require 'open-uri'

URLS = %w[
  http://www.moxyst.com/fashion/men-clothing/underwear.html
]

URLS.each do |url|
  begin
    doc = Nokogiri::HTML(open(url))
  rescue OpenURI::HTTPError => e
    puts "Can't access #{ url }"
    puts e.message
    puts
    next
  end
  puts doc.to_html
end

You can rescue a more generic exception class, but then you risk swallowing unrelated errors and handling them in a way that causes further problems, so decide how fine-grained the handling needs to be.

You could even inspect the HTTP response headers, the response status, or the exception message if you want more control, for example to handle a 401 differently from a 404.
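As a rough sketch of inspecting the response status: the exception raised by OpenURI carries the response in its io attribute, and io.status holds the code and reason phrase as strings. The helper names classify_http_error and fetch_or_skip, and the category symbols, are made up for illustration. URI.open assumes Ruby 2.5 or newer; on the Ruby 2.0 shown in the question, the plain open call would be used instead.

```ruby
require 'open-uri'

# Hypothetical helper: map an OpenURI::HTTPError to a coarse category.
# error.io.status is a two-element array such as ["404", "Not Found"].
def classify_http_error(error)
  code = error.io.status.first.to_i
  case code
  when 401, 403 then :access_denied
  when 404      then :not_found
  when 500..599 then :server_error
  else               :other
  end
end

# Hypothetical helper: return the page body as a String, or nil after
# printing a warning when the server answers with an HTTP error.
# URI.open assumes Ruby 2.5+; older versions use Kernel#open.
def fetch_or_skip(url)
  URI.open(url, &:read)
rescue OpenURI::HTTPError => e
  warn "Skipping #{url}: #{classify_http_error(e)} (#{e.message})"
  nil
end
```

Inside a loop over URLs, calling fetch_or_skip(url) and skipping to the next URL when it returns nil keeps the rest of the run going, while the category makes it possible to treat a 401 and a 404 differently.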

"I can access this from a web browser, I just don't get it at all."

Well, that could be something happening on the server side: perhaps they don't like the User-Agent string you're sending. The OpenURI documentation shows how to change that header:

Additional header fields can be specified by an optional hash argument.

open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}
the Tin Man answered Sep 29 '22 05:09


You might need to pass a 'User-Agent' header to the open method. Some sites require a valid User-Agent; otherwise they simply don't respond, or they return a 404 Not Found error.

doc = Nokogiri::HTML(open("http://www.moxyst.com/fashion/men-clothing/underwear.html", "User-Agent" => "MyCrawlerName (http://mycrawler-url.com)"))
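If several pages need the same headers, the one-liner can be wrapped in small helpers. This is only a sketch: request_headers, fetch_html, and DEFAULT_USER_AGENT are hypothetical names, the User-Agent value is a placeholder, and URI.open assumes Ruby 2.5 or newer (on Ruby 2.0 the plain open call takes the same header hash).

```ruby
require 'open-uri'

# Placeholder crawler identity; replace with your own name and contact URL.
DEFAULT_USER_AGENT = 'MyCrawlerName (http://mycrawler-url.com)'

# Build the header hash that open-uri accepts as its optional argument.
# String keys are sent as HTTP request header fields.
def request_headers(user_agent = DEFAULT_USER_AGENT, referer = nil)
  headers = { 'User-Agent' => user_agent }
  headers['Referer'] = referer if referer
  headers
end

# Fetch the page body as a String, sending the given headers.
# Raises OpenURI::HTTPError if the server still answers with an error.
def fetch_html(url, headers = request_headers)
  URI.open(url, headers, &:read)
end
```

The returned string can be passed to Nokogiri::HTML as before. Whether a particular site accepts a given User-Agent is up to that site, so this may or may not clear the 404.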
Deepak Sharma answered Sep 29 '22 03:09