
Check if URL exists in Ruby

Tags:

ruby

How would I go about checking if a URL exists using Ruby?

For example, for the URL

https://google.com 

the result should be truthy, but for the URLs

https://no.such.domain 

or

https://stackoverflow.com/no/such/path 

the result should be falsey.

Shrikanth Hathwar asked May 06 '11 07:05


2 Answers

Use the Net::HTTP library.

require "net/http"

url = URI.parse("http://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
res = req.request_head(url.path)

At this point res is a Net::HTTPResponse object containing the result of the request. You can then check the response code:

do_something_with_it(url) if res.code == "200" 

Note: To check an https URL, set the use_ssl attribute to true:

require "net/http"

url = URI.parse("https://www.google.com/")
req = Net::HTTP.new(url.host, url.port)
req.use_ssl = true
res = req.request_head(url.path)
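
The http and https variants can be folded into one helper that derives use_ssl from the URL scheme. This is a minimal sketch, not part of the original answer: connection_settings and head_response are hypothetical names, and an empty path is assumed to mean "/".

```ruby
require "net/http"
require "uri"

# Hypothetical helper: derive the connection settings from a URL string.
def connection_settings(url_string)
  url = URI.parse(url_string)
  {
    host: url.host,
    port: url.port,                           # 80 for http, 443 for https by default
    use_ssl: url.scheme == "https",
    path: url.path.empty? ? "/" : url.path    # HEAD needs a non-empty path
  }
end

# Hypothetical helper: send a HEAD request over http or https.
# Returns a Net::HTTPResponse; connection errors are not rescued here.
def head_response(url_string)
  s = connection_settings(url_string)
  Net::HTTP.start(s[:host], s[:port], use_ssl: s[:use_ssl]) do |http|
    http.request_head(s[:path])
  end
end
```

With this in place, head_response("https://www.google.com/").code should give the status as a string, which you can compare as shown earlier.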

Simone Carletti answered Sep 25 '22 01:09


Sorry for the late reply on this, but I think this deserves a better answer.

There are three ways to look at this question:

  1. Strict check if the URL exists
  2. Check if you are requesting the URL correctly
  3. Check if you can request it correctly and the server can answer it correctly

1. Strict check if the URL exists

A 200 means that the server answers for that URL (so the URL exists), but any other status code does not mean that the URL is missing. For example, a 302 (a redirect) means that the URL exists and points to another one; to someone browsing, a 302 often behaves just like a 200. Another status code that can come back for an existing URL is 500 (internal server error). After all, if the URL did not exist, why would the application server process your request instead of simply returning 404 (not found)?

So there are really only two cases in which a URL does not exist: the server does not exist, or the server exists but cannot find the given path. Thus, the only way to check whether the URL exists is to check that the server answers and that the return code is not 404. The following code does just that.

require "net/http"

def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  # String#empty? is plain Ruby; present? would need ActiveSupport (Rails)
  path = url.path.empty? ? '/' : url.path
  res = req.request_head(path)
  res.code != "404" # false if returns 404 - not found
rescue SocketError, SystemCallError, Timeout::Error
  false # false if can't find or reach the server
end
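
For the "server does not exist" case, Net::HTTP reports the failure as an exception rather than a status code. The sketch below shows one way to turn such failures into false. It assumes that the three exception classes listed cover the failures you care about; CONNECTION_ERRORS and false_if_unreachable are hypothetical names, not part of any library.

```ruby
require "net/http"

# Assumed list of "no usable server" errors:
#   SocketError     - the host name cannot be resolved
#   SystemCallError - Errno::* failures such as Errno::ECONNREFUSED
#   Timeout::Error  - parent of Net::OpenTimeout and Net::ReadTimeout
CONNECTION_ERRORS = [SocketError, SystemCallError, Timeout::Error].freeze

# Hypothetical wrapper: run the block and return its value,
# or false if the block raises one of the connection errors above.
def false_if_unreachable
  yield
rescue *CONNECTION_ERRORS
  false
end
```

Any request code can then be wrapped, for example false_if_unreachable { req.request_head("/").code != "404" }. TLS problems raise OpenSSL::SSL::SSLError, which this list deliberately leaves out; add it if a failed handshake should also count as "does not exist".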

2. Check if you are requesting the URL correctly

However, most of the time we are not interested in whether a URL exists, but in whether we can access it. Fortunately, the HTTP status code families cover this: the 4xx family stands for client errors (that is, an error on your side: you are not requesting the page correctly, you don't have permission, and so on). This is a good family of errors to check to see whether you can access the page. From wiki:

The 4xx class of status code is intended for cases in which the client seems to have erred. Except when responding to a HEAD request, the server should include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. These status codes are applicable to any request method. User agents should display any included entity to the user.

So the following code makes sure the URL exists and you can access it:

require "net/http"

def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  # String#empty? is plain Ruby; present? would need ActiveSupport (Rails)
  path = url.path.empty? ? '/' : url.path
  res = req.request_head(path)
  if res.kind_of?(Net::HTTPRedirection) && res['location']
    # Go after any redirect and make sure you can access the redirected URL
    # (URI.join also resolves a relative Location header)
    url_exist?(URI.join(url_string, res['location']).to_s)
  else
    res.code[0] != "4" # false if http code starts with 4 - error on your side
  end
rescue SocketError, SystemCallError, Timeout::Error
  false # false if can't find or reach the server
end
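
Following redirects recursively can loop forever when two URLs redirect to each other, and a Location header may be relative rather than absolute. The sketch below is one hedged way to handle both. MAX_REDIRECTS, resolve_location and accessible? are names made up for this example, and the cap of 5 is an arbitrary assumption.

```ruby
require "net/http"
require "openssl"
require "uri"

MAX_REDIRECTS = 5 # assumed cap, not from the original answer

# Resolve a Location header (absolute or relative) against the current URL.
def resolve_location(current_url, location)
  URI.join(current_url, location).to_s
end

# Hypothetical variant of the 4xx check with a redirect budget.
def accessible?(url_string, redirects_left = MAX_REDIRECTS)
  return false if redirects_left < 0 # too many redirects: give up

  url = URI.parse(url_string)
  res = Net::HTTP.start(url.host, url.port, use_ssl: url.scheme == "https") do |http|
    http.request_head(url.path.empty? ? "/" : url.path)
  end

  if res.is_a?(Net::HTTPRedirection) && res["location"]
    # Follow the redirect, spending one unit of the budget
    accessible?(resolve_location(url_string, res["location"]), redirects_left - 1)
  else
    !res.code.start_with?("4") # false for any 4xx answer
  end
rescue SocketError, SystemCallError, Timeout::Error, OpenSSL::SSL::SSLError
  false # the server could not be found or reached
end
```

Treating an exhausted budget as "not accessible" is a design choice; raising an error instead would let the caller tell a redirect loop apart from a missing page.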

3. Check if you can request it correctly and the server can answer it correctly

Just as the 4xx family tells you whether you can access the URL, the 5xx family tells you whether the server had a problem answering your request. Errors in this family are most often due to problems on the server itself, and hopefully someone is working on fixing them. If you need to be able to access the page and get a correct answer now, make sure the answer is not from the 4xx or 5xx family and, if you were redirected, that the redirected page answers correctly. So, much like (2), you can simply use the following code:

require "net/http"

def url_exist?(url_string)
  url = URI.parse(url_string)
  req = Net::HTTP.new(url.host, url.port)
  req.use_ssl = (url.scheme == 'https')
  # String#empty? is plain Ruby; present? would need ActiveSupport (Rails)
  path = url.path.empty? ? '/' : url.path
  res = req.request_head(path)
  if res.kind_of?(Net::HTTPRedirection) && res['location']
    # Go after any redirect and make sure you can access the redirected URL
    # (URI.join also resolves a relative Location header)
    url_exist?(URI.join(url_string, res['location']).to_s)
  else
    !%w(4 5).include?(res.code[0]) # Not from 4xx or 5xx families
  end
rescue SocketError, SystemCallError, Timeout::Error
  false # false if can't find or reach the server
end
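
To compare the three checks side by side, the status-code part of each can be written as one small function with no network access. This is only an illustrative sketch: status_ok? and the level names :exists, :accessible and :healthy are invented here, and redirects are assumed to have been followed already.

```ruby
# Hypothetical helper: apply one of the three policies to a final status code.
#   :exists     - (1) anything but 404
#   :accessible - (2) anything outside the 4xx family
#   :healthy    - (3) anything outside the 4xx and 5xx families
def status_ok?(code, level)
  code = code.to_s
  family = code[0] # first digit of the status code, as a string
  case level
  when :exists     then code != "404"
  when :accessible then family != "4"
  when :healthy    then !%w[4 5].include?(family)
  else raise ArgumentError, "unknown level: #{level.inspect}"
  end
end
```

A 500 passes the first two checks but fails the third, while a 403 passes only the first; that difference is what separates the three approaches.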

fotanus answered Sep 25 '22 01:09