Are there any safe assumptions to make about the availability of a URL?

Tags: http, url
I am trying to determine whether there is a way to check the availability of a potentially large list of URLs (more than 1,000,000) without having to send a GET request to every single one.

Is it safe to assume that if http://www.example.com is inaccessible (that is, I cannot connect to the server, the DNS request for the domain fails, or I get a 4XX or 5XX response), then anything from that domain will also be inaccessible (e.g. http://www.example.com/some/path/to/a/resource/named/whatever.jpg)? Would a 302 response (say for whatever.jpg) be enough to invalidate the first assumption? I imagine subdomains should be considered distinct, since http://subdomain.example.com and http://www.example.com may not resolve to the same IP address.

I seem to be able to think of a counterexample for each shortcut I come up with. Should I just bite the bullet and send out GET requests to every URL?

Kevin Loney asked Dec 18 '22 09:12


2 Answers

Unfortunately, no: you cannot infer anything about other URLs from a 4xx, a 5xx, or any other status code.

Those codes are for individual pages, not for the server. It's quite possible that one page is down and another is up, or one has a 500 server-side error and another doesn't.

What you can do is use HEAD instead of GET. A HEAD request returns the status line and response headers for the page but not the page content. This can save time server-side (because the server may not have to render the page) and on your side (because you don't have to buffer and then discard content).
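As a rough illustration of the HEAD approach, here is a minimal Python sketch using only the standard library. The function names `build_head_request` and `head_status` and the 10 second timeout are my own choices for the example, not anything prescribed by HTTP.

```python
import http.client
import urllib.error
import urllib.request


def build_head_request(url):
    """Build a request that asks for headers only (HEAD instead of GET)."""
    return urllib.request.Request(url, method="HEAD")


def head_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no usable response arrived."""
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout) as response:
            # urlopen follows redirects, so this is the status of the final response.
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses are raised as HTTPError
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return None  # DNS failure, refused connection, timeout, malformed reply
```

Calling `head_status("http://www.example.com/some/path/to/a/resource/named/whatever.jpg")` would then give you a status code for that one URL only, which is all a status code can tell you.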

I also suggest using keep-alive (persistent connections) to speed up repeated requests to the same server. Many HTTP client libraries will do this for you.
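If your library does not handle keep-alive for you, something like the following sketch reuses one connection for several paths on the same host. It uses Python's `http.client`; the function name `head_paths` and its parameters are assumptions for illustration, and for https URLs you would use `http.client.HTTPSConnection` instead.

```python
import http.client


def head_paths(host, paths, port=80, timeout=10.0):
    """HEAD each path on one host over a single reused connection.

    Returns a dict mapping each path to its status code, or to None
    if the request failed.
    """
    statuses = {}
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        for path in paths:
            try:
                conn.request("HEAD", path)
                response = conn.getresponse()
                response.read()  # finish this response before reusing the connection
                statuses[path] = response.status
            except (http.client.HTTPException, OSError):
                statuses[path] = None
                conn.close()  # the next request() reopens the connection
    finally:
        conn.close()
    return statuses
```

Whether the connection actually stays open between requests is up to the server; if it closes the connection, `http.client` reconnects on the next request.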

Jason Cohen answered Jun 02 '23 02:06


A failed DNS lookup for a host (e.g. www.example.com) should be enough to invalidate all URLs for that host. Subdomains or other hosts would have to be checked separately though.
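One way to use this on a big list is to group the URLs by host and do a single DNS lookup per host. This is only a sketch: the function names are made up for the example, and since a lookup can also fail for temporary reasons, you may want to retry the failed hosts once before discarding them.

```python
import socket
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Map each hostname to the URLs that use it (subdomains stay separate)."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).hostname].append(url)
    return dict(groups)


def resolves(host):
    """Return True if the system resolver finds at least one address for host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def split_by_dns(urls, resolves=resolves):
    """Return (urls_to_check, urls_on_unresolvable_hosts)."""
    to_check, dead = [], []
    for host, host_urls in group_by_host(urls).items():
        # A URL with no hostname at all is treated as unresolvable.
        target = to_check if host and resolves(host) else dead
        target.extend(host_urls)
    return to_check, dead
```

Only the URLs in the first list still need an HTTP request each; www.example.com and subdomain.example.com end up in different groups and are looked up separately.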

A 4xx code might tell you that a particular page isn't available, but you couldn't make any assumptions about other pages from that.

A 5xx code really won't tell you anything. For example, it could be that the page is there, but the server is just too busy at the moment. If you try it again later it might work fine.
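Put together, the per-URL decision could look roughly like this. The labels and the function name `classify` are invented for the sketch, and treating 408 and 429 as retryable is my own assumption (they usually signal a temporary condition rather than a missing page).

```python
def classify(status):
    """Turn one URL's status code (or None for no response) into a rough verdict."""
    if status is None:
        return "retry"  # no response at all; could be temporary
    if 200 <= status < 300:
        return "available"
    if 300 <= status < 400:
        return "redirect"  # check the redirect target separately
    if status in (408, 429):
        return "retry"  # assumption: timeout or rate limiting, try again later
    if 400 <= status < 500:
        return "unavailable"  # applies to this URL only, not to its siblings
    return "retry"  # 5xx or anything unexpected
```

The verdict is always for the single URL that was requested; nothing here is carried over to other URLs on the same host.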

Eric Petroelje answered Jun 02 '23 00:06