I'm trying to find the most efficient way to test 300,000+ URLs in a database to basically check if the URLs are still valid. Having looked around the site I've found many excellent answers and am now using something along the lines of:
Read URL from file.... Test URL:
final URL url = new URL("http://" + address);
final HttpURLConnection urlConn = (HttpURLConnection) url.openConnection();
urlConn.setConnectTimeout(1000 * 10);
urlConn.connect();
urlConn.getResponseCode(); // Do something with the code
urlConn.disconnect();
Write details back to file....
So a couple of questions:
1) Is there a more efficient way to test URLs and get response codes?
2) Initially I am able to test about 50 URLs per minute, but after 5 or so minutes things really slow down. I imagine there are some resources I'm not releasing, but I'm not sure which.
3) Certain URLs (e.g. www.bhs.org.au) will cause the above to hang for minutes (not good when I have so many URLs to test) even with the connect timeout set. Is there any way I can tighten this up?
Thanks in advance for any help, it's been quite a few years since I've written any code and I'm starting again from scratch :-)
By far the fastest way to do this would be to use java.nio to open a regular TCP connection to your target host on port 80. Then, simply send it a minimal HTTP request and process the result yourself.
The main advantage of this is that you can have a pool of 10 or 100 or even 1000 connections open and loading at the same time rather than having to do them one after the other. With this, for example, it won't matter much if one server (www.bhs.org.au) takes several minutes to respond. It'll simply hog one of your many connections in the pool, but others will keep running.
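To make the "minimal HTTP request" part concrete, here is a rough sketch of building the request text and pulling the status code out of the reply. The class and method names (RawHeadRequest, buildRequest, parseStatusCode, fetchStatus) are made up for illustration, and sending HEAD rather than GET is my assumption. Note that fetchStatus uses a blocking SocketChannel with no timeout to keep the example short; the pooled version described above would instead register many non-blocking channels with a Selector.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class RawHeadRequest {

    /** Builds a minimal HTTP/1.1 HEAD request for the root path of the given host. */
    static String buildRequest(String host) {
        return "HEAD / HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    /** Extracts the status code from a reply such as "HTTP/1.1 200 OK", or returns -1. */
    static int parseStatusCode(String response) {
        int lineEnd = response.indexOf("\r\n");
        String statusLine = lineEnd >= 0 ? response.substring(0, lineEnd) : response;
        String[] parts = statusLine.split(" ", 3);
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Blocking, single-connection illustration only (no timeout handling). */
    static int fetchStatus(String host, int port) throws IOException {
        try (SocketChannel channel = SocketChannel.open(new InetSocketAddress(host, port))) {
            ByteBuffer request =
                ByteBuffer.wrap(buildRequest(host).getBytes(StandardCharsets.US_ASCII));
            while (request.hasRemaining()) {
                channel.write(request);
            }
            // Read until the first line of the reply (the status line) is complete.
            ByteBuffer buf = ByteBuffer.allocate(1024);
            while (buf.hasRemaining() && channel.read(buf) != -1) {
                String soFar = new String(buf.array(), 0, buf.position(), StandardCharsets.US_ASCII);
                if (soFar.contains("\r\n")) {
                    break;
                }
            }
            return parseStatusCode(
                new String(buf.array(), 0, buf.position(), StandardCharsets.US_ASCII));
        }
    }
}
```

Something like RawHeadRequest.fetchStatus("www.example.com", 80) would then give the status code for one host, with www.example.com standing in for a real address from your file.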
You could also achieve that same thing with a little more overhead but a lot less complex coding by using a Thread Pool to run many HttpURLConnections (the way you are doing it now) in parallel in multiple threads.
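A minimal sketch of the thread pool variant, assuming a fixed-size pool from java.util.concurrent. The names ParallelUrlChecker, check and checkAll are hypothetical, the pool size is something you would have to tune, and the 10-second read timeout is my addition on top of the connect timeout from the question.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelUrlChecker {

    /** Returns the HTTP status code, or -1 if the URL is malformed or the request fails. */
    static int check(String url) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000); // assumption: also bound the wait for a reply
            return conn.getResponseCode();
        } catch (Exception e) {
            return -1;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    /** Checks all URLs on a fixed-size pool and returns results in input order. */
    static Map<String, Integer> checkAll(List<String> urls, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<Integer> task = () -> check(url);
                futures.add(pool.submit(task));
            }
            Map<String, Integer> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    results.put(urls.get(i), -1);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With this shape, one slow server only ties up a single worker thread while the rest of the pool keeps going.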
This may or may not help, but you might want to change your request method to HEAD instead of using the default, which is GET:
urlConn.setRequestMethod("HEAD");
This tells the server that you only need the status line and response headers, not the response body.
The article "What Is a HTTP HEAD Request Good for" describes some uses for HEAD, including link verification:
[Head] asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.... This can be used for example for creating a faster link verification service.
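Putting the HEAD suggestion together with the code from the question might look roughly like this. HeadCheck and headStatus are hypothetical names, and the setReadTimeout call is my own addition: the connect timeout only covers establishing the connection, so a server that accepts the connection and then never answers can still stall getResponseCode() unless a read timeout is set as well.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class HeadCheck {

    /**
     * Sends a HEAD request to http://<address> and returns the response code.
     * Throws IOException on connection failure or timeout.
     */
    static int headStatus(String address, int timeoutMillis) throws IOException {
        URL url = new URL("http://" + address);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("HEAD");         // headers only, no body
            conn.setConnectTimeout(timeoutMillis); // limit time to establish the connection
            conn.setReadTimeout(timeoutMillis);    // limit time waiting for the reply
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Some servers do not handle HEAD well, so a code that looks wrong may be worth retrying with GET before marking the URL as dead.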