I am trying to find all the broken links on a webpage using Java. Here is the code:
private static boolean isLive(String link){
    HttpURLConnection urlconn = null;
    int res = -1;
    String msg = null;
    try{
        URL url = new URL(link);
        urlconn = (HttpURLConnection)url.openConnection();
        urlconn.setConnectTimeout(10000);
        urlconn.setRequestMethod("GET");
        urlconn.connect();
        String redirlink = urlconn.getHeaderField("Location");
        System.out.println(urlconn.getHeaderFields());
        if(redirlink != null && !url.toExternalForm().equals(redirlink))
            return isLive(redirlink);
        else
            return urlconn.getResponseCode()==HttpURLConnection.HTTP_OK;
    }catch(Exception e){
        System.out.println(e.getMessage());
        return false;
    }finally{
        if(urlconn != null)
            urlconn.disconnect();
    }
}

public static void main(String[] s){
    String link = "http://www.somefakesite.net";
    System.out.println(isLive(link));
}
The code is adapted from http://nscraps.com/Java/146-program-code-broken-link-checker.htm.
This code gives HTTP 200 status for all webpages including the broken ones. For example http://www.somefakesite.net/ gives the following header fields:
{null=[HTTP/1.1 200 OK], Date=[Sun, 15 May 2011 18:51:29 GMT], Transfer-Encoding=[chunked], Keep-Alive=[timeout=4, max=100], Connection=[Keep-Alive], Content-Type=[text/html], Server=[Apache/2.2.15 (Win32) PHP/5.2.12], X-Powered-By=[PHP/5.2.9-1]}
Such sites do not exist, so how can I classify them as broken links?
Maybe the issue is that many web servers and DNS providers now detect those "broken" links and redirect you to their own "not found" pages, which are served with a 200 status.
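One way to check whether your resolver is doing this is to look up a hostname that should not exist and compare its addresses with those of the link being tested. The following is only a rough sketch of that idea, not a definitive detector. The class name `WildcardDnsProbe`, the helper names, and the choice of a random label under `.com` as the "should not exist" probe are all assumptions made for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class WildcardDnsProbe {

    // Resolves a host to its textual IP addresses; an unknown host yields an empty set.
    static Set<String> resolve(String host) {
        Set<String> addresses = new HashSet<>();
        try {
            for (InetAddress a : InetAddress.getAllByName(host)) {
                addresses.add(a.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // The resolver reported that the host does not exist: leave the set empty.
        }
        return addresses;
    }

    // Builds a random hostname that is very unlikely to be registered (assumption).
    static String randomHost(String suffix) {
        return "probe-" + UUID.randomUUID() + "." + suffix;
    }

    // True if the link resolves to an address that a nonexistent probe host also got,
    // which suggests the resolver answers every failed lookup with its own server.
    static boolean looksHijacked(Set<String> linkAddresses, Set<String> probeAddresses) {
        return !Collections.disjoint(linkAddresses, probeAddresses);
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "www.somefakesite.net";
        Set<String> probe = resolve(randomHost("com"));
        Set<String> link = resolve(host);
        System.out.println("probe addresses: " + probe);
        System.out.println("link addresses:  " + link);
        System.out.println("looks hijacked:  " + looksHijacked(link, probe));
    }
}
```

If the probe set comes back empty, the resolver is reporting missing hosts honestly, and a lookup failure on the link itself is already a good sign that the link is broken. This check says nothing about web servers that return their own 200 "not found" page.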
Test your code against a URL that you know sends the 404 code (one where the browser shows its own original error message).
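To make that test repeatable without depending on any external site, you could point the check at a throwaway local server whose responses you control. The sketch below assumes the JDK's built-in `com.sun.net.httpserver.HttpServer` is available; the class name `KnownStatusCheck`, the helpers `statusOf` and `startTestServer`, and the paths `/ok` and `/missing` are made up for this example. It also turns off automatic redirect following so that you see the status of the first response rather than of whatever page you were redirected to.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class KnownStatusCheck {

    // Returns the status code of the first response, without following redirects.
    static int statusOf(String link) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        conn.setRequestMethod("GET");
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Starts a local server on a free port: "/ok" answers 200, "/missing" answers 404.
    static HttpServer startTestServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/ok", exchange -> {
            exchange.sendResponseHeaders(200, -1); // -1: no response body
            exchange.close();
        });
        server.createContext("/missing", exchange -> {
            exchange.sendResponseHeaders(404, -1);
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startTestServer();
        try {
            String base = "http://127.0.0.1:" + server.getAddress().getPort();
            System.out.println("/ok      -> " + statusOf(base + "/ok"));
            System.out.println("/missing -> " + statusOf(base + "/missing"));
        } finally {
            server.stop(0);
        }
    }
}
```

If the checker reports 404 here but 200 for http://www.somefakesite.net, the code is reading status codes correctly and the 200 is coming from whoever answers for that name.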
EDIT to answer the comment by the author (as it is too long to fit in a comment): I do not see an easy answer for your problem, but there are several different types of failures: