
Check for broken links

I am trying to find all the broken links on a webpage using Java. Here is the code:

    import java.net.HttpURLConnection;
    import java.net.URL;

    // The class name is chosen here only so that the snippet compiles on its own.
    public class LinkChecker {

        // Returns true if the link (or the page it redirects to) answers with HTTP 200.
        private static boolean isLive(String link) {
            HttpURLConnection urlconn = null;
            try {
                URL url = new URL(link);
                urlconn = (HttpURLConnection) url.openConnection();
                urlconn.setConnectTimeout(10000);
                urlconn.setRequestMethod("GET");
                urlconn.connect();
                String redirlink = urlconn.getHeaderField("Location");
                // Dump all response headers for debugging.
                System.out.println(urlconn.getHeaderFields());
                if (redirlink != null && !url.toExternalForm().equals(redirlink))
                    return isLive(redirlink); // check the redirect target instead
                else
                    return urlconn.getResponseCode() == HttpURLConnection.HTTP_OK;
            } catch (Exception e) {
                // Any failure (bad URL, DNS error, timeout) counts as "not live".
                System.out.println(e.getMessage());
                return false;
            } finally {
                if (urlconn != null)
                    urlconn.disconnect();
            }
        }

        public static void main(String[] s) {
            String link = "http://www.somefakesite.net";
            System.out.println(isLive(link));
        }
    }

The code is adapted from http://nscraps.com/Java/146-program-code-broken-link-checker.htm.

This code reports an HTTP 200 status for all webpages, including the broken ones. For example, http://www.somefakesite.net/ gives the following header fields:

{null=[HTTP/1.1 200 OK], Date=[Sun, 15 May 2011 18:51:29 GMT], Transfer-Encoding=[chunked], Keep-Alive=[timeout=4, max=100], Connection=[Keep-Alive], Content-Type=[text/html], Server=[Apache/2.2.15 (Win32) PHP/5.2.12], X-Powered-By=[PHP/5.2.9-1]}

Such sites do not exist, so how can I classify these links as broken?

user754740 asked May 15 '11 18:05



1 Answer

The issue may be that many web servers and DNS providers now detect such "broken" links and redirect you to their own "not found" pages.

Test it against a URL that you know returns a 404 status code (one for which the browser shows its own error message).
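
As a rough way to run that test without depending on any public site, the sketch below starts a throwaway local server that answers 404 to every request and reads the status code back. It assumes the JDK's bundled `com.sun.net.httpserver.HttpServer` is available; the class name `Known404Check` and the methods `statusOf` and `statusOfLocal404` are hypothetical names introduced for illustration, not part of the original code.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

// Hypothetical helper class, for illustration only.
public class Known404Check {

    // Returns the raw HTTP status code of the link, without following redirects.
    static int statusOf(String link) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        conn.setRequestMethod("GET");
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Starts a local server that answers 404 to every request,
    // asks statusOf for the code it sees, then stops the server.
    static int statusOfLocal404() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            exchange.sendResponseHeaders(404, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            return statusOf("http://127.0.0.1:" + port + "/missing");
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(statusOfLocal404());
    }
}
```

Running `main` should print 404. If the same `statusOf` call reports 200 for a host that does not exist, the 200 is probably coming from a provider's substitute page rather than from the checker itself.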


EDIT to answer the comment by the author (as it is too long to fit in a comment): I do not see an easy answer to your problem, but there are several different types of failure:

  • DNS failures that are redirected (the hostname cannot be resolved by the DNS, and you are sent to another page instead): all such redirections will likely land on the same page, supplied by your ISP or DNS provider, so you can check for that page. Of course, with a different ISP or DNS provider the page might be different. If you are not redirected, you will get a connection error.
  • A server with valid DNS records that is not working (for example, google.com goes down): there should be a connection error.
  • A resource ("page") missing on a server: this is more difficult. A 404 means the link is broken, but if the server does not send one there is little more you can do. A redirection can be used to flag a link as dubious, but it should be checked manually later, because redirections are not used only for catching missing links (for example, www.google.com redirects me to www.google.es).
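
The failure types above could be told apart roughly as in the following sketch. It is only one possible arrangement: the class `LinkClassifier`, the `Status` enum and the method names are hypothetical, the links are assumed to be plain http or https URLs, and automatic redirect following is switched off so that a 3xx answer can be flagged for manual review instead of being followed silently.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.UnknownHostException;

// Hypothetical classifier, for illustration only.
public class LinkClassifier {

    enum Status { OK, REDIRECT, MISSING, HTTP_ERROR, DNS_FAILURE, CONNECTION_ERROR }

    // Maps a raw HTTP status code to a category.
    static Status fromCode(int code) {
        if (code >= 200 && code < 300) return Status.OK;
        if (code >= 300 && code < 400) return Status.REDIRECT; // dubious: check manually
        if (code == HttpURLConnection.HTTP_NOT_FOUND) return Status.MISSING;
        return Status.HTTP_ERROR;
    }

    // Maps a network failure to a category.
    static Status fromException(IOException e) {
        // The hostname could not be resolved and nobody answered on its behalf.
        if (e instanceof UnknownHostException) return Status.DNS_FAILURE;
        // Refused connections, timeouts, malformed URLs and other I/O errors.
        return Status.CONNECTION_ERROR;
    }

    // Contacts the link once and classifies the outcome.
    static Status classify(String link) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(link).openConnection();
            conn.setInstanceFollowRedirects(false);
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);
            conn.setRequestMethod("GET");
            return fromCode(conn.getResponseCode());
        } catch (IOException e) {
            return fromException(e);
        } finally {
            if (conn != null) conn.disconnect();
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("http://www.somefakesite.net"));
    }
}
```

A link that comes back as `OK` can still be a provider's substitute page, so this does not solve the first case by itself; it only separates the outcomes that the HTTP layer can actually distinguish.
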
SJuan76 answered Sep 19 '22 18:09
