 

How do I check for valid (not dead) links programmatically using PHP?

Tags: url, php, cron

Given a list of urls, I would like to check that each url:

  • Returns a 200 OK status code
  • Returns a response within X amount of time

The end goal is a system that is capable of flagging urls as potentially broken so that an administrator can review them.

The script will be written in PHP and will most likely run on a daily basis via cron.

The script will be processing approximately 1000 urls at a go.

Question has two parts:

  • Are there any major gotchas with an operation like this? What issues have you run into?
  • What is the best method for checking the status of a url in PHP considering both accuracy and performance?
Asked by GloryFish, Oct 28 '08 19:10





1 Answer

Use the PHP cURL extension. Unlike a plain fopen() call, it can also make HTTP HEAD requests, which are sufficient to check the availability of a URL and save a lot of bandwidth, because you don't have to download the entire body of the page.

As a starting point you could use some function like this:

function is_available($url, $timeout = 30) {
    $ch = curl_init(); // get cURL handle

    // set cURL options
    $opts = array(CURLOPT_RETURNTRANSFER => true, // do not output to browser
                  CURLOPT_URL => $url,            // set URL
                  CURLOPT_NOBODY => true,         // do a HEAD request only
                  CURLOPT_TIMEOUT => $timeout);   // set timeout
    curl_setopt_array($ch, $opts); 

    curl_exec($ch); // do it!

    $retval = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200; // check if HTTP OK

    curl_close($ch); // close handle

    return $retval;
}

However, there are plenty of possible optimizations: you might want to re-use the cURL handle and, if checking more than one URL per host, even re-use the connection.
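As a rough sketch of the handle re-use idea, the loop below creates one cURL handle, sets the shared options once, and only swaps the URL per request. The function name check_urls() and its return shape are made up for illustration. Because the handle keeps its connection cache between requests, consecutive URLs on the same host can share a connection when the server allows keep-alive.

```php
<?php
/**
 * Check many URLs with a single shared cURL handle (hypothetical helper).
 * Returns an array mapping each URL to true (HTTP 200) or false.
 */
function check_urls(array $urls, int $timeout = 30): array
{
    $results = [];
    if ($urls === []) {
        return $results; // nothing to check, no handle needed
    }

    $ch = curl_init();
    // Options shared by every request are set only once.
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,     // do not output to browser
        CURLOPT_NOBODY         => true,     // HEAD request only
        CURLOPT_TIMEOUT        => $timeout, // per-request timeout in seconds
    ]);

    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url); // only the URL changes per request
        $ok = curl_exec($ch) !== false;      // false on timeout, DNS failure, etc.
        $results[$url] = $ok && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
    }

    curl_close($ch);
    return $results;
}
```

Sorting the URL list by host before calling it would probably increase the chance that a connection is re-used, since requests to the same host then run back to back.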

Note that is_available() checks strictly for HTTP response code 200. It does not follow redirects (such as 301 or 302), but there is a cURL option for that: CURLOPT_FOLLOWLOCATION.
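If redirects should count as working links, one possible variation is to let cURL follow them and then look at the final status code. This is only a sketch: final_status() and needs_review() are hypothetical names, and the limit of 5 redirects is an arbitrary assumption, not a recommendation from the original answer.

```php
<?php
/**
 * Return the HTTP status code after following redirects (hypothetical helper).
 * Returns 0 if no HTTP response was received at all (timeout, DNS failure, ...).
 */
function final_status(string $url, int $timeout = 30, int $maxRedirects = 5): int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,          // do not output to browser
        CURLOPT_NOBODY         => true,          // HEAD request only
        CURLOPT_FOLLOWLOCATION => true,          // follow Location: headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // assumed limit to stop redirect loops
        CURLOPT_TIMEOUT        => $timeout,      // timeout in seconds
    ]);
    curl_exec($ch);
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE); // last status code received
    curl_close($ch);
    return $code;
}

/**
 * Decide whether a URL should be flagged for an administrator (hypothetical helper).
 * Anything other than a final 200 is treated as potentially broken.
 */
function needs_review(int $statusCode): bool
{
    return $statusCode !== 200;
}
```

Usage would look like needs_review(final_status($url)). Some servers answer HEAD requests differently from GET requests, so a flagged URL is a candidate for review rather than proof of a dead link.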

Answered by Henning, Oct 04 '22 05:10
