I'm trying to be a bit sneeky and as part of a learning process try and improve my page scraping skills.
One thing i've come across that I have yet to be able to solve is that certain sites will use an internal link which then redirects to an external link.
What I want to do is modify some curl code to follow the redirects until they stop and then obtain the final resting place URL.
Anyone recommend some code for me?
I have this at the moment, but it's not following the redirects properly at the moment.
$opts = array(CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => true,
CURLOPT_FOLLOWLOCATION => true);
$curl = curl_init();
curl_setopt_array($curl, $opts);
$str = curl_exec($curl);
curl_close($curl);
To follow redirect with Curl, use the -L or --location command-line option. This flag tells Curl to resend the request to the new address. When you send a POST request, and the server responds with one of the codes 301, 302, or 303, Curl will make the subsequent request using the GET method.
In HTTP, redirection is triggered by a server sending a special redirect response to a request. Redirect responses have status codes that start with 3 , and a Location header holding the URL to redirect to. When browsers receive a redirect, they immediately load the new URL provided in the Location header.
http.//php.net/manual/en/ref.curl.php
function get_final_url( $url, $timeout = 5 )
{
$url = str_replace( "&", "&", urldecode(trim($url)) );
$cookie = tempnam ("/tmp", "CURLCOOKIE");
$ch = curl_init();
curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_ENCODING, "" );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
$content = curl_exec( $ch );
$response = curl_getinfo( $ch );
curl_close ( $ch );
if ($response['http_code'] == 301 || $response['http_code'] == 302)
{
ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");
$headers = get_headers($response['url']);
$location = "";
foreach( $headers as $value )
{
if ( substr( strtolower($value), 0, 9 ) == "location:" )
return get_final_url( trim( substr( $value, 9, strlen($value) ) ) );
}
}
if ( preg_match("/window\.location\.replace\('(.*)'\)/i", $content, $value) ||
preg_match("/window\.location\=\"(.*)\"/i", $content, $value)
)
{
return get_final_url ( $value[1] );
}
else
{
return $response['url'];
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With