
cURL get url from redirect

I'm using cURL to get the URL that a redirect points to, for a website scraper. I only need the target URL, not the page itself. I've researched this on Stack Overflow and other sites for the past couple of days without success. The code I'm currently using is from this website:

  $url = "http://www.someredirect.com";
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1');         
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_HEADER, true);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
  curl_setopt($ch, CURLOPT_NOBODY, true);
  $response = curl_exec($ch);
  preg_match_all('/^Location:(.*)$/mi', $response, $matches);
  curl_close($ch);
  echo !empty($matches[1]) ? trim($matches[1][0]) : 'No redirect found';

Any help would be greatly appreciated!

asked Jun 10 '13 14:06 by Josh


2 Answers

In your particular case, the server is checking for certain user-agent strings.

A server that checks the User-Agent string responds with a 302 redirect status code only when it sees a user-agent it considers "valid". Requests with any other user-agent receive neither the 302 status code nor the Location: header.

Here, when the server receives a request from an "invalid" user-agent, it responds with a 200 OK status code and an empty response body.
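To see which of the two cases you are hitting, it can help to look at the status line and the Location: header together. The following is a minimal sketch, assuming PHP 7 or later and that you already have the raw header text that curl_exec() returns when CURLOPT_HEADER is enabled. The function name summarize_redirect is hypothetical and not part of cURL or the question's code.

```php
<?php

/**
 * Summarise raw response headers: the HTTP status code and the Location value.
 * Hypothetical helper for illustration only.
 */
function summarize_redirect(string $rawHeaders): array
{
    $status = null;
    $location = null;

    // Status line, e.g. "HTTP/1.1 302 Found" or "HTTP/2 200".
    if (preg_match('/^HTTP\/\S+\s+(\d{3})/m', $rawHeaders, $m)) {
        $status = (int) $m[1];
    }

    // Header names are case-insensitive, so match "Location" with the i flag.
    // The trailing class strips the carriage return left by CRLF line endings.
    if (preg_match('/^Location:[ \t]*(.*?)[ \t\r]*$/mi', $rawHeaders, $m)) {
        $location = $m[1];
    }

    return ['status' => $status, 'location' => $location];
}
```

A status of 200 with a null location would suggest the server did not accept the user-agent; a 302 with a location set is the redirect you are after.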

(Note: in the code below, the actual URLs provided have been replaced with examples.)

Let's say that http://www.example.com's server checks the User-Agent string and that http://www.example.com/product/123/ redirects to http://www.example.org/abc.

In PHP your solution would be:

<?php

$url = 'http://www.example.com/product/123/';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0"); // Necessary. The server checks for a valid User-Agent.

$response = curl_exec($ch);
preg_match_all('/^Location:(.*)$/mi', $response, $matches);
curl_close($ch);

echo !empty($matches[1]) ? trim($matches[1][0]) : 'No redirect found';

The output of this script would be: http://www.example.org/abc
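As an alternative to parsing the headers with a regular expression, cURL can report the redirect target itself through CURLINFO_REDIRECT_URL when CURLOPT_FOLLOWLOCATION is disabled. This is a sketch rather than a drop-in fix, assuming PHP 7.1 or later and a cURL extension that exposes that constant. The function name get_redirect_target and its parameters are hypothetical.

```php
<?php

/**
 * Return the URL a single request would be redirected to, or null if there
 * is no redirect or the transfer fails. Hypothetical helper for illustration.
 */
function get_redirect_target(string $url, string $userAgent): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // report the redirect, do not follow it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // keep the response out of the output
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // the server in question checks this

    if (curl_exec($ch) === false) {
        return null; // transfer failed
    }

    // With FOLLOWLOCATION off, this is the URL that would be requested next.
    $target = curl_getinfo($ch, CURLINFO_REDIRECT_URL);

    return (is_string($target) && $target !== '') ? $target : null;
}
```

With the example above, you would call get_redirect_target('http://www.example.com/product/123/', $userAgent) with the same Firefox user-agent string and echo the result.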

answered Oct 31 '22 13:10 by cmt


Try this function, which follows the redirects manually (up to five by default) and returns the last URL:

function curl_last_url(/*resource*/ $ch, /*int*/ &$maxredirect = null) { 
    $mr = $maxredirect === null ? 5 : intval($maxredirect); 
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); 
    // Initialise $newurl up front so the final return is defined even when $mr <= 0.
    $newurl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); 
    if ($mr > 0) { 

        $rch = curl_copy_handle($ch); 
        curl_setopt($rch, CURLOPT_HEADER, true); 
        curl_setopt($rch, CURLOPT_NOBODY, true); 
        curl_setopt($rch, CURLOPT_FORBID_REUSE, false); 
        curl_setopt($rch, CURLOPT_RETURNTRANSFER, true); 
        do { 
            curl_setopt($rch, CURLOPT_URL, $newurl); 
            $header = curl_exec($rch); 
            if (curl_errno($rch)) { 
                $code = 0; 
            } else { 
                $code = curl_getinfo($rch, CURLINFO_HTTP_CODE); 
                // 303, 307 and 308 are redirects too, and header names are case-insensitive.
                if (in_array($code, array(301, 302, 303, 307, 308), true)) { 
                    preg_match('/^Location:(.*?)$/mi', $header, $matches); 
                    $newurl = trim(array_pop($matches)); 
                } else { 
                    $code = 0; 
                } 
            } 
        } while ($code && --$mr); 
        curl_close($rch); 
        if (!$mr) { 
            if ($maxredirect === null) { 
                trigger_error('Too many redirects. When following redirects, libcurl hit the maximum amount.', E_USER_WARNING); 
            } else { 
                $maxredirect = 0; 
            } 
            return false; 
        } 
        curl_setopt($ch, CURLOPT_URL, $newurl); 
    } 
    return $newurl; 
}

answered Oct 31 '22 12:10 by bukvoed