
Why does cURL return an empty string?

I'm having a problem with PHP's cURL returning an empty string for some URLs. I'm trying to parse the OG metadata of different webpages, and it works with every website I've tried except NYTimes. Here is my code so far.

print_r(get_og_metadata('http://somewebsite.com'));


public function get_data($url)
{
    $ch = curl_init();
    $timeout = 5;
    // the url to fetch
    curl_setopt($ch, CURLOPT_URL, $url);
    // return result as a string rather than direct output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // set max time to wait while connecting (not the total execution time)
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

public function get_og_metadata($url)
{
    libxml_use_internal_errors(TRUE);
    $data = $this->get_data($url);
    $doc = new DOMDocument();
    $doc->loadHTML($data);

    $xpath = new DOMXPath($doc);
    $query = '//*/meta[starts-with(@property, \'og:\')]';

    $metadatas = $xpath->query($query);
    $result = array();
    foreach($metadatas as $metadata)
    {
        $property = $metadata->getAttribute('property');
        $content = $metadata->getAttribute('content');
        $result[$property] = $content;
    }

    return $result;
}
Nick asked Feb 04 '13 03:02


3 Answers

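These five options fixed it for me: they send a browser-like user agent, follow redirects (setting the Referer automatically), return the response as a string, and turn on verbose logging.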

   curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
   curl_setopt($ch, CURLOPT_AUTOREFERER, true); 
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
   curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
   curl_setopt($ch, CURLOPT_VERBOSE, 1);
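
To see why `curl_exec()` comes back empty in the first place, it can help to surface cURL's own error message. A minimal sketch, assuming a hypothetical helper `fetch_html()` (not part of the question's code) that applies the options above and reports failures instead of silently returning nothing. The redirect cap of 10 is an assumed value.

```php
<?php
// Hypothetical helper: returns ['body' => ?string, 'error' => ?string].
function fetch_html(string $url, int $timeout = 5): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow 3xx redirects
        CURLOPT_AUTOREFERER    => true,     // set Referer when following a redirect
        CURLOPT_MAXREDIRS      => 10,       // assumed cap to avoid redirect loops
        CURLOPT_CONNECTTIMEOUT => $timeout, // seconds to wait while connecting
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17',
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_exec() returns false on failure; report cURL's error code and message.
        return ['body' => null, 'error' => curl_errno($ch) . ': ' . curl_error($ch)];
    }

    // On PHP 8+ the handle is released when $ch goes out of scope;
    // on older versions, call curl_close($ch) before returning.
    return ['body' => $body, 'error' => null];
}
```

If `$result['error']` is set after `$result = fetch_html($url);`, the message usually points at the cause (timeout, too many redirects, certificate problem) before any HTML parsing is attempted.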
Abhishek Goel answered Sep 29 '22 11:09


My guess is that a site like the New York Times has protection against automated requests. Most likely this is based on the user agent, which you can fake like so:

curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');

This was one of the most common user agent strings at the time, by the way.
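
Whether the user agent is really the deciding factor can be checked by comparing the HTTP status with and without it. A rough sketch; `probe_status()` is a hypothetical helper, not something from the question:

```php
<?php
// Hypothetical helper: returns the HTTP status for $url, or 0 if no response arrived.
function probe_status(string $url, ?string $userAgent = null): int
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body out of the output
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    if ($userAgent !== null) {
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    }
    curl_exec($ch);

    // CURLINFO_HTTP_CODE is the last received status code (0 when there was none).
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
```

Calling `probe_status($url)` and `probe_status($url, $agent)` for the same page and comparing the two numbers would show whether the server answers differently; something like 403 versus 200 would support the user agent theory, while identical codes would suggest looking elsewhere.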

ZirconCode answered Sep 29 '22 12:09


(That other answer is also mine.)

This is what did it for me. cURL was failing on SSL certificate verification, which I happened not to need in this specific case.

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
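
If verification fails because cURL cannot find a CA bundle, an alternative to switching the checks off is to leave them on and supply the bundle explicitly. A minimal sketch; `apply_tls_options()` and the `/path/to/cacert.pem` location in the usage note are hypothetical:

```php
<?php
// Hypothetical helper: keeps certificate checks enabled on an existing cURL handle.
// Returns true only if every option was accepted.
function apply_tls_options($ch, ?string $caBundle = null): bool
{
    $ok = curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);  // verify the certificate chain
    $ok = curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2) && $ok; // verify the name matches the host
    if ($caBundle !== null) {
        // Path to a PEM file of trusted CA certificates (assumed to exist on your system).
        $ok = curl_setopt($ch, CURLOPT_CAINFO, $caBundle) && $ok;
    }
    return $ok;
}
```

It would be called before `curl_exec()`, for example `apply_tls_options($ch, '/path/to/cacert.pem');`.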
Michael Davidson answered Sep 29 '22 11:09