I am using PHP to scrape a page that seems to load content dynamically just milliseconds after the parent page finishes loading.
I am using cURL to fetch the page, and simpleHtmlDom to pull pieces out of the fetched HTML.
My attempts to traverse the DOM and explode() things out of the HTML return nothing. My only idea was that the content is being loaded after the parent page has finished loading.
Here is my code:
<?
$url = 'http://www.facebook.com/OneAndroidAppaDay';
$scrapeUrl = 'http://www.facebook.com/OneAndroidAppaDay';
include_once('simple_html_dom.php');
require_once("bitly.php");
$userAgent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$scrapeUrl);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
$appBitlyUrl = $html->find('div[class=UIStoryAttachment_Title]',0)->find('a',0)->href; // fail :(
echo 'Bitly Url: ' . $appBitlyUrl;
?>
It's bombing out at line 24 (denoted with the inline comment) with this error:
Fatal error: Call to a member function find() on a non-object in /home/xxxxxxxx/public_html/xxx.xx/xxxx.php on line 24
Is there a way to make it wait a second or two before it snatches the page's html? Or maybe someone has some better insight?
Thanks
Mark
To do a simple delay:
sleep(2); // 2 second delay before continuing
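Note that a delay on the server side won't help here: curl_exec() never executes JavaScript, so content injected after page load will not appear in the response no matter how long you wait. The fatal error itself has a simpler cause — with CURLOPT_RETURNTRANSFER set, curl_exec() returns a plain string, not an object, so calling $html->find() fails. The string has to be parsed first, e.g. with simple_html_dom's str_get_html($html), or with PHP's built-in DOMDocument as sketched below. This sketch uses a static HTML snippet standing in for the cURL response; the class name UIStoryAttachment_Title is taken from the question and may well have changed on the live page.

```php
<?php
// Stand-in for the string returned by curl_exec($ch).
$html = '<div class="UIStoryAttachment_Title"><a href="http://bit.ly/example">App</a></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // suppress warnings from imperfect real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

// Find the first <a> inside the attachment-title div and read its href.
$xpath = new DOMXPath($doc);
$links = $xpath->query('//div[@class="UIStoryAttachment_Title"]//a');
if ($links->length > 0) {
    echo 'Bitly Url: ' . $links->item(0)->getAttribute('href');
}
```

With simple_html_dom the equivalent one-line fix would be `$dom = str_get_html($html);` and then calling find() on $dom instead of on the raw string.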