
How to scrape iframe content using cURL

Goal: I want to scrape the word "Paris" inside an iframe using cURL.

Say you have a simple page containing an iframe:

<html>
<head>
<title>Curl into this page</title>
</head>
<body>

<iframe src="france.html" title="test" name="test"></iframe>

</body>
</html>

The iframe page:

<html>
<head>
<title>France</title>
</head>
<body>

<p>The Capital of France is: Paris</p>

</body>
</html>

My cURL script:

<?php

// 1. initialize

$ch = curl_init();

// 2. The URL containing the iframe

$url = "http://localhost/test/index.html";

// 3. set the options, including the url

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 2);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// 4. execute and fetch the resulting HTML output by putting into $output

$output = curl_exec($ch);

// 5. free up the curl handle

curl_close($ch);

// 6. Scrape for a single string/word ("Paris")

preg_match('~The Capital of France is:\s*(.*?)\s*</p>~si', $output, $match);

// 7. Display the scraped string, if one was found

if ($match) {
    echo "The Capital of France is: ".$match[1];
}

?>

Result = nothing!

Can someone help me find out the capital of France?! ;)

I need an example of:

  1. parsing/grabbing the iframe url
  2. curling the url (as I've done with the index.html page)
  3. parsing for the string "Paris"

Thanks!

ven asked Feb 22 '23 04:02


1 Answer

Note that sometimes, for a variety of reasons, the iframe URL can't be read outside the context of its own server, and requesting it directly with cURL returns some kind of "can't be read directly or externally" error message.

In these cases you can use curl_setopt($ch, CURLOPT_REFERER, $fullpageurl); (if you're in PHP and reading the text using curl_exec). The request then looks to the server as if it came from the original page, and you can read the source.

So if for whatever reason france.html couldn't be read outside the context of the larger page that includes it as an iframe, you can still get its source with the method above: set CURLOPT_REFERER to the main page (test/index.html in the original question).
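For the three steps asked about in the question, here is a minimal sketch. Keep in mind that cURL only returns the HTML of the URL you give it and does not load iframes itself, so the iframe page needs a second request. The function names (fetch_html, first_iframe_src, resolve_url, extract_capital, scrape_capital) are made up for this example, and resolve_url is deliberately simplified: it assumes the src is absolute, root-relative, or relative to the page's own directory.

```php
<?php
// Sketch only: all function names below are illustrative, not from the question.

// Fetch a URL with cURL, optionally sending a Referer header.
function fetch_html(string $url, ?string $referer = null): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    if ($referer !== null) {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? '' : $html;
}

// Step 1: return the src of the first <iframe>, or null if there is none.
function first_iframe_src(string $html): ?string
{
    if ($html === '') {
        return null;
    }
    $previous = libxml_use_internal_errors(true); // silence warnings on messy markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $iframes = $doc->getElementsByTagName('iframe');
    if ($iframes->length === 0) {
        return null;
    }
    $src = $iframes->item(0)->getAttribute('src');
    return $src === '' ? null : $src;
}

// Simplified resolver (assumption: no "../" segments or protocol-relative URLs).
function resolve_url(string $pageUrl, string $src): string
{
    if (parse_url($src, PHP_URL_SCHEME) !== null) {
        return $src; // already absolute
    }
    $p = parse_url($pageUrl);
    $origin = $p['scheme'] . '://' . $p['host'] . (isset($p['port']) ? ':' . $p['port'] : '');
    if (strncmp($src, '/', 1) === 0) {
        return $origin . $src; // root-relative
    }
    $path = $p['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1); // directory of the page
    return $origin . $dir . $src;
}

// Step 3: pull the word after "The Capital of France is:" out of the HTML.
function extract_capital(string $html): ?string
{
    if (preg_match('~The Capital of France is:\s*(.*?)\s*</p>~si', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Steps 1 to 3 together; step 2 sends the parent page as the referer.
function scrape_capital(string $pageUrl): ?string
{
    $src = first_iframe_src(fetch_html($pageUrl));
    if ($src === null) {
        return null;
    }
    return extract_capital(fetch_html(resolve_url($pageUrl, $src), $pageUrl));
}
```

With the pages from the question served locally, you would call scrape_capital("http://localhost/test/index.html") and echo the result. If the iframe page is readable on its own, the referer argument is harmless and can be left in.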

Barry answered Mar 30 '23 05:03