I want to enter a very long list of URLs and search the source code of each page for a specific string, outputting a list of the URLs that contain it. Sounds simple enough, right? I have come up with the code below; the input comes from an HTML form. You can try it at pelican-cement.com/findfrog.
It seems to work half the time, but is thrown off by multiple URLs or by URLs in a different order. Searching for 'adsense', it correctly identifies politics1.com out of
cnn.com
politics1.com
However, if the order is reversed, the output is blank. How can I get reliable, consistent results, preferably with something I could feed thousands of URLs into?
<html>
<body>
<?
set_time_limit (0);
$urls=explode("\n", $_POST['url']);
$allurls=count($urls);
for ( $counter = 0; $counter <= $allurls; $counter++) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,$urls[$counter]);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST,'GET');
    curl_setopt ($ch, CURLOPT_HEADER, 1);
    curl_exec ($ch);
    $curl_scraped_page=curl_exec($ch);
    $haystack=strtolower($curl_scraped_page);
    $needle=$_POST['proxy'];
    if (strlen(strstr($haystack,$needle))>0) {
        echo $urls[$counter];
        echo "<br/>";
        curl_close($ch);
    }
}
//$FileNameSQL = "/googleresearch" . abs(rand(0,1000000000000000)) . ".csv";
//$query = "SELECT * FROM happyturtle INTO OUTFILE '$FileNameSQL' FIELDS TERMINATED BY ','";
//$result = mysql_query($query) or die(mysql_error());
//exit;
echo '$FileNameSQL';
?>
</body>
</html>
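I reorganized your code a bit. The main culprit was whitespace: a textarea submits its line breaks as "\r\n", so explode("\n", ...) leaves a trailing "\r" on every URL except the last one. That is why politics1.com was only found when it came last. You need to trim each URL string before using it (i.e. pass trim($url) to cURL).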
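Other changes: the cURL handle is created once and reused for every URL, the options are set in one call with curl_setopt_array, the loop is a foreach, curl_exec is called once per URL instead of twice, and the match uses stristr, so it is case-insensitive.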
The code below can be run on my quick mockup.
<html>
<body>
<form action="search.php" method="post">
    URLs: <br/>
    <textarea rows="20" cols="50" name="url"></textarea><br/>
    Search Term: <br/>
    <textarea rows="20" cols="50" name="proxy"></textarea><br/>
    <input type="submit" />
</form>
<?php
if (isset($_POST['url'], $_POST['proxy'])) {
    set_time_limit(0);
    $urls = explode("\n", $_POST['url']);
    $term = trim($_POST['proxy']); // the term comes from a textarea too
    $options = array(
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_CUSTOMREQUEST  => 'GET',
        CURLOPT_HEADER         => 1,
    );
    $ch = curl_init(); // one handle, reused for every URL
    curl_setopt_array($ch, $options);
    foreach ($urls as $url) {
        $url = trim($url); // strip the "\r" left behind by explode("\n")
        if ($url === '') {
            continue; // skip blank lines
        }
        curl_setopt($ch, CURLOPT_URL, $url);
        $html = curl_exec($ch);
        if ($html !== FALSE && stristr($html, $term) !== FALSE) { // Found!
            echo htmlspecialchars($url), "<br/>";
        }
    }
    curl_close($ch);
}
?>
</body>
</html>
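To see why the order of the URLs mattered, here is a minimal sketch of the whitespace problem. The helper name split_urls is hypothetical (it is not part of the code above), and the sample input assumes the browser sends textarea line breaks as "\r\n":

```php
<?php
// Hypothetical helper, for illustration only: turn raw textarea input
// into a clean list of URLs.
function split_urls(string $raw): array
{
    $lines = preg_split('/\R/', $raw);   // split on any line-break sequence
    $lines = array_map('trim', $lines);  // drop stray spaces, tabs and "\r"
    // Remove blank lines and reindex the array from 0.
    return array_values(array_filter($lines, function ($line) {
        return $line !== '';
    }));
}

// Assumed sample input: two URLs separated by a CRLF line break.
$raw = "cnn.com\r\npolitics1.com";

// The naive split keeps the carriage return on the first entry,
// so $naive[0] is "cnn.com\r" rather than "cnn.com".
$naive = explode("\n", $raw);

$clean = split_urls($raw);
```

With the naive split, only the last line is free of a trailing "\r", which matches the symptom in the question: the URL was found only when it was listed last.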
Perhaps you should call
curl_close($ch);
regardless of whether it finds the string in the scraped page or not. Aside from that, I don't see anything obviously wrong with the code.
If it's not something in the code, then it's probably some difference in the scraped page. Maybe the page is dynamic and doesn't always contain the needle word on subsequent checks, or maybe the server of the page you were trying to scrape returned an error code.
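One way to check the error-code theory is to look at what cURL reports for each request. This is only a sketch: the function names describe_fetch and fetch_status are made up for illustration, and the 15-second timeout is an arbitrary assumption.

```php
<?php
// Hypothetical helper: turn the raw results of one request into a short status.
function describe_fetch($body, int $httpCode, string $curlError): string
{
    if ($body === false) {
        return "transfer failed: " . $curlError;  // DNS failure, timeout, bad URL...
    }
    if ($httpCode >= 400) {
        return "HTTP error " . $httpCode;         // the server answered with an error
    }
    return "ok";
}

// Hypothetical wrapper: fetch one URL and report what happened.
function fetch_status(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 15,  // assumed value, adjust as needed
    ));
    $body   = curl_exec($ch);
    $status = describe_fetch(
        $body,
        (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        curl_error($ch)
    );
    curl_close($ch);  // always closed, whether or not the fetch succeeded
    return $status;
}
```

Calling fetch_status() on a URL that unexpectedly produces no match should show whether the page was fetched at all or whether the request itself failed.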