Code to find strings in source code over many urls

Question

I want to enter a very long list of URLs and search for specific strings within each page's source code, outputting a list of the URLs that contain the string. Sounds simple enough, right? I have come up with the code below; the input comes from an HTML form. You can try it at pelican-cement.com/findfrog.

It seems to work half the time, but it is thrown off by multiple URLs or by URLs in a different order. Searching for 'adsense', it correctly identifies politics1.com out of

cnn.com
politics1.com

However, if the order is reversed, the output is blank. How can I get reliable, consistent results, preferably with something I could feed thousands of URLs into?

<html>
<body>

<?
set_time_limit (0);

$urls=explode("\n", $_POST['url']);

$allurls=count($urls);

for ( $counter = 0; $counter <= $allurls; $counter++) {

 $ch = curl_init();
 curl_setopt($ch, CURLOPT_URL,$urls[$counter]);
 curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 curl_setopt($ch, CURLOPT_CUSTOMREQUEST,'GET');
 curl_setopt ($ch, CURLOPT_HEADER, 1); 
 curl_exec ($ch); 
 $curl_scraped_page=curl_exec($ch); 

$haystack=strtolower($curl_scraped_page);
$needle=$_POST['proxy'];
if (strlen(strstr($haystack,$needle))>0) {

echo $urls[$counter];
echo "<br/>";
curl_close($ch);
}
}




//$FileNameSQL = "/googleresearch" .  abs(rand(0,1000000000000000))  .  ".csv";
//$query = "SELECT * FROM happyturtle INTO OUTFILE '$FileNameSQL' FIELDS TERMINATED BY ','";
//$result = mysql_query($query) or die(mysql_error());

//exit;

echo '$FileNameSQL';





?>

</body>
</html>

James An · Accepted Answer

I reorganized your code a bit. The main culprit was whitespace: you need to trim each URL string before using it (i.e. trim($url)).
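One likely explanation for the order-dependent results (an assumption, not something confirmed in the post) is that browsers normally submit textarea line breaks as "\r\n". Splitting on "\n" alone then leaves a trailing "\r" on every URL except the last one, so only the last URL is fetched correctly. A minimal sketch, with a hypothetical $posted string standing in for $_POST['url']:

```php
<?php
// Hypothetical stand-in for $_POST['url'] as a browser would typically send it.
$posted = "cnn.com\r\npolitics1.com";

$raw = explode("\n", $posted);
// $raw[0] is "cnn.com\r": the stray carriage return becomes part of the URL.
// $raw[1] is "politics1.com": the last line has nothing after it.

// trim() strips the carriage return (and any other surrounding whitespace).
$clean = array_map('trim', $raw);

var_dump($raw[0] === "cnn.com");   // bool(false)
var_dump($clean[0] === "cnn.com"); // bool(true)
```

With the list reversed, politics1.com would be the entry carrying the "\r", which matches the blank output described in the question.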

Other changes:

Set your search term outside the for loop, since it never changes.
Set up the curl object outside the loop and reuse it by just changing the URL each time.
Use curl_setopt_array() to set multiple curl options in one statement.
Use a foreach loop, since you're iterating over the entire array anyway and the code is cleaner.
Use stripos(), which is more efficient than strstr() and is case-insensitive, so the strtolower() call is no longer needed.
Use the !== operator to prevent implicit type conversion (FALSE !== 0, but FALSE == 0).
Check the returned $html string, as curl_exec() can return FALSE if it fails.
Close the curl object at the end (i.e. outside the if statement too).
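The point about !== is easiest to see when the search term sits at the very start of the text. The snippet below is only an illustration; $html and $term are made-up values:

```php
<?php
// Hypothetical page body that starts with the search term.
$html = "AdSense snippet goes here";
$term = "adsense";

$pos = stripos($html, $term); // int(0): found at offset 0, ignoring case

$looseCheck  = ($pos != false);  // false, because 0 == false after type juggling
$strictCheck = ($pos !== false); // true, because int 0 and bool false differ in type

var_dump($pos, $looseCheck, $strictCheck);
// int(0), bool(false), bool(true)
```

A loose comparison would report "not found" for a match at offset 0, which is why the strict form is used.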

The code below can be run on my quick mockup.

<html>
<body>

<form action="search.php" method="post"> 
  URLs: <br/>
  <textarea rows="20" cols="50" name="url"></textarea><br/>

  Search Term: <br/>
  <textarea rows="20" cols="50" name="proxy"></textarea><br/>

  <input type="submit" /> 
</form>

<?php
  if(isset($_POST['url'])) {
    set_time_limit (0);

    $urls = explode("\n", $_POST['url']);
    $term = $_POST['proxy'];
    $options = array( CURLOPT_FOLLOWLOCATION => 1,
                      CURLOPT_RETURNTRANSFER => 1,
                      CURLOPT_CUSTOMREQUEST  => 'GET',
                      CURLOPT_HEADER         => 1,
                      );
    $ch = curl_init();
    curl_setopt_array($ch, $options);

    foreach ($urls as $url) {
      curl_setopt($ch, CURLOPT_URL, trim($url));
      $html = curl_exec($ch);

      if ($html !== FALSE && stripos($html, $term) !== FALSE) { // Found!
        echo htmlspecialchars(trim($url)), "<br/>"; // one escaped URL per line
      }
    }

    curl_close($ch);
  }
?>

</body>
</html>
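For the "thousands of URLs" part of the question, fetching pages one at a time may be slow. One possible approach, not part of the answer above, is to fetch in small concurrent batches with PHP's curl_multi_* functions. The function name findUrlsContaining(), the batch size, and the timeout below are all assumptions chosen for illustration:

```php
<?php
// Hypothetical helper: returns the URLs whose response body contains $term,
// fetching up to $batchSize pages concurrently.
function findUrlsContaining(array $urls, string $term, int $batchSize = 10): array
{
    if ($term === '') {
        return []; // an empty needle would match every page
    }

    // Trim each URL and drop blank lines.
    $urls = array_values(array_filter(array_map('trim', $urls), fn($u) => $u !== ''));
    $matches = [];

    foreach (array_chunk($urls, max(1, $batchSize)) as $batch) {
        $mh = curl_multi_init();
        $handles = [];

        foreach ($batch as $url) {
            $ch = curl_init($url);
            curl_setopt_array($ch, [
                CURLOPT_FOLLOWLOCATION => true,
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_TIMEOUT        => 30, // assumed limit, tune as needed
            ]);
            curl_multi_add_handle($mh, $ch);
            $handles[] = [$url, $ch];
        }

        // Drive all transfers in the batch until none is still running.
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running) {
                curl_multi_select($mh, 1.0); // wait for activity instead of spinning
            }
        } while ($running && $status === CURLM_OK);

        foreach ($handles as [$url, $ch]) {
            $html = curl_multi_getcontent($ch);
            if (is_string($html) && stripos($html, $term) !== false) {
                $matches[] = $url;
            }
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }

    return $matches;
}
```

It could be called as findUrlsContaining(explode("\n", $_POST['url']), $_POST['proxy']). Batching keeps the number of simultaneous connections bounded; for very large lists a command-line script may still be a better fit than a web request.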

profitphp · Answer

Perhaps you should call

curl_close($ch);

regardless of whether it finds the string in the scraped page. Aside from that, I don't see anything obviously wrong with the code.

If it's not something in the code, then it's probably some difference in the scraped page. Maybe the page is dynamic and doesn't always contain the needle word on subsequent checks. Maybe the server of the page you were trying to scrape returned an error code.
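To tell those cases apart, the transfer error and the HTTP status can be inspected after each request. The following is a rough sketch only; fetchBody() is a hypothetical helper and the timeouts are assumed values:

```php
<?php
// Hypothetical helper: returns the page body, or null if the transfer failed
// or the server answered with a non-2xx status.
function fetchBody(string $url): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_CONNECTTIMEOUT => 10, // assumed limits, tune as needed
        CURLOPT_TIMEOUT        => 30,
    ]);

    $body   = curl_exec($ch);
    $error  = curl_error($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch); // closed on every path, match or no match

    if ($body === false) {
        error_log("$url: curl error: $error");
        return null;
    }
    if ($status < 200 || $status >= 300) {
        error_log("$url: HTTP status $status");
        return null;
    }
    return $body;
}
```

Logging the reason for each skipped URL makes it easier to see whether a missing result comes from the page content or from a failed request.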

Tags: php, curl, web-scraping, explode, strstr

user586011
