
How to implement a web scraper in PHP? [closed]

What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?

Chaz Lever asked Aug 25 '08 21:08





2 Answers

Scraping generally involves three steps:

  • First, you send a GET or POST request to a specified URL.
  • Next, you receive the HTML that is returned as the response.
  • Finally, you parse the text you'd like to scrape out of that HTML.
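
As a rough sketch of how the three steps fit together, here is a minimal version using only built-in functions. The names fetchHtml and extractTitle are my own for illustration, not part of any library, and the demo runs the parsing step on an inline string so it works without network access.

```php
<?php
// Steps 1 and 2: send a GET request and return the response body.
// Returns null if the transfer fails.
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Step 3: pull the text of the <title> element out of the HTML.
function extractTitle(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}

// Demo on an inline document (hypothetical sample data).
$sample = '<html><head><title> Example &amp; Co </title></head><body></body></html>';
echo extractTitle($sample), "\n"; // prints "Example & Co"
```

Against a live page you would pass the result of fetchHtml($url) to extractTitle() after checking that it is not null.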

To accomplish steps 1 and 2, below is a simple PHP class that uses cURL to fetch web pages with either GET or POST. Once you have the HTML back, you use regular expressions to accomplish step 3 and parse out the text you'd like to scrape.

For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial

My favorite program for working with regexes is RegexBuddy. I would advise you to try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regexes you write, in your language of choice (including PHP).

Usage:

  

$curl = new Curl();
$html = $curl->get("http://www.google.com");

// now, do your regex work against $html
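
For instance, the regex step might look something like the sketch below. The helper getAllMatches is a hypothetical standalone version of the class's getAll() method, and the $html string is made-up sample data standing in for a real response.

```php
<?php
// Return every first-capture-group match of $pattern in $html,
// the same shape the class's getAll() method returns.
function getAllMatches(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $matches);
    return $matches[1] ?? [];
}

// Hypothetical sample standing in for the $html returned by $curl->get().
$html = '<ul>'
      . '<li><a href="/docs">Docs</a></li>'
      . '<li><a class="ext" href="https://example.com/blog">Blog</a></li>'
      . '</ul>';

// Capture the href value of each anchor tag.
$links = getAllMatches('/<a\s[^>]*href="([^"]*)"/i', $html);
// $links is now ['/docs', 'https://example.com/blog']
print_r($links);
```

A pattern like this assumes double-quoted attributes, so it is only as reliable as the markup is regular.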

PHP Class:

<?php

class Curl
{
    public $cookieJar = "";
    public $curl; // cURL handle, created by get() and postForm()

    public function __construct($cookieJarFile = 'cookies.txt')
    {
        $this->cookieJar = $cookieJarFile;
    }

    // Apply browser-like headers, cookie handling, and redirect options to the handle.
    function setup()
    {
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.

        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    // Fetch a page with GET and return the response body.
    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();

        return $this->request();
    }

    // Return every match of the first capture group of $reg in $str.
    function getAll($reg, $str)
    {
        preg_match_all($reg, $str, $matches);
        return $matches[1];
    }

    // Submit $fields to $url with POST and return the response body.
    function postForm($url, $fields, $referer = '')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    // Pass 'lasturl' for the final URL after redirects, or any CURLINFO_* constant.
    function getInfo($info)
    {
        $info = ($info == 'lasturl')
            ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL)
            : curl_getinfo($this->curl, $info);
        return $info;
    }

    // Execute the prepared request; returns the body, or false on failure.
    function request()
    {
        return curl_exec($this->curl);
    }
}

?>
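
One thing to watch: request() returns whatever curl_exec() returns, which is false when the transfer fails. A minimal sketch of checking for that, assuming you want an exception rather than a silent false, could look like this. The function name fetchOrFail and the 10-second timeout are my own choices for illustration.

```php
<?php
// Fetch a URL and throw instead of silently returning false.
function fetchOrFail(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // assumed timeout, in seconds

    $body = curl_exec($ch);
    if ($body === false) {
        // Transport-level failure: DNS, refused connection, bad scheme, timeout.
        throw new RuntimeException('cURL error: ' . curl_error($ch), curl_errno($ch));
    }

    // The server answered, but possibly with an error page.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("HTTP status $status for $url");
    }
    return $body;
}

// An unsupported scheme fails without any network traffic,
// which makes the failure path easy to see.
try {
    fetchOrFail('nosuchscheme://example.invalid/');
    $failed = false;
} catch (RuntimeException $e) {
    $failed = true;
    echo $e->getMessage(), "\n";
}
```

The same two checks could be folded into the class's request() method if you prefer to keep everything in one place.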
tyshock answered Oct 16 '22 08:10



I recommend Goutte, a simple PHP Web Scraper.

Example usage:

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

$crawler = $client->request('GET', 'http://www.symfony-project.org/');

The request method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

Click on links:

$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);

Submit forms:

$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array('signin[username]' => 'fabien', 'signin[password]' => 'xxxxxx'));

Extract data:

$nodes = $crawler->filter('.error_list');

if ($nodes->count())
{
  die(sprintf("Authentication error: %s\n", $nodes->text()));
}

printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());
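
If you would rather stay with built-in PHP, a roughly equivalent extraction can be sketched with DOMDocument and DOMXPath instead of Goutte's CSS selectors. This is a different technique from the Goutte calls above, and the $html string is made-up sample data.

```php
<?php
// Hypothetical sample page containing both elements the Goutte example looks for.
$html = '<div><ul class="error_list"><li>Bad password</li></ul>'
      . '<span id="nb_tasks">12</span></div>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Rough equivalent of filter('.error_list'):
// elements whose class attribute contains the token "error_list".
$errors = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' error_list ')]"
);
if ($errors->length > 0) {
    printf("Authentication error: %s\n", trim($errors->item(0)->textContent));
}

// Rough equivalent of filter('#nb_tasks').
$tasks = $xpath->query("//*[@id='nb_tasks']")->item(0);
$nbTasks = $tasks !== null ? (int) trim($tasks->textContent) : 0;
printf("Nb tasks: %d\n", $nbTasks);
```

XPath is more verbose than a CSS selector, but it needs no extra packages.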
Salman von Abbas answered Oct 16 '22 09:10
