What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?
Web scraping lets you collect data from web pages across the internet. It's also called web data extraction, and it is closely related to web crawling, which is the process of discovering the pages to scrape. PHP is a widely used back-end scripting language for creating dynamic websites and web applications, and you can implement a web scraper in plain PHP code.
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser.
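As a rough sketch of what "plain PHP" scraping can look like, the example below parses an HTML string with the bundled DOM extension (DOMDocument and DOMXPath). This is one possible approach, not one the text above prescribes. The sample markup and the "item" class name are made up for illustration; in a real scraper the string would come from an HTTP request.

```php
<?php
// Hypothetical markup standing in for a fetched page.
$html = <<<'HTML'
<html><body>
  <h1>Product list</h1>
  <ul>
    <li class="item">Widget</li>
    <li class="item">Gadget</li>
    <li class="other">Ignored</li>
  </ul>
</body></html>
HTML;

// Real-world HTML is often malformed, so collect libxml warnings
// internally instead of letting them be printed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <li> whose class attribute is exactly "item".
$xpath = new DOMXPath($doc);
$items = [];
foreach ($xpath->query('//li[@class="item"]') as $node) {
    $items[] = trim($node->textContent);
}

print_r($items);
```

A DOM parser tends to cope better than pattern matching when the markup is nested or inconsistent, at the cost of a little more setup.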
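Scraping generally encompasses 3 steps:
1. Send a GET or POST request to the target URL.
2. Receive the HTML that comes back as the response.
3. Parse the text you'd like to scrape out of that HTML.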
To accomplish steps 1 and 2, below is a simple PHP class which uses cURL to fetch web pages using either GET or POST. After you get the HTML back, you use regular expressions to accomplish step 3 by parsing out the text you'd like to scrape.
For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial
My favorite program for working with regexes is RegexBuddy. I would advise you to try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regexes you write in your language of choice (including PHP).
Usage:
$curl = new Curl();
$html = $curl->get("http://www.google.com");
// now, do your regex work against $html
PHP Class:
<?php
class Curl
{
    public $cookieJar = "";
    public $curl; // cURL handle for the current request

    public function __construct($cookieJarFile = 'cookies.txt')
    {
        $this->cookieJar = $cookieJarFile;
    }

    // Apply browser-like headers and common options to the current handle.
    function setup()
    {
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.

        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        // Read cookies from and write them back to the same file.
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        // Return the response body as a string instead of printing it.
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    // Fetch a page with GET and return its body.
    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    // Return every match of the first capture group of $reg in $str.
    function getAll($reg, $str)
    {
        preg_match_all($reg, $str, $matches);
        return $matches[1];
    }

    // Submit $fields to $url with POST and return the response body.
    function postForm($url, $fields, $referer = '')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    // 'lasturl' returns the final URL after redirects; otherwise pass a CURLINFO_* constant.
    function getInfo($info)
    {
        $info = ($info == 'lasturl')
            ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL)
            : curl_getinfo($this->curl, $info);
        return $info;
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}
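To illustrate step 3, here is a minimal sketch of the regex work done against the returned HTML. It uses a standalone function that mirrors the class's getAll method, so it runs without a network request. The sample HTML and the link pattern are assumptions made up for this example, and a pattern this simple will miss unquoted or unusually formatted attributes.

```php
<?php
// Hypothetical HTML standing in for the string returned by $curl->get(...).
$html = <<<'HTML'
<ul>
  <li><a href="/docs/intro">Intro</a></li>
  <li><a href="/docs/install">Install</a></li>
  <li><a class="ext" href='https://example.com/faq'>FAQ</a></li>
</ul>
HTML;

// Same idea as Curl::getAll(): return the contents of capture group 1
// for every match, or an empty array if the pattern has no such group.
function getAll(string $reg, string $str): array
{
    preg_match_all($reg, $str, $matches);
    return $matches[1] ?? [];
}

// Group 1 captures the value of a quoted href attribute on an <a> tag.
$links = getAll('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html);

print_r($links);
```

With the class above, the equivalent call would be $curl->getAll($pattern, $html) after fetching $html with get().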
I recommend Goutte, a simple PHP Web Scraper.
Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):
use Goutte\Client;
$client = new Client();
Make requests with the request() method:
$crawler = $client->request('GET', 'http://www.symfony-project.org/');
The request method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).
Click on links:
$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);
Submit forms:
$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array('signin[username]' => 'fabien', 'signin[password]' => 'xxxxxx'));
Extract data:
$nodes = $crawler->filter('.error_list');
if ($nodes->count())
{
    die(sprintf("Authentication error: %s\n", $nodes->text()));
}
printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());