
Scraping Library for PHP - phpQuery?

I'm looking for a PHP library that lets me scrape webpages and takes care of all the cookies and of prefilling forms with their default values, which is what annoys me the most.

I'm tired of having to match every single input element with XPath and would love it if something better existed. I've come across phpQuery, but the manual isn't very clear and I can't find out how to make POST requests.

Can someone help me? Thanks.

@Jonathan Fingland:

In the example provided by the manual for browserGet() we have:

require_once('phpQuery/phpQuery.php');

phpQuery::browserGet('http://google.com/', 'success1');

function success1($browser)
{
    $browser->WebBrowser('success2')
    ->find('input[name=q]')->val('search phrase')
    ->parents('form')
    ->submit();
}

function success2($browser)
{
    echo $browser;
}

I suppose all the other fields are scraped and sent back in the GET request. I want to do the same with the phpQuery::browserPost() method, but I don't know how. The form I'm trying to scrape has an input token, and I would love it if phpQuery were smart enough to scrape the token and let me change only the other fields (in this case username and password), submitting everything via POST.

PS: Rest assured, this is not going to be used for spamming.

Alix Axel asked Oct 29 '09 15:10


1 Answer

See http://code.google.com/p/phpquery/wiki/Ajax and in particular:

phpQuery::post($url, $data, $callback, $type)

and

data: Object, String

This defines the data parameter as either an Object or a String, so a POST request should be possible with the data in query-string format, e.g.:

// POST body in query-string format
$data = "username=Jon&password=123456";
$url = "http://www.mysite.com/login.php";
// $callback and $type are as in the signature above
phpQuery::post($url, $data, $callback, $type);

As phpQuery is a jQuery port, the method signature is the same (the docs link directly to the jQuery site: http://docs.jquery.com/Ajax/jQuery.post).
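If the form carries a hidden token, the query string does not have to be typed out field by field: it could be built from the form's own default values. The following is only a rough sketch of that idea, and it uses PHP's built-in DOM extension and http_build_query() rather than phpQuery. The helper name formDefaults() and the sample HTML are made up for illustration, and only input elements of the first form are handled (no select or textarea).

```php
<?php
// Hypothetical helper: collect the default value of every named <input>
// in the first <form> of the page.
function formDefaults(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings caused by imperfect real-world markup.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $fields = [];
    foreach ($xpath->query('(//form)[1]//input[@name]') as $input) {
        $type = strtolower($input->getAttribute('type'));
        // Skip buttons and file inputs.
        if (in_array($type, ['submit', 'button', 'image', 'reset', 'file'], true)) {
            continue;
        }
        // Skip checkboxes and radio buttons that are not checked by default.
        if (in_array($type, ['checkbox', 'radio'], true) && !$input->hasAttribute('checked')) {
            continue;
        }
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Made-up login form standing in for the fetched page.
$html = '<form action="/login.php" method="post">'
      . '<input type="hidden" name="token" value="abc123">'
      . '<input type="text" name="username" value="">'
      . '<input type="password" name="password">'
      . '<input type="submit" value="Log in">'
      . '</form>';

// Keep the scraped defaults (including the token), override only two fields.
$data = http_build_query(
    array_merge(formDefaults($html), ['username' => 'Jon', 'password' => '123456']),
    '',
    '&'
);

echo $data, "\n"; // token=abc123&username=Jon&password=123456
```

The resulting $data string has the same query-string shape as the example above, so it could be passed as the data argument of the POST call.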

Edit

Two things:

There is also a phpQuery::browserPost function which might meet your needs better.

However, also note that the success2 callback is only called on the submit() or click() methods so you can fill in all of the form fields prior to that.

e.g.

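require_once('phpQuery/phpQuery.php');

// Fetch the login page; success1 runs once it has loaded.
phpQuery::browserGet('http://www.mysite.com/login.php', 'success1');

function success1($browser) {
  // success2 is called only when the form is submitted.
  $handle = $browser
    ->WebBrowser('success2');
  // Fill in the fields before submitting.
  $handle
    ->find('input[name=username]')
      ->val('Jon');
  $handle
    ->find('input[name=password]')
      ->val('123456')
      ->parents('form')
        ->submit();
}

function success2($browser) {
  // Print the page returned after the form submission.
  print $browser;
}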

(Note that this has not been tested, but should work)

Jonathan Fingland answered Oct 06 '22 00:10