
How to crawl with php Goutte and Guzzle if data is loaded by Javascript?

Many times when crawling we run into problems where content rendered on the page is generated with JavaScript, and therefore Goutte is unable to crawl it (e.g. AJAX requests, jQuery).

asked Apr 17 '16 by Batman


1 Answer

You want to have a look at PhantomJS. There is this PHP implementation:

http://jonnnnyw.github.io/php-phantomjs/

if you need to have it working with PHP, of course.

You could fetch the page headlessly and then feed the rendered contents into the crawler, so you can use the nice functions Goutte (Symfony's DomCrawler) gives you, like searching for content. That depends on your needs; maybe you can simply use the DOM, as in this question:

How to get element by class name?
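
For example, with Goutte (which exposes Symfony's DomCrawler) you can filter by CSS class directly. A minimal sketch, assuming a page that needs no JavaScript rendering and a hypothetical class name ".headline":

use Goutte\Client;

$client = new Client();
// Fetch the page (plain HTTP via Guzzle, no JavaScript rendering here).
$crawler = $client->request('GET', 'https://example.com');

// Collect the text of every element carrying the "headline" class.
$titles = $crawler->filter('.headline')->each(function ($node) {
    return $node->text();
});

The same filter() call works on content added with addContent(), as in the headless example below.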

Here is some working code.

// Assumes `use JonnyW\PhantomJs\Client as PhantomClient;` at the top of the file.
$content = $this->getHeadlessResponse($url);
$this->crawler->addContent($content);

/**
 * Get the response using a headless browser (PhantomJS in this case).
 *
 * @param string $url
 *   URL to fetch headlessly.
 *
 * @return string
 *   Rendered response body, or an empty string on failure.
 */
public function getHeadlessResponse($url) {
    // Fetch with PhantomJS...
    $phantomClient = PhantomClient::getInstance();
    // ...and build the request whose result will be fed into the crawler.
    $request = $phantomClient->getMessageFactory()->createRequest($url, 'GET');

    /**
     * @see JonnyW\PhantomJs\Http\Response
     */
    $response = $phantomClient->getMessageFactory()->createResponse();

    // Send the request; the response object is populated in place.
    $phantomClient->send($request, $response);

    if ($response->getStatus() === 200) {
        // Return the rendered page content.
        return $response->getContent();
    }

    // Non-200 status: return an empty string so callers always get a string.
    return '';
}

The only disadvantage of using PhantomJS is that it will be slower than Guzzle, but of course you have to wait for all that pesky JavaScript to load.
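
For comparison, when a page does not need JavaScript at all, a plain Guzzle request is much faster. A minimal sketch, reusing the $url and $this->crawler from above:

use GuzzleHttp\Client as GuzzleClient;

$guzzle = new GuzzleClient();
// Plain HTTP fetch: quick, but you only get the unrendered HTML.
$response = $guzzle->request('GET', $url);
$this->crawler->addContent((string) $response->getBody());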

answered Oct 03 '22 by Alejandro Moreno