How to crawl with PHP Goutte and Guzzle if data is loaded by JavaScript?

Question

When crawling, we often run into pages where the content shown in the browser is generated with JavaScript (e.g. AJAX requests, jQuery), so a crawler that only fetches the raw HTML, such as Goutte with Guzzle, is unable to crawl it.

Alejandro Moreno · Accepted Answer

Have a look at PhantomJS. If you need it to work from PHP, there is this PHP implementation:

http://jonnnnyw.github.io/php-phantomjs/

You could read the page with PhantomJS and then feed the contents to the crawler, so that you can use the nice functions it gives you (searching the contents, etc.). That depends on your needs; maybe you can simply use the DOM, as described in this related question:

How to get element by class name?
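As a rough illustration of the plain DOM route, here is a minimal sketch that uses only PHP's bundled DOM extension. The sample markup, the class name `headline` and the helper `getTextByClass()` are made up for the example; in practice `$html` would be whatever the headless browser returned.

```php
<?php
// Hypothetical rendered HTML, standing in for the headless browser's output.
$html = <<<HTML
<html><body>
  <div class="headline">First story</div>
  <div class="headline featured">Second story</div>
  <div class="footer">Not a headline</div>
</body></html>
HTML;

/**
 * Return the trimmed text of every element carrying the given class.
 * Assumes $class is a plain class name without quote characters.
 */
function getTextByClass(string $html, string $class): array {
    $dom = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Pad with spaces so "headline" does not match e.g. "headline-small".
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$headlines = getTextByClass($html, 'headline');
print_r($headlines);
```

The XPath expression matches the class as a whole word, so elements with several classes are still found.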

Here is some working code.

  // Fetch the rendered page once and feed it into the crawler.
  $content = $this->getHeadlessReponse($url);
  $this->crawler->addContent($content);

  /**
   * Get response using a headless browser (PhantomJS in this case).
   *
   * @param string $url
   *   URL to fetch headless.
   *
   * @return string
   *   Rendered page content, or an empty string if the status is not 200.
   */
  public function getHeadlessReponse($url) {
    // PhantomClient is assumed to be imported at the top of the file:
    // use JonnyW\PhantomJs\Client as PhantomClient;
    $phantomClient = PhantomClient::getInstance();

    // Build the request for PhantomJS.
    $request = $phantomClient->getMessageFactory()->createRequest($url, 'GET');

    /**
     * @see JonnyW\PhantomJs\Http\Response
     */
    $response = $phantomClient->getMessageFactory()->createResponse();

    // Send the request; PhantomJS renders the page and fills in $response.
    $phantomClient->send($request, $response);

    if ($response->getStatus() === 200) {
      // Return the rendered page content.
      return $response->getContent();
    }

    // Any other status: return an empty string so the documented return type holds.
    return '';
  }

The only disadvantage of using PhantomJS is that it will be slower than Guzzle, but of course it has to wait for all that pesky JavaScript to load and run.
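One possible way to limit that cost, sketched here as an assumption rather than something the code above does, is to try the fast fetch first and only fall back to the headless browser when the expected markup is missing. The function `fetchRendered()`, the two callables and the example URLs are hypothetical; the stubs stand in for a Guzzle request and for `getHeadlessReponse()`.

```php
<?php
/**
 * Fetch $url with the fast client first; use the slow headless fetcher only
 * when the static HTML lacks the marker expected in the rendered page.
 *
 * $fastFetch and $headlessFetch are hypothetical callables that take a URL
 * and return the page HTML as a string.
 */
function fetchRendered(string $url, string $marker, callable $fastFetch, callable $headlessFetch): string {
    $html = $fastFetch($url);
    if ($html !== '' && strpos($html, $marker) !== false) {
        // The content was already present in the static HTML.
        return $html;
    }
    // The marker is missing, so the page probably needs JavaScript to render.
    return $headlessFetch($url);
}

// Stub fetchers so the sketch runs without network access or PhantomJS.
$calls = ['fast' => 0, 'headless' => 0];
$fast = function (string $url) use (&$calls): string {
    $calls['fast']++;
    return $url === 'http://example.test/static'
        ? '<div class="headline">Static</div>'
        : '<div id="app"></div>';
};
$headless = function (string $url) use (&$calls): string {
    $calls['headless']++;
    return '<div id="app"><div class="headline">Rendered</div></div>';
};

$a = fetchRendered('http://example.test/static', 'class="headline"', $fast, $headless);
$b = fetchRendered('http://example.test/dynamic', 'class="headline"', $fast, $headless);
```

This only helps when you know a marker that reliably appears in the fully rendered page; otherwise going straight to the headless fetch is simpler.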

Tags: php, web-crawler, guzzle, scraper, goutte

