Many times when crawling we run into problems where content that is rendered on the page is generated with Javascript and therefore scrapy is unable to crawl for it (eg. ajax requests, jQuery)
You want to have a look at phantomjs. There is this php implementation:
http://jonnnnyw.github.io/php-phantomjs/
if you need to have it working with php of course.
You could read the page and then feed the contents to Guzzle, in order to use the nice functions that Guzzle gives you (like search for contents, etc...). That would depend on your needs, maybe you can simply use the dom, like this:
How to get element by class name?
Here is some working code.
$content = $this->getHeadlessReponse($url);
$this->crawler->addContent($this->getHeadlessReponse($url));
/**
* Get response using a headless browser (phantom in this case).
*
* @param $url
* URL to fetch headless
*
* @return string
* Response.
*/
public function getHeadlessReponse($url) {
// Fetch with phamtomjs
$phantomClient = PhantomClient::getInstance();
// and feed into the crawler.
$request = $phantomClient->getMessageFactory()->createRequest($url, 'GET');
/**
* @see JonnyW\PhantomJs\Http\Response
**/
$response = $phantomClient->getMessageFactory()->createResponse();
// Send the request
$phantomClient->send($request, $response);
if($response->getStatus() === 200) {
// Dump the requested page content
return $response->getContent();
}
}
Only disadvantage of using phantom, it will be slower than guzzle, but of course, you have to wait for all those pesky js to be loaded.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With