
How can I make PhantomJS skip downloading certain kinds of resources?

Tags: phantomjs

PhantomJS has the loadImages setting, but I want finer control.

How can I make PhantomJS skip downloading certain kinds of resources, such as CSS?

=====

Good news: this feature has been added.

https://code.google.com/p/phantomjs/issues/detail?id=230

The gist:

    page.onResourceRequested = function(requestData, request) {
        if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
            console.log('The url of the request is matching. Aborting: ' + requestData['url']);
            request.abort();
        }
    };
atian25 asked Feb 28 '12 17:02


2 Answers

UPDATED, Working!

As of PhantomJS 1.9, the previously posted answer no longer works. Use this code instead:

    var webPage = require('webpage');
    var page = webPage.create();

    page.onResourceRequested = function(requestData, networkRequest) {
      var match = requestData.url.match(/wordfamily.js/g);
      if (match != null) {
        console.log('Request (#' + requestData.id + '): ' + JSON.stringify(requestData));
        networkRequest.cancel(); // or .abort()
      }
    };

If you use abort() instead of cancel(), it will trigger onResourceError.

See the PhantomJS docs for details.
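To skip a whole kind of resource rather than one file, the same hook can be driven by a list of file extensions. The following is only a minimal sketch: `hasBlockedExtension` and `installResourceFilter` are hypothetical helper names, not part of PhantomJS, and it assumes the resource type can be guessed from the URL path (it does not inspect response headers). It uses `networkRequest.abort()`, so `onResourceError` may fire for the skipped requests, as noted above.

```javascript
// Sketch: block requests by file extension. Helper names are hypothetical.

// Returns true when the URL's path ends in one of the given extensions.
function hasBlockedExtension(url, extensions) {
  // Drop the fragment and query string so "style.css?v=2" still matches.
  var path = String(url).split('#')[0].split('?')[0].toLowerCase();
  for (var i = 0; i < extensions.length; i++) {
    var suffix = '.' + extensions[i].toLowerCase();
    if (path.length >= suffix.length && path.slice(-suffix.length) === suffix) {
      return true;
    }
  }
  return false;
}

// Attaches an onResourceRequested handler that aborts matching requests.
// `page` is expected to be a PhantomJS page from require('webpage').create().
function installResourceFilter(page, extensions) {
  page.onResourceRequested = function (requestData, networkRequest) {
    if (hasBlockedExtension(requestData.url, extensions)) {
      console.log('Skipping: ' + requestData.url);
      networkRequest.abort();
    }
  };
  return page;
}

// Inside a PhantomJS script this might be used as:
//   var page = require('webpage').create();
//   installResourceFilter(page, ['css', 'woff']);
//   page.open('http://example.com/', function () { phantom.exit(); });
```

Keeping the URL check in a separate pure function makes it possible to try the matching rules outside PhantomJS before wiring them into a page.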

webo80 answered Oct 01 '22 19:10


You can try this crawler: http://github.com/eugenehp/node-crawler

Otherwise, you can still try one of the approaches below with PhantomJS.

The easy way is to load the page, parse it, exclude the unwanted resources, and then load the result into PhantomJS.

Another way is to simply block the hosts in the firewall.

Optionally, you can use a proxy to block certain URLs and the queries to them.

One more option is to load the page and then remove the unwanted resources, but I don't think that is the right approach here.
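For the first approach (load, parse, exclude, then hand the result to PhantomJS), the exclusion step might look roughly like the sketch below. `stripStylesheets` is a hypothetical helper, and the regex-based matching is an assumption made to keep the example short; a real HTML parser would handle edge cases more reliably. Fetching the HTML is left out.

```javascript
// Sketch: remove stylesheet references from an HTML string before rendering.
// Regex-based and deliberately simple; the function name is hypothetical.
function stripStylesheets(html) {
  return String(html)
    // Remove inline <style>...</style> blocks.
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '')
    // Remove <link ... rel="stylesheet" ...> tags, quoted or unquoted.
    .replace(/<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>/gi, '');
}

// In a PhantomJS script the cleaned markup could then be loaded with
// page.setContent(cleanedHtml, baseUrl) instead of page.open(url).
```

This only covers stylesheets referenced directly in the markup; resources pulled in by scripts at runtime would still need a request-level filter.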

Eugene Hauptmann answered Oct 01 '22 17:10