PhantomJS has a loadImages setting,
but I want more control.
How can I make PhantomJS skip downloading certain kinds of resources,
such as CSS?
=====
Good news: this feature has been added.
https://code.google.com/p/phantomjs/issues/detail?id=230
The gist:
    page.onResourceRequested = function(requestData, request) {
        if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
            console.log('The url of the request is matching. Aborting: ' + requestData['url']);
            request.abort();
        }
    };
Update (working): as of PhantomJS 1.9, the answer above no longer works. Use this code instead:
    var webPage = require('webpage');
    var page = webPage.create();

    page.onResourceRequested = function(requestData, networkRequest) {
        var match = requestData.url.match(/wordfamily.js/g);
        if (match != null) {
            console.log('Request (#' + requestData.id + '): ' + JSON.stringify(requestData));
            networkRequest.cancel(); // or .abort()
        }
    };
If you use abort() instead of cancel(), it will trigger onResourceError.
See the PhantomJS docs for details.
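To skip a whole class of resources rather than one file, the same hook can be driven by a list of file extensions. This is only a rough sketch: `BLOCKED_EXTENSIONS` and `shouldBlock` are made-up names, the extension list is an example, and it uses `abort()` as in the snippets above. Matching on the URL path will miss stylesheets served from URLs without a `.css` suffix.

```javascript
// Hypothetical list of file extensions to skip; adjust as needed.
var BLOCKED_EXTENSIONS = ['css', 'png', 'jpg', 'gif', 'woff'];

// Returns true when the URL's path ends in one of the given extensions.
// The query string and fragment are ignored before matching.
function shouldBlock(url, extensions) {
    var path = url.split(/[?#]/)[0];
    var match = /\.([a-z0-9]+)$/i.exec(path);
    return match !== null && extensions.indexOf(match[1].toLowerCase()) !== -1;
}

// Only wire up the hook when actually running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onResourceRequested = function (requestData, networkRequest) {
        if (shouldBlock(requestData.url, BLOCKED_EXTENSIONS)) {
            console.log('Skipping: ' + requestData.url);
            networkRequest.abort();
        }
    };
}
```

Keeping the matching logic in a plain function means it can be checked outside PhantomJS before being plugged into onResourceRequested.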
You can also try this: http://github.com/eugenehp/node-crawler
Otherwise, you can still try the approaches below with PhantomJS.
The easy way is to load the page -> parse it -> strip the unwanted resources -> load the result into PhantomJS.
Another way is simply to block the hosts in the firewall.
Alternatively, you can use a proxy to block requests to certain URLs.
One more option is to load the page and then remove the unwanted resources, but I don't think that's the right approach here.
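The first approach (fetch the HTML, strip what you don't want, then hand it to PhantomJS) might look roughly like this. It is a sketch under assumptions: `stripStylesheets`, `rawHtml` and `baseUrl` are made-up names, the HTML is a hard-coded example, and regex-based stripping is fragile on real-world markup.

```javascript
// Removes <link rel="stylesheet"> tags and inline <style> blocks from an
// HTML string. Regex-based, so treat it as an approximation, not a parser.
function stripStylesheets(html) {
    return html
        .replace(/<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>/gi, '')
        .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, '');
}

// Example input; in practice this would be the page source fetched beforehand.
var rawHtml = '<html><head><link rel="stylesheet" href="a.css">' +
              '<style>p { color: red; }</style></head>' +
              '<body><p>Hello</p></body></html>';
var baseUrl = 'http://example.com/'; // placeholder URL

// Only touch the PhantomJS API when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // Load the cleaned markup as if it had come from baseUrl.
    page.setContent(stripStylesheets(rawHtml), baseUrl);
}
```

Resources still referenced by the cleaned markup (scripts, images) are requested as usual, so this only removes what the stripping step actually catches.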