I'm trying to speed up a Selenium/PhantomJS web scraper in Python by preventing the download of CSS and other resources. All I need to read are the src and alt attributes of img tags. I've found this code:
page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
via: How can I control PhantomJS to skip download some kind of resource?
How and where can I implement this code in Selenium driven by Python? Or is there a better way to stop CSS and other resources from downloading?
Note: I've already found how to prevent image downloads by editing the service_args variable, via:
How do I set a proxy for phantomjs/ghostdriver in python webdriver?
and
PhantomJS 1.8 with Selenium on python. How to block images?
But service_args can't help me with resources like CSS. Thanks!
A bold young soul by the name of “watsonmw” recently added functionality to Ghostdriver (which PhantomJS uses to interface with Selenium) that allows access to PhantomJS API calls which require a page object, like the onResourceRequested one you cited.
For a solution at all costs, consider building from source (which developers note “takes roughly 30 minutes ... with 4 parallel compile jobs on a modern machine”) and integrating his patch, linked above.
Then this (untested) Python code should work as a proof of concept:
from selenium import webdriver

driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
    page.onResourceRequested = function(requestData, request) {
        // ...
    }
''', 'args': []})
Until then, you’ll just get a “Can't find variable: page” exception.
Good luck! There are a lot of great alternatives, like working in a JavaScript environment, driving Gecko, proxies, etc.
Will's answer got me on track. (Thanks Will!)
Current PhantomJS (1.9.8) includes Ghostdriver 1.1.0 which already contains watsonmw's patch.
You need to download the latest PhantomJS and then run the following (sudo may be required):
ln -s path/to/bin/phantomjs /usr/local/share/phantomjs
ln -s path/to/bin/phantomjs /usr/local/bin/phantomjs
ln -s path/to/bin/phantomjs /usr/bin/phantomjs
And then try this:
from selenium import webdriver

driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
    var page = this; // won't work otherwise
    page.onResourceRequested = function(requestData, request) {
        // ...
    }
''', 'args': []})
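To make the elided callback body concrete, here is a minimal sketch of how the pieces above might be combined to abort CSS requests. The helper names build_block_script and block_resources, the extension list, and the URL pattern are my own assumptions for illustration, not part of Selenium or Ghostdriver; only the endpoint, the 'script'/'args' payload, and the `var page = this;` line come from the answer above. I have not run it against a live PhantomJS.

```python
import json
import re

# Endpoint and command name taken from the answer above.
PHANTOM_EXECUTE = ('POST', '/session/$sessionId/phantom/execute')


def build_block_script(extensions):
    """Return PhantomJS-side JavaScript that aborts requests whose URL
    ends in one of the given extensions (hypothetical helper)."""
    # Matches e.g. ".css" or ".css?v=3" at the end of the URL.
    pattern = r'\.(' + '|'.join(re.escape(e) for e in extensions) + r')(\?.*)?$'
    return '''
var page = this;
var blocked = new RegExp(%s, 'i');
page.onResourceRequested = function(requestData, request) {
    if (blocked.test(requestData.url)) {
        request.abort();
    }
};
''' % json.dumps(pattern)  # json.dumps yields a valid JS string literal


def block_resources(driver, extensions=('css',)):
    """Register the Ghostdriver command and install the handler on the
    current page (hypothetical helper). Returns the script that was sent."""
    driver.command_executor._commands['executePhantomScript'] = PHANTOM_EXECUTE
    script = build_block_script(extensions)
    driver.execute('executePhantomScript', {'script': script, 'args': []})
    return script


# Intended usage (requires selenium and a phantomjs binary on the PATH):
#   from selenium import webdriver
#   driver = webdriver.PhantomJS('phantomjs')
#   block_resources(driver, ('css',))
#   driver.get('http://example.com')
```

Matching on the URL alone will miss stylesheets served from extensionless URLs, so treat the pattern as a starting point and adjust it to the sites you scrape.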