
Prevent CSS/other resource download in PhantomJS/Selenium driven by Python

I'm trying to speed up a Selenium/PhantomJS web scraper in Python by preventing the download of CSS and other resources. All I need from each page are the src and alt attributes of its img tags. I've found this code:

page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};

via: How can I control PhantomJS to skip download some kind of resource?

How and where can I use this code with Selenium driven by Python? Or is there a better way to stop CSS and other resources from downloading?

Note: I've already found how to prevent image download by editing service_args variable via:

How do I set a proxy for phantomjs/ghostdriver in python webdriver?

and

PhantomJS 1.8 with Selenium on python. How to block images?

But service_args can't help me with resources like CSS. Thanks!
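For reference, the service_args approach to image blocking mentioned above can be sketched roughly as follows. The helper names phantomjs_service_args and make_driver are hypothetical, and the sketch assumes a Selenium release that still ships webdriver.PhantomJS; --load-images and --proxy are PhantomJS command-line options.

```python
def phantomjs_service_args(load_images=False, proxy=None):
    """Build the list of command-line flags passed to the PhantomJS binary.

    Hypothetical helper for illustration only.
    """
    args = ['--load-images=' + ('true' if load_images else 'false')]
    if proxy:
        # proxy is expected in "host:port" form
        args.append('--proxy=' + proxy)
    return args


def make_driver(service_args):
    """Start PhantomJS with the given flags (needs Selenium with PhantomJS support)."""
    # Imported lazily so the helper above is usable without Selenium installed.
    from selenium import webdriver
    return webdriver.PhantomJS('phantomjs', service_args=service_args)
```

Calling make_driver(phantomjs_service_args(load_images=False)) would start a driver that skips images, but as noted, there is no equivalent flag for CSS.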

YPCrumble asked Sep 30 '13 16:09


2 Answers

A bold young soul by the name of “watsonmw” recently added functionality to GhostDriver (which PhantomJS uses to interface with Selenium) that allows access to PhantomJS API calls that require a page object, like the onResourceRequested callback you cited.

If you need a solution at all costs, consider building PhantomJS from source (which the developers note “takes roughly 30 minutes ... with 4 parallel compile jobs on a modern machine”) and integrating his patch.

Then this (untested) Python code should work as a proof of concept:

from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
page.onResourceRequested = function(requestData, request) {
    // ...
}
''', 'args': []})

Until the patch is in place, you’ll just get a “Can't find variable: page” exception.

Good luck! There are a lot of great alternatives, like working in a JavaScript environment, driving Gecko, using proxies, etc.

Will McChesney answered Nov 12 '22 05:11


Will's answer got me on track. (Thanks Will!)

The current PhantomJS release (1.9.8) includes GhostDriver 1.1.0, which already contains watsonmw's patch.

You need to download the latest PhantomJS and then create the following symlinks (sudo may be required):

ln -s path/to/bin/phantomjs  /usr/local/share/phantomjs
ln -s path/to/bin/phantomjs  /usr/local/bin/phantomjs
ln -s path/to/bin/phantomjs  /usr/bin/phantomjs
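To check that the symlinks took effect, a small sketch like the one below can confirm which phantomjs binary is found on the PATH and which version it reports. The function name phantomjs_version is hypothetical; --version is a PhantomJS command-line option.

```python
import shutil
import subprocess


def phantomjs_version(name='phantomjs'):
    """Return the version string reported by the binary on PATH, or None if not found.

    Hypothetical helper for illustration only.
    """
    path = shutil.which(name)
    if path is None:
        return None
    # "phantomjs --version" prints just the version number.
    output = subprocess.check_output([path, '--version'])
    return output.decode().strip()
```

If this returns None, Selenium will not be able to find the executable by name either.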

And then try this:

from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
    var page = this; // won't work otherwise
    page.onResourceRequested = function(requestData, request) {
        // ...
    };
''', 'args': []})
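As an illustration of what the elided callback body might look like, here is a minimal sketch that aborts every request whose URL matches one of a list of regular expressions. The names build_blocker_script and install_blocker and the example pattern are assumptions made for this sketch, not part of Selenium or GhostDriver; it relies on the same executePhantomScript registration shown above and on request.abort() from the question's snippet.

```python
import json


def build_blocker_script(patterns):
    """Return PhantomJS script text that aborts requests matching any pattern.

    patterns is a list of regular-expression source strings (hypothetical helper).
    """
    # json.dumps produces an array literal of correctly escaped string literals.
    return '''
var page = this;
var patterns = %s.map(function (p) { return new RegExp(p, 'i'); });
page.onResourceRequested = function (requestData, request) {
    for (var i = 0; i < patterns.length; i++) {
        if (patterns[i].test(requestData.url)) {
            request.abort();
            return;
        }
    }
};
''' % json.dumps(patterns)


def install_blocker(driver, patterns):
    """Register the GhostDriver command and send the blocking script to PhantomJS."""
    driver.command_executor._commands['executePhantomScript'] = (
        'POST', '/session/$sessionId/phantom/execute')
    driver.execute('executePhantomScript',
                   {'script': build_blocker_script(patterns), 'args': []})
```

With a driver created as above, install_blocker(driver, [r'\.css(\?|$)']) would be called once before driver.get(...); additional patterns can be added for fonts or scripts if the page still exposes its img tags without them.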
MaratC answered Nov 12 '22 06:11