Setting timeout on selenium webdriver.PhantomJS

The situation

I have a simple Python script to get the HTML source for a given URL:

    browser = webdriver.PhantomJS()
    browser.get(url)
    content = browser.page_source

Occasionally, the URL points to a page with slow-loading external resources (e.g. video files, or really slow advertising content).

Webdriver will wait until those resources are loaded before completing the .get(url) request.

Note: For extraneous reasons, I need to do this with PhantomJS rather than requests or urllib2


The question

I'd like to set a timeout on PhantomJS resource loading so that if a resource takes too long to load, the browser gives up on it and treats it as missing.

This would allow me to run the subsequent .page_source query on whatever the browser has loaded.

Documentation on webdriver.PhantomJS is very thin, and I haven't found a similar question on SO.

thanks in advance!

tohster asked Feb 12 '14 20:02


2 Answers

Long explanation below, so TL;DR:

The GhostDriver embedded in PhantomJS 1.9.8 ignores the resourceTimeout option. Instead, use webdriver's implicitly_wait() and set_page_load_timeout(), and wrap the get() call in a try-except block.

    # Python
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    browser = webdriver.PhantomJS()
    browser.implicitly_wait(3)         # seconds to wait when locating elements
    browser.set_page_load_timeout(3)   # seconds before get() raises TimeoutException
    try:
        browser.get("http://url_here")
    except TimeoutException as e:
        # Handle your exception here
        print(e)
    finally:
        browser.quit()

Explanation

To pass PhantomJS page settings through Selenium, use webdriver's DesiredCapabilities, for example:

    # Python
    from selenium import webdriver

    # Copy the defaults so the shared DesiredCapabilities.PHANTOMJS dict is not modified
    cap = webdriver.DesiredCapabilities.PHANTOMJS.copy()
    cap["phantomjs.page.settings.resourceTimeout"] = 1000  # milliseconds
    cap["phantomjs.page.settings.loadImages"] = False
    cap["phantomjs.page.settings.userAgent"] = "faking it"
    browser = webdriver.PhantomJS(desired_capabilities=cap)

    // Java
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;
    import org.openqa.selenium.remote.DesiredCapabilities;

    DesiredCapabilities capabilities = DesiredCapabilities.phantomjs();
    capabilities.setCapability("phantomjs.page.settings.resourceTimeout", 1000);
    capabilities.setCapability("phantomjs.page.settings.loadImages", false);
    capabilities.setCapability("phantomjs.page.settings.userAgent", "faking it");
    WebDriver webdriver = new PhantomJSDriver(capabilities);

But here's the catch: as of today (2014/Dec/11), with PhantomJS 1.9.8 and its embedded GhostDriver, resourceTimeout is not applied by GhostDriver (see GhostDriver issue #380 on GitHub).

As a workaround, use Selenium's own timeout methods and wrap webdriver's get call in a try-except (try-catch in Java) block, e.g.

    # Python
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    browser = webdriver.PhantomJS()
    browser.implicitly_wait(3)         # seconds to wait when locating elements
    browser.set_page_load_timeout(3)   # seconds before get() raises TimeoutException
    try:
        browser.get("http://url_here")
    except TimeoutException as e:
        # Handle your exception here
        print(e)
    finally:
        browser.quit()

    // Java
    import java.util.concurrent.TimeUnit;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;

    WebDriver webdriver = new PhantomJSDriver();
    webdriver.manage().timeouts()
            .pageLoadTimeout(3, TimeUnit.SECONDS)
            .implicitlyWait(3, TimeUnit.SECONDS);
    try {
        webdriver.get("http://url_here");
    } catch (org.openqa.selenium.TimeoutException e) {
        // Handle your exception here
        System.out.println(e.getMessage());
    } finally {
        webdriver.quit();
    }
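
The workaround above quits the browser once the timeout fires, while the question asks for the source of whatever did load. Here is a minimal sketch of that variant. It assumes page_source is still readable after a page-load timeout, which has not been verified here against PhantomJS 1.9.8. The helper name fetch_page_source and its return shape are hypothetical, and the ImportError fallback is there only so the sketch can be tried without Selenium installed.

```python
try:
    from selenium.common.exceptions import TimeoutException
except ImportError:
    # Fallback (assumption for illustration): lets the helper run without Selenium.
    class TimeoutException(Exception):
        pass


def fetch_page_source(browser, url, timeout_seconds=3):
    """Return (page_source, timed_out) for url.

    browser is any WebDriver-like object, e.g. webdriver.PhantomJS().
    """
    browser.set_page_load_timeout(timeout_seconds)
    timed_out = False
    try:
        browser.get(url)
    except TimeoutException:
        # The page did not finish loading in time; fall through and
        # read whatever the browser holds at this point.
        timed_out = True
    return browser.page_source, timed_out
```

The caller would still create the browser and call browser.quit() in a finally block, as in the snippets above. The timed_out flag lets it decide whether to trust a possibly incomplete page.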
EwyynTomato answered Nov 17 '22 11:11


PhantomJS provides a resourceTimeout setting, which might suit your needs. Quoting the documentation:

(in milli-secs) defines the timeout after which any resource requested will stop trying and proceed with other parts of the page. onResourceTimeout callback will be called on timeout.

So in Ruby, you can do something like

    require 'selenium-webdriver'

    capabilities = Selenium::WebDriver::Remote::Capabilities.phantomjs("phantomjs.page.settings.resourceTimeout" => "5000")
    driver = Selenium::WebDriver.for :phantomjs, :desired_capabilities => capabilities

I believe the Python equivalent is something like this (untested; it only shows the logic):

    from selenium import webdriver

    driver = webdriver.PhantomJS(desired_capabilities={'phantomjs.page.settings.resourceTimeout': '5000'})
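
Every page setting in this thread is passed under the same "phantomjs.page.settings." key prefix, so the capability dict can be built by a small helper. This is only a sketch: the function name phantomjs_page_settings is hypothetical, and the base dict below is a stand-in for webdriver.DesiredCapabilities.PHANTOMJS so the snippet runs without Selenium. Per the other answer, GhostDriver in PhantomJS 1.9.8 may ignore resourceTimeout even when it is passed this way.

```python
def phantomjs_page_settings(base_capabilities, **settings):
    """Return a copy of base_capabilities with PhantomJS page settings added."""
    prefix = "phantomjs.page.settings."
    capabilities = dict(base_capabilities)  # copy, so the caller's dict is untouched
    for name, value in settings.items():
        capabilities[prefix + name] = value
    return capabilities


# Stand-in for webdriver.DesiredCapabilities.PHANTOMJS (assumption for illustration)
base = {"browserName": "phantomjs"}
caps = phantomjs_page_settings(base, resourceTimeout=5000, loadImages=False)
```

The result would then be passed as webdriver.PhantomJS(desired_capabilities=caps).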
Yi Zeng answered Nov 17 '22 09:11