I am writing a generic web-scraper using Selenium 2 (version 2.33 Python bindings, Firefox driver). It is supposed to take an arbitrary URL, load the page, and report all of the outbound links. Because the URL is arbitrary, I cannot make any assumptions whatsoever about the contents of the page, so the usual advice (wait for a specific element to be present) is inapplicable.
I have code which is supposed to poll document.readyState
until it reaches "complete" or a 30s timeout has elapsed, and then proceed:
def readystate_complete(d):
# AFAICT Selenium offers no better way to wait for the document to be loaded,
# if one is in ignorance of its contents.
return d.execute_script("return document.readyState") == "complete"
def load_page(driver, url):
try:
driver.get(url)
WebDriverWait(driver, 30).until(readystate_complete)
except WebDriverException:
pass
links = []
try:
for elt in driver.find_elements_by_xpath("//a[@href]"):
try: links.append(elt.get_attribute("href"))
except WebDriverException: pass
except WebDriverException: pass
return links
This sort-of works, but on about one page out of five, the .until
call hangs forever. When this happens, usually the browser has not in fact finished loading the page (the "throbber" is still spinning) but tens of minutes can go by and the timeout does not trigger. But sometimes the page does appear to have loaded completely and the script still does not go on.
What gives? How do I make the timeout work reliably? Is there a better way to request a wait-for-page-to-load (if one cannot make any assumptions about the contents)?
Note: The obsessive catching-and-ignoring of WebDriverException
has proven necessary to ensure that it extracts as many links from the page as possible, whether or not JavaScript inside the page is doing funny stuff with the DOM (e.g. I used to get "stale element" errors in the loop that extracts the HREF attributes).
NOTE: There are a lot of variations on this question both on this site and elsewhere, but they've all either got a subtle but critical difference that makes the answers (if any) useless to me, or I've tried the suggestions and they don't work. Please answer exactly the question I have asked.
We can get Selenium to recognize that a page is loaded. We can set the implicit wait for this purpose. It shall make the driver to wait for a specific amount of time for an element to be available after page loaded.
1. implicitlyWait() This timeout is used to specify the amount of time the driver should wait while searching for an element if it is not immediately present.
The pageLoadTimeout is the method used to set the time for the entire page load prior to throwing an exception. If the timeout time is set to negative, then the time taken to load the page is endless. This timeout is generally used with the navigate and manage methods.
Solution. You can manually increase the wait time by hit-and-trial. If the problem persists for a longer period of time, there may be some other issue and you should continue onto the next solution. You can explicitly add wait by using JavaScript Executor.
The "recommended" (however still ugly) solution could be to use explicit wait:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
old_value = browser.find_element_by_id('thing-on-old-page').text
browser.find_element_by_link_text('my link').click()
WebDriverWait(browser, 3).until(
expected_conditions.text_to_be_present_in_element(
(By.ID, 'thing-on-new-page'),
'expected new text'
)
)
The naive attempt would be something like this:
def wait_for(condition_function):
start_time = time.time()
while time.time() < start_time + 3:
if condition_function():
return True
else:
time.sleep(0.1)
raise Exception(
'Timeout waiting for {}'.format(condition_function.__name__)
)
def click_through_to_new_page(link_text):
browser.find_element_by_link_text('my link').click()
def page_has_loaded():
page_state = browser.execute_script(
'return document.readyState;'
)
return page_state == 'complete'
wait_for(page_has_loaded)
Another, better one would be (credits to @ThomasMarks):
def click_through_to_new_page(link_text):
link = browser.find_element_by_link_text('my link')
link.click()
def link_has_gone_stale():
try:
# poll the link with an arbitrary call
link.find_elements_by_id('doesnt-matter')
return False
except StaleElementReferenceException:
return True
wait_for(link_has_gone_stale)
And the final example includes comparing page ids as below (which could be bulletproof):
class wait_for_page_load(object):
def __init__(self, browser):
self.browser = browser
def __enter__(self):
self.old_page = self.browser.find_element_by_tag_name('html')
def page_has_loaded(self):
new_page = self.browser.find_element_by_tag_name('html')
return new_page.id != self.old_page.id
def __exit__(self, *_):
wait_for(self.page_has_loaded)
And now we can do:
with wait_for_page_load(browser):
browser.find_element_by_link_text('my link').click()
Above code samples are from Harry's blog.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With