
Reliably detect page load or time out, Selenium 2

I am writing a generic web-scraper using Selenium 2 (version 2.33 Python bindings, Firefox driver). It is supposed to take an arbitrary URL, load the page, and report all of the outbound links. Because the URL is arbitrary, I cannot make any assumptions whatsoever about the contents of the page, so the usual advice (wait for a specific element to be present) is inapplicable.

I have code which is supposed to poll document.readyState until it reaches "complete" or a 30s timeout has elapsed, and then proceed:

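from selenium.common.exceptions import WebDriverException
from selenium.webdriver.support.ui import WebDriverWait

def readystate_complete(d):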
    # AFAICT Selenium offers no better way to wait for the document to be loaded,
    # if one is in ignorance of its contents.
    return d.execute_script("return document.readyState") == "complete"

def load_page(driver, url):
    try:
        driver.get(url)
        WebDriverWait(driver, 30).until(readystate_complete)
    except WebDriverException:
        pass

    links = []
    try:
        for elt in driver.find_elements_by_xpath("//a[@href]"):
            try: links.append(elt.get_attribute("href"))
            except WebDriverException: pass
    except WebDriverException: pass
    return links

This sort-of works, but on about one page out of five, the .until call hangs forever. When this happens, usually the browser has not in fact finished loading the page (the "throbber" is still spinning) but tens of minutes can go by and the timeout does not trigger. But sometimes the page does appear to have loaded completely and the script still does not go on.

What gives? How do I make the timeout work reliably? Is there a better way to request a wait-for-page-to-load (if one cannot make any assumptions about the contents)?

Note: The obsessive catching-and-ignoring of WebDriverException has proven necessary to ensure that it extracts as many links from the page as possible, whether or not JavaScript inside the page is doing funny stuff with the DOM (e.g. I used to get "stale element" errors in the loop that extracts the HREF attributes).

NOTE: There are a lot of variations on this question both on this site and elsewhere, but they've all either got a subtle but critical difference that makes the answers (if any) useless to me, or I've tried the suggestions and they don't work. Please answer exactly the question I have asked.

asked Sep 10 '13 22:09 by zwol





1 Answer

  1. The "recommended" (though still ugly) solution is to use an explicit wait:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait 
    from selenium.webdriver.support import expected_conditions
    
    old_value = browser.find_element_by_id('thing-on-old-page').text
    browser.find_element_by_link_text('my link').click()
    WebDriverWait(browser, 3).until(
        expected_conditions.text_to_be_present_in_element(
            (By.ID, 'thing-on-new-page'),
            'expected new text'
        )
    )
    
  2. The naive attempt would be something like this:

    import time
    
    def wait_for(condition_function):
        start_time = time.time()
        while time.time() < start_time + 3:
            if condition_function():
                return True
            else:
                time.sleep(0.1)
        raise Exception(
            'Timeout waiting for {}'.format(condition_function.__name__)
        )
    
    
    def click_through_to_new_page(link_text):
        # use the argument rather than a hard-coded link text
        browser.find_element_by_link_text(link_text).click()
    
        def page_has_loaded():
            page_state = browser.execute_script(
                'return document.readyState;'
            ) 
            return page_state == 'complete'
    
        wait_for(page_has_loaded)
    
  3. A better approach (credit to @ThomasMarks) is to wait until the clicked link goes stale:

    from selenium.common.exceptions import StaleElementReferenceException
    
    def click_through_to_new_page(link_text):
        link = browser.find_element_by_link_text(link_text)
        link.click()
    
        def link_has_gone_stale():
            try:
                # poll the link with an arbitrary call
                link.find_elements_by_id('doesnt-matter') 
                return False
            except StaleElementReferenceException:
                return True
    
        wait_for(link_has_gone_stale)
    
  4. The final example compares the internal ids of the old and new html elements, which may be the most robust of the four:

    class wait_for_page_load(object):
    
        def __init__(self, browser):
            self.browser = browser
    
        def __enter__(self):
            self.old_page = self.browser.find_element_by_tag_name('html')
    
        def page_has_loaded(self):
            new_page = self.browser.find_element_by_tag_name('html')
            return new_page.id != self.old_page.id
    
        def __exit__(self, *_):
            wait_for(self.page_has_loaded)
    

    And now we can do:

    with wait_for_page_load(browser):
        browser.find_element_by_link_text('my link').click()
    

The code samples above are from Harry's blog.
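
None of the samples above bound the time spent in the initial navigation, and the polling helper hard-codes a 3-second limit. The following is a minimal sketch, not taken from the blog or from the question, of one way to put a single deadline around both the navigation and the readyState polling. The name `load_page_bounded` and the `timeout` and `poll_interval` parameters are assumptions made for illustration. `set_page_load_timeout` is an existing WebDriver method in the Python bindings, but nothing in this thread confirms that it cures the hang described in the question.

```python
import time


def wait_for(condition, timeout=3.0, poll_interval=0.1):
    """Poll condition() until it is truthy or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        # The condition is always checked at least once, even if timeout is 0.
        if condition():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Timed out after {}s waiting for {}".format(
                    timeout, getattr(condition, "__name__", repr(condition))
                )
            )
        time.sleep(poll_interval)


def load_page_bounded(driver, url, timeout=30.0):
    """Hypothetical helper: load url and report whether the document
    reached readyState "complete" within roughly timeout seconds."""
    deadline = time.monotonic() + timeout
    # Ask the driver itself to abort a navigation that takes too long.
    driver.set_page_load_timeout(timeout)
    try:
        driver.get(url)
    except Exception:
        # Simplification: real code would catch selenium's TimeoutException
        # or WebDriverException here rather than a bare Exception.
        return False

    def ready():
        return driver.execute_script("return document.readyState") == "complete"

    # Spend only what is left of the overall budget on polling.
    remaining = max(0.0, deadline - time.monotonic())
    try:
        return wait_for(ready, timeout=remaining)
    except Exception:
        # Covers both the polling timeout and driver errors while polling.
        return False
```

A `False` result only means the page was not seen to finish loading; the caller can still try to extract links from whatever has been rendered. Note that a polling loop like this can only time out between calls to the condition, so if an individual driver call blocks, no Python-side deadline will fire until that call returns.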

answered Oct 13 '22 11:10 by kenorb
