
Reliably detect page load or time out, Selenium 2

Tags:

python

selenium-webdriver

webdriver

I am writing a generic web-scraper using Selenium 2 (version 2.33 Python bindings, Firefox driver). It is supposed to take an arbitrary URL, load the page, and report all of the outbound links. Because the URL is arbitrary, I cannot make any assumptions whatsoever about the contents of the page, so the usual advice (wait for a specific element to be present) is inapplicable.

I have code which is supposed to poll document.readyState until it reaches "complete" or a 30s timeout has elapsed, and then proceed:

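from selenium.common.exceptions import WebDriverException
from selenium.webdriver.support.ui import WebDriverWait

def readystate_complete(d):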
    # AFAICT Selenium offers no better way to wait for the document to be loaded,
    # if one is in ignorance of its contents.
    return d.execute_script("return document.readyState") == "complete"

def load_page(driver, url):
    try:
        driver.get(url)
        WebDriverWait(driver, 30).until(readystate_complete)
    except WebDriverException:
        pass

    links = []
    try:
        for elt in driver.find_elements_by_xpath("//a[@href]"):
            try: links.append(elt.get_attribute("href"))
            except WebDriverException: pass
    except WebDriverException: pass
    return links

This sort-of works, but on about one page out of five, the .until call hangs forever. When this happens, usually the browser has not in fact finished loading the page (the "throbber" is still spinning) but tens of minutes can go by and the timeout does not trigger. But sometimes the page does appear to have loaded completely and the script still does not go on.

What gives? How do I make the timeout work reliably? Is there a better way to request a wait-for-page-to-load (if one cannot make any assumptions about the contents)?

Note: The obsessive catching-and-ignoring of WebDriverException has proven necessary to ensure that it extracts as many links from the page as possible, whether or not JavaScript inside the page is doing funny stuff with the DOM (e.g. I used to get "stale element" errors in the loop that extracts the HREF attributes).

NOTE: There are a lot of variations on this question both on this site and elsewhere, but they've all either got a subtle but critical difference that makes the answers (if any) useless to me, or I've tried the suggestions and they don't work. Please answer exactly the question I have asked.

734

asked Sep 10 '13 22:09

zwol

1 Answer

The "recommended" (though still ugly) solution is to use an explicit wait:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions

old_value = browser.find_element_by_id('thing-on-old-page').text
browser.find_element_by_link_text('my link').click()
WebDriverWait(browser, 3).until(
    expected_conditions.text_to_be_present_in_element(
        (By.ID, 'thing-on-new-page'),
        'expected new text'
    )
)

The naive attempt would be something like this:

import time


def wait_for(condition_function):
    start_time = time.time()
    while time.time() < start_time + 3:
        if condition_function():
            return True
        else:
            time.sleep(0.1)
    raise Exception(
        'Timeout waiting for {}'.format(condition_function.__name__)
    )


def click_through_to_new_page(link_text):
    browser.find_element_by_link_text(link_text).click()

    def page_has_loaded():
        page_state = browser.execute_script(
            'return document.readyState;'
        ) 
        return page_state == 'complete'

    wait_for(page_has_loaded)
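
A sketch of why the `readyState` check is called naive (this is not taken from the blog): right after `click()`, the driver may still be looking at the old document, which finished loading long ago and so already reports `complete`. `FakeBrowser` below is a hypothetical stand-in for a real driver, so no Selenium is needed to run it, and the `timeout` and `poll_interval` parameters on `wait_for` are additions for illustration.

```python
import time


def wait_for(condition_function, timeout=3.0, poll_interval=0.1):
    # Same polling loop as above, with the limits made configurable.
    start_time = time.time()
    while time.time() < start_time + timeout:
        if condition_function():
            return True
        time.sleep(poll_interval)
    raise Exception(
        'Timeout waiting for {}'.format(condition_function.__name__)
    )


class FakeBrowser(object):
    """Hypothetical driver stand-in whose navigation has not started yet."""

    def __init__(self):
        self.current_page = 'old page'

    def execute_script(self, script):
        # The old page finished loading earlier, so it reports 'complete'.
        return 'complete'


browser = FakeBrowser()


def page_has_loaded():
    state = browser.execute_script('return document.readyState;')
    return state == 'complete'


# The wait succeeds on the first poll although the browser never left
# the old page, which is the race the naive check cannot detect.
returned_early = wait_for(page_has_loaded)
print(returned_early, browser.current_page)
```

The check passes without the new page ever appearing, so a scraper built on it could go on to read links from the page it was trying to leave.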

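A better approach (credit to @ThomasMarks) is to wait until the clicked link goes stale:

from selenium.common.exceptions import StaleElementReferenceException


def click_through_to_new_page(link_text):
    link = browser.find_element_by_link_text(link_text)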
    link.click()

    def link_has_gone_stale():
        try:
            # poll the link with an arbitrary call
            link.find_elements_by_id('doesnt-matter') 
            return False
        except StaleElementReferenceException:
            return True

    wait_for(link_has_gone_stale)

The final example compares the id of the html element before and after the click (this one could be bulletproof):

class wait_for_page_load(object):

    def __init__(self, browser):
        self.browser = browser

    def __enter__(self):
        self.old_page = self.browser.find_element_by_tag_name('html')

    def page_has_loaded(self):
        new_page = self.browser.find_element_by_tag_name('html')
        return new_page.id != self.old_page.id

    def __exit__(self, *_):
        wait_for(self.page_has_loaded)

And now we can do:

with wait_for_page_load(browser):
    browser.find_element_by_link_text('my link').click()
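
To watch the id comparison work without a browser, here is a sketch using `contextlib.contextmanager`. It is an equivalent formulation for illustration, not the blog's code. `FakeBrowser` and `FakeElement` are hypothetical stand-ins, with `FakeElement.id` playing the role of the internal reference Selenium assigns to each located element; `click_link`, `timeout` and `poll_interval` are likewise made up for the example.

```python
import time
from contextlib import contextmanager


class FakeElement(object):
    def __init__(self, element_id):
        self.id = element_id


class FakeBrowser(object):
    """Hypothetical stand-in: the new page appears a few polls after a click."""

    def __init__(self):
        self.page_number = 0
        self.polls_until_new_page = None

    def click_link(self):
        self.polls_until_new_page = 3

    def find_element_by_tag_name(self, name):
        # Count down pending navigation; switch pages when it reaches zero.
        if self.polls_until_new_page is not None:
            self.polls_until_new_page -= 1
            if self.polls_until_new_page <= 0:
                self.page_number += 1
                self.polls_until_new_page = None
        return FakeElement('html-{}'.format(self.page_number))


@contextmanager
def wait_for_page_load(browser, timeout=3.0, poll_interval=0.01):
    # Remember the old page's root element before the body runs.
    old_page = browser.find_element_by_tag_name('html')
    yield
    # Poll until the root element's id differs, i.e. a new page is in place.
    deadline = time.time() + timeout
    while time.time() < deadline:
        new_page = browser.find_element_by_tag_name('html')
        if new_page.id != old_page.id:
            return
        time.sleep(poll_interval)
    raise Exception('Timeout waiting for page load')


browser = FakeBrowser()
with wait_for_page_load(browser):
    browser.click_link()
```

One difference from the class above: with `contextmanager`, an exception raised inside the `with` body skips the wait, whereas the class's `__exit__` always waits.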

The code samples above are from Harry's blog.
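
For the question's arbitrary-URL case, one further option not covered by the blog samples is to bound `driver.get()` itself, since by default that call blocks until the page has loaded and may be where the time is actually spent. The sketch below assumes the installed bindings and Firefox driver honor `set_page_load_timeout`, which is not guaranteed for every Selenium 2 release. `load_page_with_timeout` is a hypothetical name, and the `ImportError` fallback exists only so the sketch can be read and run without Selenium installed.

```python
try:
    from selenium.common.exceptions import WebDriverException
except ImportError:
    # Assumption: fallback so the sketch can be exercised without Selenium.
    class WebDriverException(Exception):
        pass


def load_page_with_timeout(driver, url, timeout=30):
    """Return the href of every <a href> on url, giving up on loading after timeout seconds."""
    links = []
    try:
        # Ask the driver to abort driver.get() once the timeout has passed.
        driver.set_page_load_timeout(timeout)
        driver.get(url)
    except WebDriverException:
        # TimeoutException is a WebDriverException; scrape whatever has loaded.
        pass

    try:
        elements = driver.find_elements_by_xpath("//a[@href]")
    except WebDriverException:
        return links

    for elt in elements:
        try:
            links.append(elt.get_attribute("href"))
        except WebDriverException:
            # For example a stale element; skip it and keep going.
            pass
    return links
```

If the timeout fires, the page may be only partially loaded, so the returned list can be incomplete.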

100

answered Oct 13 '22 11:10

kenorb
