
Splinter saves bodiless html

I am using the splinter 0.7.3 module with Python 2.7.2 on Linux to scrape a directory listing from a website, using the default Firefox browser.

This snippet iterates through the paginated listing by clicking the 'Next' link on each page:

    links = True
    i = 0
    while links:
        with open('html/register_%03d.html' % i, 'w') as f:
            f.write(browser.html.encode('utf-8'))
        links = browser.find_link_by_text('Next')
        print 'links:', links
        if links:
            links[0].click()
        i += 1

I know that the links are working, as I am seeing output that looks like this:

links: [<splinter.driver.webdriver.WebDriverElement object at 0x2e6da10>, <splinter.driver.webdriver.WebDriverElement object at 0x2e6d710>]
links: [<splinter.driver.webdriver.WebDriverElement object at 0x2e6d5d0>, <splinter.driver.webdriver.WebDriverElement object at 0x2e6d950>]
links: [<splinter.driver.webdriver.WebDriverElement object at 0x2e6d710>, <splinter.driver.webdriver.WebDriverElement object at 0x2e6dcd0>]
links: []

Saving the html at each page with f.write(browser.html.encode('utf-8')) works fine for the first page. On subsequent pages, although I can see the pages rendered in Firefox, the html/register_...html file is either empty or missing the body tag, like this:

<!DOCTYPE html>
<!--[if lt IE 7]>      <html prefix="og: http://ogp.me/ns#" class="no-js lt-ie9 lt-ie8 lt-ie7"  lang="en-gb"> <![endif]-->
<!--[if IE 7]>         <html prefix="og: http://ogp.me/ns#" class="no-js lt-ie9 lt-ie8"  lang="en-gb"> <![endif]-->
<!--[if IE 8]>         <html prefix="og: http://ogp.me/ns#" class="no-js lt-ie9"  lang="en-gb"> <![endif]-->
<!--[if gt IE 8]><!-->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb" class="no-js" prefix="og: http://ogp.me/ns#"><!--<![endif]--><head>
        <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" />    
    ...
  </style>
  <script src="/media/com_magebridge/js/frototype.min.js" type="text/javascript"></script></head></html>

Is this a known feature of saving html from splinter? Is there a better way to do it?

ChrisGuest asked Sep 17 '15 21:09



1 Answer

It really looks like a timing issue - you are getting the page source when the page is not fully loaded. There are several ways to tackle the issue:

  • wait for the body to be present:

    browser.is_element_present_by_tag("body", wait_time=5)
    
  • increase the page load timeout - put this right after you initialize the browser object:

    browser.driver.set_page_load_timeout(10)  # 10 seconds
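
To show where the first suggestion could sit in the question's loop, here is a minimal sketch. The function name `save_pages` and its `out_dir`, `wait_time` and `max_pages` parameters are made up for illustration and are not part of splinter. It only uses calls that already appear in this thread (`browser.html`, `browser.find_link_by_text`, `browser.is_element_present_by_tag`), and it opens the file with `io.open` so the text is encoded on write instead of calling `.encode('utf-8')` by hand.

```python
import io
import os


def save_pages(browser, out_dir, wait_time=5, max_pages=1000):
    """Save every page of a paginated listing and return the file paths.

    `browser` is expected to behave like a splinter Browser. The name and
    parameters of this helper are illustrative assumptions.
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)

    saved = []
    for i in range(max_pages):
        # Wait up to `wait_time` seconds for a <body> element before
        # reading the page source; give up loudly if it never appears.
        if not browser.is_element_present_by_tag('body', wait_time=wait_time):
            raise RuntimeError('no <body> found on page %d' % i)

        path = os.path.join(out_dir, 'register_%03d.html' % i)
        # io.open encodes the unicode page source as UTF-8 on write.
        with io.open(path, 'w', encoding='utf-8') as f:
            f.write(browser.html)
        saved.append(path)

        # Stop when there is no 'Next' link left, as in the original loop.
        links = browser.find_link_by_text('Next')
        if not links:
            break
        links[0].click()

    return saved
```

It would be called as `save_pages(browser, 'html')` after `browser.visit(...)`. One caveat: if the previous page is still attached for a moment after the click, the body check may pass too early; this sketch does not handle that case.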
    
alecxe answered Nov 04 '22 01:11
