
Scroll down to bottom of infinite page with PhantomJS in Python

I have succeeded in getting Python with Selenium and PhantomJS to keep loading more content on a dynamically loading, infinitely scrolling page, as in the example below. How can this be modified so that, instead of setting the number of reloads manually, the program stops when it reaches the very bottom?

    reloads = 100000  # set the number of times to reload
    pause = 0  # initial time interval between reloads
    driver = webdriver.PhantomJS()

    # Load Twitter page and click to view all results
    driver.get(url)
    driver.find_element_by_link_text("All").click()

    # Keep reloading and pausing to reach the bottom
    for _ in range(reloads):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)

    text_file.write(driver.page_source.encode("utf-8"))
    text_file.close()
Asked by DIGSUM, Mar 08 '15 15:03


1 Answer

After each scroll, you can check whether the page height actually changed, and stop when it did not.

    lastHeight = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        # Stop once scrolling no longer makes the page taller
        if newHeight == lastHeight:
            break
        lastHeight = newHeight

This uses a static wait, which is bad for two reasons: you wait unnecessarily when the content loads faster than the pause, and the script exits prematurely when the dynamic load is slower than the pause for some reason.
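One way to soften both problems, sketched here under assumptions: poll the page height at short intervals and give up only when it has not grown within a timeout. The function name `scroll_to_bottom` and the `timeout` and `poll` parameters are illustrative and not part of the original answer; the only driver call used is `execute_script`.

```python
import time


def scroll_to_bottom(driver, timeout=20.0, poll=0.5):
    """Scroll until the page height stops growing for `timeout` seconds.

    Hypothetical helper: relies only on driver.execute_script().
    Returns the number of times the page grew.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    growths = 0
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        deadline = time.monotonic() + timeout
        grew = False
        # Poll instead of sleeping for one fixed interval
        while time.monotonic() < deadline:
            new_height = driver.execute_script(get_height)
            if new_height > last_height:
                last_height = new_height
                growths += 1
                grew = True
                break
            time.sleep(poll)
        if not grew:
            # No growth within the timeout: assume the bottom was reached
            return growths
```

A fast load returns to scrolling after roughly one `poll` interval, while a slow load gets up to `timeout` seconds before the loop concludes that nothing more is coming.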

Since a page usually loads some more elements into a list, you can check the length of the list before the load and wait until the next element is loaded.

For Twitter, this could look like the following:

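    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait

    while True:
        # Count the items currently in the stream
        elemsCount = browser.execute_script(
            "return document.querySelectorAll('.stream-items > li.stream-item').length")

        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        try:
            # Wait up to 20 seconds for the next item to appear
            WebDriverWait(browser, 20).until(
                lambda x: x.find_element_by_xpath(
                    "//*[contains(@class,'stream-items')]/li[contains(@class,'stream-item')]["+str(elemsCount+1)+"]"))
        except TimeoutException:
            # No new item arrived in time: assume the bottom was reached
            break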

I used an XPath expression because PhantomJS 1.x has a bug that sometimes shows up when using :nth-child() CSS selectors.
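The same idea can also be written without `:nth-child()` or a concatenated XPath, by counting matches in JavaScript and polling until the count grows. This is only a sketch and not the answer's method: `wait_for_more_items`, `load_all_items`, their parameters, and the default timeouts are assumptions. It relies on `execute_script` passing extra arguments to the script as `arguments[0]`.

```python
import time

COUNT_JS = "return document.querySelectorAll(arguments[0]).length"


def wait_for_more_items(driver, selector, previous_count, timeout=20.0, poll=0.25):
    """Return True once more than `previous_count` elements match `selector`.

    Returns False if the count has not grown after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        if driver.execute_script(COUNT_JS, selector) > previous_count:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


def load_all_items(driver, selector, timeout=20.0):
    """Scroll until no new elements matching `selector` appear; return the final count."""
    while True:
        count = driver.execute_script(COUNT_JS, selector)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        if not wait_for_more_items(driver, selector, count, timeout):
            return count
```

For the Twitter markup used above, a call might look like `load_all_items(browser, ".stream-items > li.stream-item")`.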

Full version for reference.

Answered by Artjom B., Oct 23 '22 06:10