crawl site that has infinite scrolling using python

Tags:

I have been doing research and so far I found out the python package that I will plan on using its scrapy, now I am trying to find out what is a good way to build a scraper using scrapy to crawl site with infinite scrolling. After digging around I found out that there is a package call selenium and it has python module. I have a feeling someone has already done that using Scrapy and Selenium to scrape site with infinite scrolling. It would be great if someone can point towards to an example.

462

asked Mar 28 '14 00:03

add-semi-colons

4 Answers

You can use selenium to scrap the infinite scrolling website like twitter or facebook.

Step 1 : Install Selenium using pip

pip install selenium

Step 2 : use the code below to automate infinite scroll and extract the source code

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import sys

import unittest, time, re

class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"
        self.verificationErrors = []
        self.accept_next_alert = True
    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "/search?q=stackoverflow&src=typd")
        driver.find_element_by_link_text("All").click()
        for i in range(1,100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')


if __name__ == "__main__":
    unittest.main()

The for loop allows you to parse through the infinite scrolls and post which you can extract the loaded data.

Step 3 : Print the data if required.

191

answered Oct 22 '22 11:10

Pawan Kumar

This is short & simple code which is working for me:

SCROLL_PAUSE_TIME = 20

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

posts = driver.find_elements_by_class_name("post-text")

for block in posts:
    print(block.text)

answered Oct 22 '22 10:10

Sijin John

from selenium.webdriver.common.keys import Keys
import selenium.webdriver
driver = selenium.webdriver.Firefox()
driver.get("http://www.something.com")
lastElement = driver.find_elements_by_id("someId")[-1]
lastElement.send_keys(Keys.NULL)

This will open a page, find the bottom-most element with the given id and the scroll that element into view. You'll have to keep querying the driver to get the last element as the page loads more, and I've found this to be pretty slow as pages get large. The time is dominated by the call to driver.find_element_* because I don't know of a way to explicitly query the last element in the page.

Through experimentation you might find there is an upper limit to the amount of elements the page loads dynamically, and it would be best if you wrote something that loaded that number and only then made a call to driver.find_element_*.

answered Oct 22 '22 11:10

maxywb

For infinite scrolling data are requested to Ajax calls. Open web browser --> network_tab --> clear previous requests history by clicking icon like stop--> scroll the webpage--> now you can find the new request for scroll event--> open the request header --> you can find the URL of request ---> copy and paste URL in an seperare tab--> you can find the result of Ajax call --> just form the requested URL to get the data page until end of the page

answered Oct 22 '22 12:10

Sanjeev Ravi

Related questions
                            
                                How to unit test with a mocked file object in Python?
                            
                                Pandas Dataframe object types fillna exception over different datatypes
                            
                                Python os.path.relpath behavior
                            
                                Kivy how to rotate a picture
                            
                                How to convert a stat output to a unix permissions string
                            
                                Adjusting gridlines on a 3D Matplotlib figure
                            
                                Is there a python equivalent of R's qchisq function?
                            
                                remove x ticks but keep grid lines
                            
                                Getting name of subclass from superclass?
                            
                                Cutting of unused frequencies in specgram matplotlib
                            
                                Parameter validation, Best practices in Python
                            
                                parsing json fields in python
                            
                                Changing x-axis tick labels when working with subplots
                            
                                NameError: global name 'myExample2' is not defined # modules
                            
                                mpi4py Send/Recv with tag
                            
                                Is sqlite3 fetchall necessary?
                            
                                How to get the output from .jar execution in python codes?
                            
                                How to rotate a simple matplotlib Axes
                            
                                matplotlib: render into buffer / access pixel data
                            
                                python np.round() with decimal option larger than 2

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

crawl site that has infinite scrolling using python

Tags:

python

selenium

scrapy

web-crawler