Using Python with Selenium to scrape dynamic web pages

On the site, there are a few links at the top labeled 1, 2, 3, and next. Pressing a numbered link dynamically loads some data into a content div. Pressing next goes to a page with links labeled 4, 5, 6, and next, and shows the data for page 4.

I want to scrape the data from the content div for every numbered link. I don't know how many there are; the site shows only three at a time plus next.

Please give an example of how to do it. For instance, consider the site www.cnet.com.

Please show me how to download the series of pages using Selenium. I will parse them with Beautiful Soup on my own.

Koushik asked Dec 21 '22 04:12


1 Answer

General layout (not tested):

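#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox  # pip install selenium
from selenium.webdriver.common.by import By

url = "http://example.com"

def find_link(browser, text):
    """Return the first link with exactly this text, or None if there is none."""
    # find_elements returns an empty list instead of raising when nothing matches
    links = browser.find_elements(By.LINK_TEXT, text)
    return links[0] if links else None

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    n = 1
    while url:  # stop when there is no "next" link left
        browser.get(url)  # load page that has 1,2,3,next -like links
        link = find_link(browser, str(n))
        while link:
            browser.get(link.get_attribute("href"))  # get individual 1,2,3,4 pages
            #### save(browser.page_source)
            browser.back()  # return to page that has 1,2,3,next -like links
            n += 1
            # look the link up again: elements found before navigating are stale
            link = find_link(browser, str(n))

        link = find_link(browser, "next")
        url = link.get_attribute("href") if link else None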
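
The save(...) call in the layout above is only a placeholder. A minimal sketch of such a helper, assuming you want one HTML file per numbered page so that the files can be parsed later with Beautiful Soup. The extra page_number parameter, the "pages" directory, and the file naming scheme are illustrative assumptions, not part of the original layout:

```python
from pathlib import Path


def save(page_source, page_number, out_dir="pages"):
    """Write one page's HTML to <out_dir>/page_NNNN.html and return the path.

    out_dir and the file naming scheme are assumptions made for this sketch.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory on first use
    path = out / f"page_{page_number:04d}.html"
    path.write_text(page_source, encoding="utf-8")
    return path
```

Inside the loop it would be called as save(browser.page_source, n) before n is incremented, so each file name matches the label of the link that produced it.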
jfs answered Jan 06 '23 10:01