The site has a row of links at the top labeled 1, 2, 3, and next. Clicking a numbered link dynamically loads some data into a content div. Clicking next goes to a page with links labeled 4, 5, 6, and next, and shows the data for page 4.
I want to scrape the data from the content div for every numbered link (I don't know how many there are; the site only shows 3 at a time plus next).
Please give an example of how to do it. For instance, consider the site www.cnet.com.
Please show me how to download the series of pages using Selenium; I will parse them with Beautiful Soup on my own.
General layout (not tested):
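    #!/usr/bin/env python
    from contextlib import closing

    from selenium.webdriver import Firefox  # pip install selenium
    from selenium.webdriver.common.by import By


    def find_link(browser, text):
        """Return the first link with exactly this text, or None if absent."""
        # find_elements returns an empty list instead of raising
        # NoSuchElementException, so the loops below can test for None.
        links = browser.find_elements(By.LINK_TEXT, text)
        return links[0] if links else None


    url = "http://example.com"

    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        n = 1
        while n < 10:  # arbitrary cap; raise it as needed
            browser.get(url)  # load page
            link = find_link(browser, str(n))
            while link:
                browser.get(link.get_attribute("href"))  # get individual 1,2,3,4 pages
                #### save(browser.page_source)
                browser.back()  # return to page that has 1,2,3,next -like links
                n += 1
                link = find_link(browser, str(n))
            link = find_link(browser, "next")
            if not link:
                break
            url = link.get_attribute("href")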
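The commented-out save(browser.page_source) call in the layout is left as a placeholder. One possible way to fill it in, as a rough sketch only: write each page's HTML to its own file, then read the files back later for parsing. The names save, saved_pages, the extra page_number argument, and the "pages" directory are all assumptions made for illustration; they are not part of Selenium or Beautiful Soup.

```python
from pathlib import Path


def save(page_source, page_number, out_dir="pages"):
    """Write one page's HTML to <out_dir>/page_NNN.html and return the path.

    Hypothetical helper: in the loop it would be called as
    save(browser.page_source, n).
    """
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Zero-pad the number so that sorting by filename matches page order.
    path = directory / f"page_{page_number:03d}.html"
    path.write_text(page_source, encoding="utf-8")
    return path


def saved_pages(out_dir="pages"):
    """Yield (path, html) for every saved page, sorted by filename."""
    for path in sorted(Path(out_dir).glob("page_*.html")):
        yield path, path.read_text(encoding="utf-8")
```

Each html string from saved_pages() can then be passed to BeautifulSoup(html, "html.parser") to pull out the content div. Saving first keeps the browser session short and lets you rerun the parsing step without downloading the pages again.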