I have a Scrapy spider that crawls a site which reloads its content via JavaScript. To move to the next page to scrape, I have been using Selenium to click the month links at the top of the site.
The problem is that, even though my code steps through each link as expected, the spider only ever scrapes the first month's (Sept) data, once for each month, and returns that duplicate data.
How can I get around this?
import time

from selenium import webdriver
from scrapy.contrib.spiders.init import InitSpider  # Scrapy 0.x location
from scrapy.selector import HtmlXPathSelector

# Adjust to wherever your project's items module defines this class.
from gigsinscotland.items import GigsInScotlandMainItem


class GigsInScotlandMain(InitSpider):
    name = 'gigsinscotlandmain'
    allowed_domains = ["gigsinscotland.com"]
    start_urls = ["http://www.gigsinscotland.com"]

    def __init__(self):
        InitSpider.__init__(self)
        self.br = webdriver.Firefox()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        self.br.get(response.url)
        time.sleep(2.5)

        # Get the string for each month on the page.
        months = hxs.select("//ul[@id='gigsMonths']/li/a/text()").extract()
        for month in months:
            link = self.br.find_element_by_link_text(month)
            link.click()
            time.sleep(5)

            # Get all the divs containing info to be scraped.
            listitems = hxs.select("//div[@class='listItem']")
            for listitem in listitems:
                item = GigsInScotlandMainItem()
                item['artist'] = listitem.select("div[contains(@class, 'artistBlock')]/div[@class='artistdiv']/span[@class='artistname']/a/text()").extract()
                #
                # Get other data ...
                #
                yield item
Combining Selenium with Scrapy is straightforward: let Selenium render the page, and once it is done, pass the rendered page source into a Scrapy selector. From there, Scrapy can extract data from the page as usual.
The problem is that you are reusing the HtmlXPathSelector that was built from the initial response, so every iteration queries the same stale HTML. After each click, rebuild it from the Selenium browser's page_source:
...
for month in months:
    link = self.br.find_element_by_link_text(month)
    link.click()
    time.sleep(5)

    # Re-parse the browser's current HTML after the click.
    hxs = HtmlXPathSelector(text=self.br.page_source)

    # Get all the divs containing info to be scraped.
    listitems = hxs.select("//div[@class='listItem']")
    ...
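The underlying point is that a selector is a static snapshot of the HTML it was parsed from; it does not track the live browser. A stdlib-only sketch of why re-parsing is required, with `ElementTree` standing in for Scrapy's selector:

```python
from xml.etree import ElementTree as ET

# Two states of the page: before and after a month link is clicked.
sept_html = "<div><span class='month'>Sept</span></div>"
oct_html = "<div><span class='month'>Oct</span></div>"

# Parsing produces a snapshot of sept_html; later page
# changes in the browser are invisible to this tree.
tree = ET.fromstring(sept_html)
assert tree.find("span").text == "Sept"

# The browser has moved on to October, but the old tree still
# says Sept. Only re-parsing the new source picks up the change.
tree = ET.fromstring(oct_html)
assert tree.find("span").text == "Oct"
```

This is exactly why the fixed loop above rebuilds the selector from `self.br.page_source` on every iteration instead of querying the one built from the initial response.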