
Scraping with Scrapy and Selenium

I have a Scrapy spider that crawls a site which reloads its content via JavaScript. To move to the next page to scrape, I have been using Selenium to click each month link at the top of the site.

The problem is that, even though my code clicks through each link as expected, the spider scrapes only the first month's (Sept) data, once per month link, and returns that duplicated data.

How can I get around this?

import time

from scrapy.contrib.spiders.init import InitSpider
from scrapy.selector import HtmlXPathSelector
from selenium import webdriver

# GigsInScotlandMainItem is the project's own Item class (its import is not shown).


class GigsInScotlandMain(InitSpider):
    name = 'gigsinscotlandmain'
    allowed_domains = ["gigsinscotland.com"]
    start_urls = ["http://www.gigsinscotland.com"]

    def __init__(self):
        InitSpider.__init__(self)
        # One Firefox instance is shared by the whole crawl.
        self.br = webdriver.Firefox()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        self.br.get(response.url)
        time.sleep(2.5)
        # Get the string for each month on the page.
        months = hxs.select("//ul[@id='gigsMonths']/li/a/text()").extract()

        for month in months:
            link = self.br.find_element_by_link_text(month)
            link.click()
            time.sleep(5)

            # Get all the divs containing info to be scraped.
            listitems = hxs.select("//div[@class='listItem']")
            for listitem in listitems:
                item = GigsInScotlandMainItem()
                item['artist'] = listitem.select("div[contains(@class, 'artistBlock')]/div[@class='artistdiv']/span[@class='artistname']/a/text()").extract()
                #
                # Get other data ...
                #
                yield item
asked Sep 16 '13 19:09 by puffin



1 Answer

The problem is that you are reusing the HtmlXPathSelector that was built from the initial response, so it never sees the content that Selenium loads after each click. Rebuild it from the Selenium browser's page_source after every click:

...
for month in months:
    link = self.br.find_element_by_link_text(month)
    link.click()
    time.sleep(5)

    # Build the selector from the browser's current markup via the text= keyword.
    hxs = HtmlXPathSelector(text=self.br.page_source)

    # Get all the divs containing info to be scraped.
    listitems = hxs.select("//div[@class='listItem']")
...
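
The effect can be reproduced without Scrapy or Selenium. The following is a minimal sketch using only the standard library, not the spider itself: FakeBrowser, gigs, and scrape are hypothetical names invented for illustration, and xml.etree.ElementTree stands in for the Scrapy selector. It shows that a selector built from an old snapshot keeps returning the first month's data, while re-parsing the browser's current source after each click returns the new data.

```python
import xml.etree.ElementTree as ET


class FakeBrowser:
    """Hypothetical stand-in for a Selenium driver: a click swaps page_source."""

    PAGES = {
        "Sept": "<div><p class='gig'>Sept gig</p></div>",
        "Oct": "<div><p class='gig'>Oct gig</p></div>",
    }

    def __init__(self):
        # The page initially shows the first month.
        self.page_source = self.PAGES["Sept"]

    def click(self, month):
        # Simulates JavaScript replacing the listing in the live DOM.
        self.page_source = self.PAGES[month]


def gigs(markup):
    # Parse the given markup snapshot and extract the gig names.
    root = ET.fromstring(markup)
    return [p.text for p in root.findall(".//p[@class='gig']")]


def scrape(browser, months, reparse):
    # Snapshot taken once before the loop, like the selector built from `response`.
    stale_markup = browser.page_source
    results = []
    for month in months:
        browser.click(month)
        # reparse=True reads the browser's current source after each click.
        markup = browser.page_source if reparse else stale_markup
        results.append(gigs(markup))
    return results


stale = scrape(FakeBrowser(), ["Sept", "Oct"], reparse=False)
fresh = scrape(FakeBrowser(), ["Sept", "Oct"], reparse=True)
print(stale)  # the same month repeated
print(fresh)  # one entry per month
```

In a real spider, the re-parse step corresponds to building a new selector from self.br.page_source inside the loop, as in the snippet above.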
answered Sep 25 '22 14:09 by alecxe