 

Selenium with Scrapy for dynamic page

I'm trying to scrape product information from a webpage using Scrapy. The webpage I want to scrape works like this:

  • it starts with a product_list page with 10 products
  • a click on the "next" button loads the next 10 products (the URL doesn't change between the two pages)
  • I use LinkExtractor to follow each product link into the product page and get all the information I need

I tried to replicate the next-button AJAX call, but couldn't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Where should I put the Selenium part in my Scrapy spider?

My spider is pretty standard, like the following:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.log import INFO

    class ProductSpider(CrawlSpider):
        name = "product_spider"
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/shanghai']
        rules = [
            Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'),
                 callback='parse_product'),
        ]

        def parse_product(self, response):
            self.log("parsing product %s" % response.url, level=INFO)
            hxs = HtmlXPathSelector(response)
            # actual data follows

Any idea is appreciated. Thank you!

Asked Jul 31 '13 by Z. Lin

People also ask

How do you scrape a dynamic website with Scrapy?

Scrapy calls the parse method to extract data from a site; to scrape effectively you need to understand the response's CSS and XPath selectors. A Request is a call for objects or data; a Response carries the answer to that Request.
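As a minimal illustration of that request/response cycle, here is a sketch using the quotes.toscrape.com sandbox on a recent Scrapy version (the site and selectors are illustrative, not taken from the question):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # Response: extract data with CSS or XPath selectors
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").get()}
            # Request: follow the "next" link to keep crawling
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)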

Can you use Scrapy and Selenium together?

Combining Selenium with Scrapy is fairly simple: let Selenium render the webpage, and once it is done, pass the page's source to create a Scrapy Selector object. From there, Scrapy can crawl the page as usual and extract data effectively.
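A minimal sketch of that hand-off (it assumes Firefox is available; the XPath reuses the one from the question's spider and may not match a real page):

    from selenium import webdriver
    from scrapy.selector import Selector

    driver = webdriver.Firefox()
    driver.get("http://example.com/shanghai")  # hypothetical URL from the question

    # let Selenium render the page, then hand the HTML to a Scrapy Selector
    sel = Selector(text=driver.page_source)
    product_links = sel.xpath('//div[@id="productList"]//dt/a/@href').extract()

    driver.quit()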

Is Scrapy better than Selenium?

Selenium is an excellent automation tool, and Scrapy is by far the most robust web scraping framework. In terms of speed and efficiency, Scrapy is the better choice for web scraping; for JavaScript-based websites where AJAX/PJAX requests are needed, Selenium can work better.


1 Answer

It really depends on how you need to scrape the site and what data you want to get.

Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:

    import scrapy
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException

    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            self.driver.get(response.url)

            while True:
                try:
                    # find and click the JavaScript-driven "next page" link
                    next_page = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                    next_page.click()

                    # get the data and write it to scrapy items
                except NoSuchElementException:
                    # no "next" link left: we are on the last page
                    break

            self.driver.close()
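The "get the data" spot can be filled by handing the rendered HTML back to Scrapy, just as in the pattern above. A hedged sketch, with a hypothetical item XPath that is not verified against eBay's markup:

    from scrapy.selector import Selector

    # inside the while loop, after next_page.click():
    sel = Selector(text=self.driver.page_source)
    for title in sel.xpath('//h3[@class="lvtitle"]/a/text()').extract():
        yield {'title': title}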

Here are some examples of "selenium spiders":

  • Executing Javascript Submit form functions using scrapy in python
  • https://gist.github.com/cheekybastard/4944914
  • https://gist.github.com/irfani/1045108
  • http://snipplr.com/view/66998/

There is also an alternative to using Selenium with Scrapy: in some cases, the ScrapyJS middleware is enough to handle the dynamic parts of a page. Sample real-world usage, with a minimal configuration sketch after the link:

  • Scraping dynamic content using python-Scrapy
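For orientation, here is a minimal configuration sketch based on the scrapyjs README of that era (the package later became scrapy-splash); it assumes a Splash rendering server is running at localhost:8050:

    # settings.py
    SPLASH_URL = 'http://localhost:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapyjs.SplashMiddleware': 725,
    }
    DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

    # in the spider: ask Splash to render the page before Scrapy parses it
    yield scrapy.Request(url, self.parse_product,
                         meta={'splash': {'args': {'wait': 0.5}}})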
Answered Oct 05 '22 by alecxe