I'm trying to scrape product information from a webpage using Scrapy. My to-be-scraped webpage looks like this:
I tried to replicate the next-button AJAX call but can't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Where should I put the Selenium part in my Scrapy spider?
My spider is pretty standard, like the following:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.log import INFO

    class ProductSpider(CrawlSpider):
        name = "product_spider"
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/shanghai']
        rules = [
            Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'),
                 callback='parse_product'),
        ]

        def parse_product(self, response):
            self.log("parsing product %s" % response.url, level=INFO)
            hxs = HtmlXPathSelector(response)
            # actual data follows
Any ideas are appreciated. Thank you!
In Scrapy, the parse method is the callback that extracts data from a site's responses. To scrape a site effectively, you need to understand the Response object and its CSS and XPath selectors. A Request is the object that asks for a page or resource; a Response is the answer returned for that Request.
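As a minimal illustration of that Request/Response cycle (the URL and selectors are placeholders, not from the question's actual site):

    import scrapy

    class DemoSpider(scrapy.Spider):
        name = "demo"
        # Scrapy issues a Request for each start URL
        start_urls = ['http://example.com/shanghai']

        def parse(self, response):
            # response is the Response answering that Request;
            # both CSS and XPath selectors work on it
            names_css = response.css('div#productList dt a::text').getall()
            names_xpath = response.xpath('//div[@id="productList"]//dt/a/text()').getall()
            yield {'css': names_css, 'xpath': names_xpath}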
Combining Selenium with Scrapy is a fairly simple process: let Selenium render the webpage, and once it is done, pass the page's source to a Scrapy Selector object. From there, Scrapy can parse the page as usual and extract data effectively.
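A minimal sketch of that pattern, assuming the question's XPath and a local Firefox driver (adjust both to your setup):

    from scrapy.selector import Selector
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('http://example.com/shanghai')

    # Hand the JavaScript-rendered source to a Scrapy Selector
    sel = Selector(text=driver.page_source)
    for dt in sel.xpath('//div[@id="productList"]//dl[@class="t2"]//dt'):
        print(dt.xpath('.//a/@href').get())

    driver.quit()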
Selenium is an excellent automation tool and Scrapy is by far the most robust web scraping framework. In terms of speed and efficiency, Scrapy is the better choice for plain scraping; for JavaScript-based websites where we need to make AJAX/PJAX requests, Selenium can work better.
It really depends on how you need to scrape the site and what data you want to get.
Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:
    import scrapy
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException

    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            self.driver.get(response.url)

            while True:
                try:
                    # raises NoSuchElementException on the last page,
                    # which ends the loop
                    next_btn = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                    next_btn.click()

                    # get the data and write it to scrapy items
                except NoSuchElementException:
                    break

            self.driver.close()
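One way to fill in the "get the data" placeholder is to reuse the pattern above and feed each rendered page into a Scrapy Selector. The CSS selectors below are purely illustrative, not eBay's real markup:

    from scrapy.selector import Selector

    def extract_items(driver):
        # Parse the driver's current, rendered page source
        sel = Selector(text=driver.page_source)
        for listing in sel.css('li.sresult'):
            yield {
                'title': listing.css('h3.lvtitle a::text').get(),
                'price': listing.css('li.lvprice span::text').get(),
            }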
There are plenty of other examples of "selenium spiders" like this one available online.
There is also an alternative to having to use Selenium with Scrapy: in some cases, the ScrapyJS middleware is enough to handle the dynamic parts of a page.
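As a rough sketch of what that looks like, ScrapyJS lives on as scrapy-splash; this assumes a Splash instance running at localhost:8050 and the scrapy-splash middleware enabled in settings.py, with placeholder URL and selectors:

    import scrapy
    from scrapy_splash import SplashRequest

    class JsProductSpider(scrapy.Spider):
        name = "js_product_spider"
        start_urls = ['http://example.com/shanghai']

        def start_requests(self):
            for url in self.start_urls:
                # Splash renders the page; wait gives its JavaScript time to run
                yield SplashRequest(url, self.parse, args={'wait': 2})

        def parse(self, response):
            # response.text is the rendered HTML, so normal selectors work
            for dt in response.xpath('//div[@id="productList"]//dt'):
                yield {'name': dt.xpath('string(.)').get()}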