I'm trying to scrape product information from a webpage using Scrapy. My to-be-scraped webpage looks like this:
I tried to replicate the next-button AJAX call but can't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Where should I put the Selenium part in my Scrapy spider?
My spider is pretty standard, like the following:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.log import INFO

    class ProductSpider(CrawlSpider):
        name = "product_spider"
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/shanghai']
        rules = [
            Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'),
                 callback='parse_product'),
        ]

        def parse_product(self, response):
            self.log("parsing product %s" % response.url, level=INFO)
            hxs = HtmlXPathSelector(response)
            # actual data follows
Any ideas are appreciated. Thank you!
In Scrapy, the parse method is the callback that extracts data from a site's responses. To scrape a site effectively, you need to understand the Response object and its CSS and XPath selectors. A Request is the object that asks for a page or resource; a Response is the answer returned for that Request.
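As a minimal illustration of that Request/Response cycle (the URL and selectors are placeholders, not from the question's actual site):

    import scrapy

    class DemoSpider(scrapy.Spider):
        name = "demo"
        # Scrapy issues a Request for each start URL
        start_urls = ['http://example.com/shanghai']

        def parse(self, response):
            # response is the Response answering that Request;
            # both CSS and XPath selectors work on it
            names_css = response.css('div#productList dt a::text').getall()
            names_xpath = response.xpath('//div[@id="productList"]//dt/a/text()').getall()
            yield {'css': names_css, 'xpath': names_xpath}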
Combining Selenium with Scrapy is a fairly simple process: let Selenium render the webpage, and once it is done, pass the page's source to a Scrapy Selector object. From there, Scrapy can parse the page as usual and extract data effectively.
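A minimal sketch of that pattern, assuming the question's XPath and a local Firefox driver (adjust both to your setup):

    from scrapy.selector import Selector
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('http://example.com/shanghai')

    # Hand the JavaScript-rendered source to a Scrapy Selector
    sel = Selector(text=driver.page_source)
    for dt in sel.xpath('//div[@id="productList"]//dl[@class="t2"]//dt'):
        print(dt.xpath('.//a/@href').get())

    driver.quit()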
Selenium is an excellent automation tool and Scrapy is by far the most robust web scraping framework. In terms of speed and efficiency, Scrapy is the better choice for plain scraping; for JavaScript-based websites where we need to make AJAX/PJAX requests, Selenium can work better.
It really depends on how you need to scrape the site and what data you want to get.
Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:
    import scrapy
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException

    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            self.driver.get(response.url)

            while True:
                try:
                    # raises NoSuchElementException on the last page,
                    # which ends the loop
                    next_btn = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                    next_btn.click()

                    # get the data and write it to scrapy items
                except NoSuchElementException:
                    break

            self.driver.close()
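One way to fill in the "get the data" placeholder is to reuse the pattern above and feed each rendered page into a Scrapy Selector. The CSS selectors below are purely illustrative, not eBay's real markup:

    from scrapy.selector import Selector

    def extract_items(driver):
        # Parse the driver's current, rendered page source
        sel = Selector(text=driver.page_source)
        for listing in sel.css('li.sresult'):
            yield {
                'title': listing.css('h3.lvtitle a::text').get(),
                'price': listing.css('li.lvprice span::text').get(),
            }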
There are plenty of other examples of "selenium spiders" like this one available online.
There is also an alternative to having to use Selenium with Scrapy: in some cases, the ScrapyJS middleware is enough to handle the dynamic parts of a page.
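As a rough sketch of what that looks like, ScrapyJS lives on as scrapy-splash; this assumes a Splash instance running at localhost:8050 and the scrapy-splash middleware enabled in settings.py, with placeholder URL and selectors:

    import scrapy
    from scrapy_splash import SplashRequest

    class JsProductSpider(scrapy.Spider):
        name = "js_product_spider"
        start_urls = ['http://example.com/shanghai']

        def start_requests(self):
            for url in self.start_urls:
                # Splash renders the page; wait gives its JavaScript time to run
                yield SplashRequest(url, self.parse, args={'wait': 2})

        def parse(self, response):
            # response.text is the rendered HTML, so normal selectors work
            for dt in response.xpath('//div[@id="productList"]//dt'):
                yield {'name': dt.xpath('string(.)').get()}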