Is it possible to create a spider which inherits the functionality from two base spiders, namely SitemapSpider and CrawlSpider?
I have been trying to scrape data from various sites and realized that not all sites list every page in their sitemap, hence the need for CrawlSpider. But CrawlSpider goes through a lot of junk pages and is overkill.
What I would like to do is something like this:
1. Start my spider, which is a subclass of SitemapSpider, and pass the responses whose URLs match a regex to parse_product to extract the useful information.
2. From the product page, follow the links matching the regex /reviews/ and send those responses to the parse_reviews function.
   Note: "/reviews/" type pages are not listed in the sitemap.
3. Extract information from the /reviews/ page.
CrawlSpider is basically for recursive crawls and scraping
-------ADDITIONAL DETAILS-------
The site in question is www.flipkart.com. The site lists a lot of products, each with its own detail page. Along with the details page, there is a corresponding "review" page for the product. The link to the review page is also available on the product details page.
Note: Review pages are not listed on the sitemap.
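To make the "follow the review link from the product page" step concrete, here is a minimal standard-library sketch of picking review links out of a product page. The HTML snippet, the URLs, and the ReviewLinkParser name are made up for illustration; in a real spider Scrapy's own link extraction would do this job.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: review URLs contain a "/product-reviews/" path segment.
REVIEW_RE = re.compile(r'/product-reviews/')


class ReviewLinkParser(HTMLParser):
    """Hypothetical helper that collects absolute URLs of review links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.review_links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and REVIEW_RE.search(href):
            # Resolve relative links against the product page URL.
            self.review_links.append(urljoin(self.base_url, href))


# Made-up product page fragment with one review link and one unrelated link.
html = '''
<a href="/some-phone/product-reviews/itm123">Reviews</a>
<a href="/cart">Cart</a>
'''
parser = ReviewLinkParser('http://www.flipkart.com/some-phone/p/itm123')
parser.feed(html)
review_links = parser.review_links
```

Only the first link is kept, because it is the only one whose href matches the review pattern.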
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule, SitemapSpider

class WebCrawler(SitemapSpider, CrawlSpider):
    name = "flipkart"
    allowed_domains = ['flipkart.com']
    sitemap_urls = ['http://www.flipkart.com/robots.txt']
    # sitemap_rules accepts plain regex strings
    sitemap_rules = [(r'/(.*?)/p/(.*?)', 'parse_product')]
    start_urls = ['http://www.flipkart.com/']
    rules = [
        Rule(LinkExtractor(allow=[r'/(.*?)/product-reviews/(.*?)']), 'parse_reviews'),
        Rule(LinkExtractor(restrict_xpaths='//div[@class="fk-navigation fk-text-center tmargin10"]'), follow=True),
    ]

    def parse_product(self, response):
        loader = FlipkartItemLoader(response=response)
        loader.add_value('pid', 'value of pid')
        loader.add_xpath('name', 'xpath to name')
        yield loader.load_item()

    def parse_reviews(self, response):
        loader = ReviewItemLoader(response=response)
        loader.add_value('pid', 'value of pid')
        loader.add_xpath('review_title', 'xpath to review title')
        loader.add_xpath('review_text', 'xpath to review text')
        yield loader.load_item()
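As a quick sanity check of the two patterns in the spider, the sketch below routes a few URLs to callback names using only the re module. The sample URLs and the classify helper are hypothetical, and it assumes the patterns are applied with a search anywhere in the URL.

```python
import re

# Patterns copied from the spider; the URLs below are made-up examples.
PRODUCT_RE = re.compile(r'/(.*?)/p/(.*?)')
REVIEW_RE = re.compile(r'/(.*?)/product-reviews/(.*?)')


def classify(url):
    """Return the name of the callback whose pattern matches the URL, or None."""
    if REVIEW_RE.search(url):
        return 'parse_reviews'
    if PRODUCT_RE.search(url):
        return 'parse_product'
    return None


urls = [
    'http://www.flipkart.com/some-phone/p/itm123',
    'http://www.flipkart.com/some-phone/product-reviews/itm123',
    'http://www.flipkart.com/help',
]
# Map each sample URL to the callback that would handle it.
results = {url: classify(url) for url in urls}
```

Running the patterns against a handful of real URLs from the sitemap this way is a cheap way to catch a rule that matches too much or too little before starting a crawl.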
You are on the right track. The only thing left is that, at the end of your parse_product function, you have to yield all the requests extracted by the crawler, like so:
def parse_product(self, response):
    loader = FlipkartItemLoader(response=response)
    loader.add_value('pid', 'value of pid')
    loader.add_xpath('name', 'xpath to name')
    yield loader.load_item()

    # CrawlSpider defines parse() (at least in the Scrapy version this answer targets);
    # it applies the rules to the response and returns the resulting requests.
    yield from self.parse(response)
If your Python version doesn't support the yield from syntax (it was added in Python 3.3), just use:
for req in self.parse(response):
    yield req
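For a callback like this one, the two forms produce the same sequence of results. Here is a small sketch with plain generators that shows it; follow_links is a made-up stand-in for the rule handling, and the page dictionary is invented data, not a Scrapy response.

```python
def follow_links(page):
    """Stand-in for the rule handling: yields one fake request per link."""
    for link in page['links']:
        yield 'Request(' + link + ')'


def parse_with_yield_from(page):
    # First the scraped item, then everything the inner generator produces.
    yield {'name': page['name']}
    yield from follow_links(page)


def parse_with_loop(page):
    # Same behaviour, written for interpreters without "yield from".
    yield {'name': page['name']}
    for req in follow_links(page):
        yield req


# Invented page data with two outgoing links.
page = {'name': 'some-phone', 'links': ['/a/product-reviews/1', '/a/product-reviews/2']}
from_yield = list(parse_with_yield_from(page))
from_loop = list(parse_with_loop(page))
```

Both lists contain the item followed by the two fake requests, in the same order.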