I am creating a CrawlSpider to scrape a product website. From page 1 I extract category URLs of the form www.domain.com/color (simplified). On a category page I follow the first link to a product detail page, parse it, and crawl on to the next product via a Next link. Each color category therefore has a unique crawl path.
The difficulty is that the color variable is not on the product detail page. I can extract it from the category page by parsing the link as follows:
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join

def parse_item(self, response):
    l = XPathItemLoader(item=Greenhouse(), response=response)
    l.default_output_processor = Join()
    # the color is the last segment of the category URL, e.g. www.domain.com/color
    l.add_value('color', response.url.split("/")[-1])
    return l.load_item()
However, I want to add this color value to the items parsed from the product detail pages that are crawled starting from a particular color category page. The product URLs are reached by following Next links, so the referring category page is lost after the first link. The Scrapy docs mention request.meta, which can pass data between parse callbacks, but I'm not sure it applies here. Any help would be appreciated.
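As far as I understand the docs, the pattern looks roughly like this (a minimal sketch with made-up callback names and a placeholder URL, not my actual spider):

from scrapy.http import Request

def parse_category(self, response):
    # hypothetical callback: attach the color to the outgoing request
    color = response.url.split("/")[-1]
    yield Request("http://www.domain.com/some-product",  # placeholder URL
                  callback=self.parse_detail,
                  meta={'color': color})

def parse_detail(self, response):
    # the value travels with the request and comes back on the response
    color = response.meta['color']

But with a CrawlSpider the requests are generated by the rules rather than by my own callbacks, which is why I'm not sure how to hook this in.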
My rules are:
rules = (
    # 1. follow the color category links from the facet list
    Rule(SgmlLinkExtractor(restrict_xpaths=('//table[@id="ctl18_ctlFacetList_dlFacetList"]/tr[2]/td',))),
    # 2. follow the first product link on a category page and parse it
    Rule(SgmlLinkExtractor(restrict_xpaths=('//table[@id="ctl18_dlProductList"]/tr[1]/td[@class="ProductListItem"][1]',)),
         callback='parse_item', follow=True),
    # 3. follow the Next link from one product detail page to the next
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="ctl18_ctl00_lbNext"]',)),
         callback='parse_item', follow=True),
)
You can use the process_request argument of your rules:
class MySpider(CrawlSpider):
    ...
    rules = [
        ...
        Rule(SgmlLinkExtractor(), process_request='add_color'),
    ]

    def add_color(self, request):
        # attach the color (last URL segment) to the request's meta so it
        # is available as response.meta['color'] in the callback
        meta = dict(color=request.url.split("/")[-1])
        return request.replace(meta=meta)
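The value then comes back on the response in your callback, so your parse_item above would become something like this (a sketch, not tested):

def parse_item(self, response):
    l = XPathItemLoader(item=Greenhouse(), response=response)
    l.default_output_processor = Join()
    # the color set in add_color travels with the request
    l.add_value('color', response.meta['color'])
    return l.load_item()

One thing to watch: requests generated from the Next link don't have the color in their URL, so splitting request.url there won't give you the color; for those you'd need to carry the value forward from the earlier request rather than re-deriving it from the URL.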