How to pass a url value to all subsequent items in the Scrapy crawl?

I am creating a CrawlSpider to scrape a product website. From page 1, I extract category urls of the form www.domain.com/color (simplified). On the category page, I follow the first link to a product detail page, parse the product detail page and crawl to the next one via a Next link. Each color category therefore has a unique crawl path.

The difficulty is that the color variable is not on the product detail page. I can extract it from the category page by parsing the link as follows:

def parse_item(self, response):
    l = XPathItemLoader(item=Greenhouse(), response=response)
    l.default_output_processor = Join()
    # The last URL segment of the category page is the color.
    l.add_value('color', response.url.split("/")[-1])
    return l.load_item()
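
As a side note, splitting on "/" returns an empty string when the URL ends with a slash, and keeps any query string attached to the last segment. A minimal sketch of a more forgiving extraction, assuming the color is always the last path segment (the helper name `color_from_url` and the example URLs are made up for illustration):

```python
from urllib.parse import urlsplit


def color_from_url(url):
    """Return the last path segment of a URL, ignoring query string and trailing slash."""
    # urlsplit separates the path from the query string and fragment.
    path = urlsplit(url).path.rstrip("/")
    # rsplit with maxsplit=1 keeps only the final segment.
    return path.rsplit("/", 1)[-1]


print(color_from_url("http://www.domain.com/red"))
print(color_from_url("http://www.domain.com/blue/?page=2"))
```

In the loader this would be used as l.add_value('color', color_from_url(response.url)).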

However, I want to add this color value to the items parsed from the product detail pages that are crawled starting from a particular color category page. The product URLs are crawled by following Next links, so the referring category page is lost after the first link. There is something in the Scrapy docs about request.meta, which can pass data between parsers, but I'm not sure this applies here. Any help would be appreciated.

My rules are:

Rule(SgmlLinkExtractor(restrict_xpaths=('//table[@id="ctl18_ctlFacetList_dlFacetList"]/tr[2]/td',))),
Rule(SgmlLinkExtractor(restrict_xpaths=('//table[@id="ctl18_dlProductList"]/tr[1]/td[@class="ProductListItem"][1]',)), callback='parse_item', follow=True),
Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="ctl18_ctl00_lbNext"]',)), callback='parse_item', follow=True),
Dan Walker asked Nov 12 '22 07:11



1 Answer

You can use the process_request argument of your rules:

class MySpider(CrawlSpider):
    ...
    rules = [...
        Rule(SgmlLinkExtractor(), process_request='add_color'),
    ]

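    def add_color(self, request):
        # Update the existing meta rather than replacing it: CrawlSpider
        # stores its own keys (such as 'rule') in request.meta.
        request.meta['color'] = request.url.split("/")[-1]
        return request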
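
The hook above reads the color from the request's own URL, which only gives the right value for the category link. For the product and Next links, the color would have to be copied from the page that produced the request. One possible way to do that is sketched below. It assumes Scrapy 2.0 or later, where the process_request hook also receives the response as a second argument. The name `inherit_color`, the URLs, and the SimpleNamespace objects are made up for illustration; the SimpleNamespace objects only stand in for Scrapy's Request and Response so the chain can be traced without running a crawl.

```python
from types import SimpleNamespace


def add_color(request, response=None):
    """Hook for the category rule: take the color from the category URL."""
    request.meta["color"] = request.url.split("/")[-1]
    return request


def inherit_color(request, response):
    """Hook for the product and Next rules: copy the color from the parent page."""
    color = response.meta.get("color")
    if color is not None:
        request.meta["color"] = color
    return request


# Stand-ins for Scrapy objects (assumption: in Scrapy, response.meta is the
# meta dict of the request that produced the response).
category = add_color(SimpleNamespace(url="http://www.domain.com/red", meta={"rule": 0}))
category_page = SimpleNamespace(meta=category.meta)

product = inherit_color(
    SimpleNamespace(url="http://www.domain.com/p/1", meta={"rule": 1}), category_page
)
product_page = SimpleNamespace(meta=product.meta)

next_product = inherit_color(
    SimpleNamespace(url="http://www.domain.com/p/2", meta={"rule": 2}), product_page
)
print(next_product.meta["color"])
```

In a spider, these would be methods (with self as the first parameter) named in the process_request argument of the matching rules, and parse_item could then read the value with response.meta['color'] instead of parsing response.url.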
Steven Almeroth answered Nov 14 '22 23:11
