I am creating a CrawlSpider to scrape a product website. From page 1 I extract category URLs of the form www.domain.com/color (simplified). On a category page I follow the first link to a product detail page, parse it, and crawl on to the next product via a Next link. Each color category therefore has a unique crawl path.
The difficulty is that the color variable is not on the product detail page. I can extract it from the category page by parsing the link as follows:
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join

def parse_item(self, response):
    l = XPathItemLoader(item=Greenhouse(), response=response)
    l.default_output_processor = Join()
    # the color is the last segment of the category URL, e.g. www.domain.com/color
    l.add_value('color', response.url.split("/")[-1])
    return l.load_item()
However, I want to add this color value to the items parsed from the product detail pages that are crawled starting from a particular color category page. The product URLs are reached by following Next links, so the referring category page is lost after the first link. The Scrapy docs mention request.meta, which can pass data between parse callbacks, but I'm not sure it applies here. Any help would be appreciated.
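As far as I understand the docs, the pattern looks roughly like this (a minimal sketch with made-up callback names and a placeholder URL, not my actual spider):

from scrapy.http import Request

def parse_category(self, response):
    # hypothetical callback: attach the color to the outgoing request
    color = response.url.split("/")[-1]
    yield Request("http://www.domain.com/some-product",  # placeholder URL
                  callback=self.parse_detail,
                  meta={'color': color})

def parse_detail(self, response):
    # the value travels with the request and comes back on the response
    color = response.meta['color']

But with a CrawlSpider the requests are generated by the rules rather than by my own callbacks, which is why I'm not sure how to hook this in.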
My rules are:
rules = (
    # 1. follow the color category links from the facet list
    Rule(SgmlLinkExtractor(restrict_xpaths=('//table[@id="ctl18_ctlFacetList_dlFacetList"]/tr[2]/td',))),
    # 2. follow the first product link on a category page and parse it
    Rule(SgmlLinkExtractor(restrict_xpaths=('//table[@id="ctl18_dlProductList"]/tr[1]/td[@class="ProductListItem"][1]',)),
         callback='parse_item', follow=True),
    # 3. follow the Next link from one product detail page to the next
    Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="ctl18_ctl00_lbNext"]',)),
         callback='parse_item', follow=True),
)
You can use the process_request argument of your rules:
class MySpider(CrawlSpider):
    ...
    rules = [
        ...
        Rule(SgmlLinkExtractor(), process_request='add_color'),
    ]

    def add_color(self, request):
        # attach the color (last URL segment) to the request's meta so it
        # is available as response.meta['color'] in the callback
        meta = dict(color=request.url.split("/")[-1])
        return request.replace(meta=meta)
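The value then comes back on the response in your callback, so your parse_item above would become something like this (a sketch, not tested):

def parse_item(self, response):
    l = XPathItemLoader(item=Greenhouse(), response=response)
    l.default_output_processor = Join()
    # the color set in add_color travels with the request
    l.add_value('color', response.meta['color'])
    return l.load_item()

One thing to watch: requests generated from the Next link don't have the color in their URL, so splitting request.url there won't give you the color; for those you'd need to carry the value forward from the earlier request rather than re-deriving it from the URL.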