 

Scrapy rules not working when process_request and callback parameter are set

I have this rule for a Scrapy CrawlSpider:

rules = [
    Rule(LinkExtractor(
            allow=r'/topic/\d+/organize$',
            restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'
         ),
         process_request='request_tagPage', callback='parse_tagPage', follow=True)
]

request_tagPage() refers to a function that adds a cookie to the requests, and parse_tagPage() refers to a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage to make the requests and, once responses are returned, call parse_tagPage() to parse them. However, I realized that when request_tagPage() is used, the spider doesn't call parse_tagPage() at all. So in the actual code, I manually set parse_tagPage() as the callback inside request_tagPage(), like this:

def request_tagPage(self, request):
    # attach the cookie to the request, otherwise I can't log in,
    # and manually add the callback function
    return Request(request.url,
                   meta={"cookiejar": 1},
                   headers=self.headers,
                   callback=self.parse_tagPage)

It worked, but now the spider doesn't use the rules to expand its crawling: it closes after crawling the links from start_urls. However, before I manually set parse_tagPage() as the callback inside request_tagPage(), the rules worked. So is this a bug? Is there a way to combine request_tagPage() (which I need in order to attach the cookie to the request), parse_tagPage() (which parses a page) and the rules (which direct the spider's crawl)?

asked Mar 12 '23 by Skywalker326


1 Answer

Requests generated by CrawlSpider rules use internal callbacks and rely on meta to do their "magic".
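
To make that concrete, here is a rough, simplified sketch of the idea (it is not the actual Scrapy source, which changes between versions, and the helper name is invented for illustration): a request built from a Rule carries an internal callback and a meta entry recording which rule produced it, so a brand-new Request created in process_request loses both.

# Simplified illustration only -- not the real CrawlSpider code,
# but it shows why replacing the request wholesale breaks the rules.
from scrapy.http import Request

def _build_rule_request(spider, rule_index, link):
    return Request(
        url=link.url,
        # internal callback that later dispatches the response to the
        # rule's own callback (e.g. parse_tagPage) and handles follow=True
        callback=spider._response_downloaded,
        # lets CrawlSpider look up the originating rule when the response arrives
        meta={"rule": rule_index, "link_text": link.text},
    )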

I suggest that you don't recreate Requests from scratch in your rules' process_request hooks (or you'll probably end up reimplementing what CrawlSpider already does for you).

Instead, if you just want to add cookies and some special headers, you can use the .replace() method on the request that is passed to request_tagPage(), so that the "magic" of CrawlSpider is preserved.

Something like this should be enough:

def request_tagPage(self, request):
    # keep the rule-generated request (and its internal callback/meta),
    # only swapping in the custom headers and adding the cookiejar key
    tagged = request.replace(headers=self.headers)
    tagged.meta.update(cookiejar=1)
    return tagged
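
For reference, here is a sketch of how that hook slots into the spider from the question (the start URL and headers below are placeholders, not values from the question). Because the original request object is preserved, the Rule's callback="parse_tagPage" and follow=True keep working. Note that newer Scrapy versions also pass the response to process_request as a second argument.

# Sketch only -- start_urls and headers are placeholders.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TopicSpider(CrawlSpider):
    name = "topics"
    start_urls = ["https://example.com/topics"]        # placeholder
    headers = {"User-Agent": "Mozilla/5.0"}            # placeholder

    rules = [
        Rule(
            LinkExtractor(
                allow=r"/topic/\d+/organize$",
                restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]',
            ),
            process_request="request_tagPage",
            callback="parse_tagPage",
            follow=True,
        ),
    ]

    def request_tagPage(self, request):
        # modify the rule-generated request instead of building a new one,
        # so CrawlSpider's internal callback and meta survive
        tagged = request.replace(headers=self.headers)
        tagged.meta.update(cookiejar=1)
        return tagged

    def parse_tagPage(self, response):
        # placeholder: the Rule's callback now fires as expected
        self.logger.info("parsed %s", response.url)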
answered Mar 19 '23 by paul trmbrth