I have this rule for a Scrapy CrawlSpider:
rules = [
    Rule(
        LinkExtractor(
            # raw string so that \d reaches the regex engine unchanged
            allow=r'/topic/\d+/organize$',
            restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]',
        ),
        process_request='request_tagPage',
        callback='parse_tagPage',
        follow=True,
    )
]
request_tagPage() is a function that adds a cookie to requests, and parse_tagPage() is a function that parses the target pages. According to the documentation, CrawlSpider should use request_tagPage to make requests and, once responses are returned, call parse_tagPage() to parse them. However, I realized that when request_tagPage() is used, the spider doesn't call parse_tagPage() at all. So in the actual code, I manually set parse_tagPage() as the callback in request_tagPage, like this:
def request_tagPage(self, request):
    # attach the cookie jar to the request, otherwise I can't log in
    return Request(request.url, meta={"cookiejar": 1},
                   headers=self.headers,
                   callback=self.parse_tagPage)  # manually add a callback function
It worked, but now the spider doesn't use the rules to expand its crawl. It closes after crawling the links from start_urls. However, before I manually set parse_tagPage() as the callback in request_tagPage(), the rules worked. So I am thinking this may be a bug? Is there a way to use all three together: request_tagPage(), which I need to attach the cookie to the request; parse_tagPage(), which is used to parse a page; and rules, which direct the spider's crawl?
Requests generated by CrawlSpider rules use internal callbacks and use meta to do their "magic".
I suggest that you don't recreate Requests from scratch in your rules' process_request hooks (or you'll probably end up reimplementing what CrawlSpider already does for you).
Instead, if you just want to add cookies and special headers, you can use the .replace() method on the request that is passed to request_tagPage, so that the "magic" of CrawlSpider is preserved.
Something like this should be enough:
def request_tagPage(self, request):
    tagged = request.replace(headers=self.headers)
    tagged.meta.update(cookiejar=1)
    return tagged