I am new to Scrapy and I was hoping someone could give me good example code of when process_links and process_request are most useful. I see that process_links is used to filter URLs, but I don't know how to code it.
Thank you.
You mean the process_links and process_request arguments of scrapy.spiders.Rule,
which is most commonly used in scrapy.spiders.CrawlSpider.
They do pretty much what their names say; in other words, they act as a sort of middleware between the time a link is extracted and the time the resulting request is processed/downloaded.
process_links
sits between the moment links are extracted and the moment they are turned into requests. It receives the list of Link objects extracted by the rule's LinkExtractor and must return a (possibly filtered or modified) iterable of links. There are pretty cool use cases for this; to name a few common ones:
example:
def process_links(self, links):
    for link in links:
        # 1 - skip all links that have "foo" in their text
        if 'foo' in link.text:
            continue
        # 2 - fix the url to avoid an unnecessary redirect
        link.url = link.url + '/'
        yield link
process_request
sits between the moment a request is created from a link and the moment it is downloaded. It is called with every request extracted by the rule and must return either a request or None (to drop the request). It shares some use cases with process_links,
but it can also do some other cool stuff, like:
example:
def process_req(self, req):
    # 1 - attach a custom header (here a cookie) to every request
    req = req.replace(headers={'Cookie': 'foobar'})
    # 2 - send requests to different callbacks depending on their url
    if 'foo' in req.url:
        return req.replace(callback=self.parse_foo)
    elif 'bar' in req.url:
        return req.replace(callback=self.parse_bar)
    return req
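For completeness, here is roughly how both callbacks get hooked up through a Rule in a CrawlSpider. This is just a minimal sketch: the spider name, domain and LinkExtractor pattern are made up, and parse_item / parse_foo / parse_bar are assumed to exist as ordinary callbacks.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'my_spider'                      # made-up name
    start_urls = ['http://example.com/']    # made-up start url

    rules = (
        Rule(
            LinkExtractor(allow=r'/items/'),   # made-up url pattern
            callback='parse_item',
            process_links='process_links',     # method from the first example
            process_request='process_req',     # method from the second example
        ),
    )

    def parse_item(self, response):
        pass  # regular parsing logic goes here

    # put process_links and process_req from the examples above here

Note that depending on your Scrapy version, process_request may also receive the response that originated the request as a second argument, so check the docs for the version you are using.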
You probably won't use them often, but these two can be really convenient and easy shortcuts on some occasions.