 

Example code for Scrapy process_links and process_request

Tags: python, scrapy

I am new to Scrapy and I was hoping someone could give me good example code showing when process_links and process_request are most useful. I see that process_links is used to filter URLs, but I don't know how to code it.

Thank you.

asked Jul 15 '16 by Arrow
1 Answer

You mean the process_links and process_request arguments of scrapy.spiders.Rule, which is most commonly used with scrapy.spiders.CrawlSpider.

They do pretty much what the names say: they act as a sort of middleware between the time a link is extracted and the time the resulting request is processed/downloaded.
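For context, both hooks are attached to the Rule itself, either as callables or as strings naming spider methods. Here is a minimal sketch of the wiring; the spider name, start URL, and allow pattern are placeholders, and the hook methods are the ones shown further below:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://example.com']

    rules = (
        # the strings below name spider methods defined on this class
        Rule(LinkExtractor(allow=r'/items/'),
             callback='parse_item',
             process_links='process_links',
             process_request='process_req'),
    )

    def parse_item(self, response):
        pass  # normal parsing of the followed pages goes here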

process_links sits between when a link is extracted and when it is turned into a request. There are pretty cool use cases for this; just to name a few common ones:

  1. Filter out some links you don't like.
  2. Do redirection manually to avoid bad requests.

example:

def process_links(self, links):
    for link in links:
        # 1. filter out some links you don't like
        if 'foo' in link.text:
            continue  # skip all links that have "foo" in their text
        # 2. fix the url manually to avoid an unnecessary redirect
        if not link.url.endswith('/'):
            link.url = link.url + '/'
        yield link
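Note that Scrapy calls process_links with the whole list of links extracted from a response, not one link at a time, which is why the method above takes links and loops over them; returning a generator works fine because the result is simply iterated over.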

process_request sits between the time a request is created and the time it is downloaded. It shares some use cases with process_links but can also do some other cool stuff like:

  1. Modify headers (e.g. cookies).
  2. Change details like the callback, depending on keywords in the URL.

example:

def process_req(self, req):
    # 1. modify headers (e.g. set a cookie)
    req = req.replace(headers={'Cookie': 'foobar'})
    # 2. change the callback depending on keywords in the url
    if 'foo' in req.url:
        return req.replace(callback=self.parse_foo)
    elif 'bar' in req.url:
        return req.replace(callback=self.parse_bar)
    return req
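One more thing worth knowing: process_request must return a Request or None, and returning None drops the request entirely, so this hook also works as a request-level filter. Note that newer Scrapy versions (2.0+) additionally pass the originating response as a second argument to this hook.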

You're probably not going to use them often, but these two can be really convenient and easy shortcuts on some occasions.

answered Sep 21 '22 by Granitosaurus