Scrapy: CrawlSpider Rules process_links vs process_request vs download middleware [duplicate]

This is less of a "how do I use these?" and more of "when/why would I use these?" type question.

EDIT: This question is a near duplicate of this question, which suggests using a downloader middleware to filter such requests. I've updated my question below to reflect that.

In the Scrapy CrawlSpider documentation, rules accept two callables, process_links and process_request (documentation quoted below for easier reference).

By default Scrapy filters duplicate URLs, but I'm looking to do additional filtering of requests because I get duplicates of pages that have multiple distinct URLs linking to them. Things like:

URL1 = "http://example.com/somePage.php?id=XYZ&otherParam=fluffyKittens"
URL2 = "http://example.com/somePage.php?id=XYZ&otherParam=scruffyPuppies"

However, these URLs share a common element in the query string - in the example above it is the id parameter.

I'm thinking it would make sense to use the process_links callable of my spider to filter out duplicate requests.
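To make that concrete, here is a rough sketch of the kind of filter I have in mind (the spider, the allow pattern and the dedupe_by_id method are all made up for illustration, and the imports assume a recent Scrapy release):

    from urllib.parse import urlparse, parse_qs

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class ExampleSpider(CrawlSpider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/"]

        rules = (
            Rule(
                LinkExtractor(allow=r"somePage\.php"),
                callback="parse_item",
                process_links="dedupe_by_id",  # string -> spider method of this name
                follow=True,
            ),
        )

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.seen_ids = set()

        def dedupe_by_id(self, links):
            # Called with each list of links the LinkExtractor produced;
            # keep only the first link seen for any given "id" query parameter.
            kept = []
            for link in links:
                page_id = parse_qs(urlparse(link.url).query).get("id", [None])[0]
                if page_id is None:
                    kept.append(link)
                elif page_id not in self.seen_ids:
                    self.seen_ids.add(page_id)
                    kept.append(link)
            return kept

        def parse_item(self, response):
            yield {"url": response.url}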

Questions:

  1. Is there some reason why process_request would be better suited to this task?
  2. If not, can you provide an example of when process_request would be more applicable?
  3. Is a downloader middleware more appropriate than either process_links or process_request? If so, can you provide an example of when process_links or process_request would be a better solution? (My rough understanding of the middleware approach is sketched below.)
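For reference, this is roughly the downloader-middleware approach the linked question suggests, as I understand it (the class name, module path and priority are placeholders):

    from urllib.parse import urlparse, parse_qs

    from scrapy.exceptions import IgnoreRequest


    class IdDedupeMiddleware:
        # Enabled via the DOWNLOADER_MIDDLEWARES setting, e.g.
        # DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.IdDedupeMiddleware": 543}

        def __init__(self):
            self.seen_ids = set()

        def process_request(self, request, spider):
            # Drop any request whose "id" query parameter has already been seen.
            page_id = parse_qs(urlparse(request.url).query).get("id", [None])[0]
            if page_id is not None and page_id in self.seen_ids:
                raise IgnoreRequest("duplicate id: %s" % page_id)
            if page_id is not None:
                self.seen_ids.add(page_id)
            return None  # None lets the request continue through the download chain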

Documentation quote:

process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.

process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).

asked Apr 16 '13 by CatShoes

1 Answer

  1. No, process_links is your better option here, as you are just filtering URLs; it saves the overhead of having to create the Request in process_request just to discard it.

  2. process_request is useful if you want to massage the Request a little before you send it off, say if you want to add a meta argument or perhaps add or remove headers (see the sketch after this list).

  3. You don't need any middleware in your case because the functionality you need is built directly into the Rule. If process_links were not built into the rules, then you would need to create your own middleware, along the lines of the sketch in the question.
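To illustrate point 2, here is a minimal sketch of a Rule whose process_request hook tags requests before they are scheduled. The spider and the extra header are hypothetical; also note that since Scrapy 2.0 this callable receives the originating response alongside the request, so the optional second argument below keeps the sketch compatible with both the older signature quoted in the question and newer releases:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class TaggingSpider(CrawlSpider):
        name = "tagging"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/"]

        rules = (
            Rule(
                LinkExtractor(allow=r"somePage\.php"),
                callback="parse_item",
                process_request="tag_request",
                follow=True,
            ),
        )

        def tag_request(self, request, response=None):
            # Massage the Request before it is scheduled: stash a value in
            # meta for the callback and set an extra (made-up) header.
            request.meta["source_rule"] = "somePage"
            request.headers["X-Example-Header"] = "fluffyKittens"
            return request

        def parse_item(self, response):
            yield {"url": response.url, "source_rule": response.meta.get("source_rule")}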

answered Nov 14 '22 by Steven Almeroth