This is less of a "how do I use these?" and more of "when/why would I use these?" type question.
EDIT: This question is a near duplicate of this question, which suggests using a Downloader Middleware to filter such requests. I've updated my question below to reflect that.
In the Scrapy CrawlSpider documentation, rules accept two callables, process_links and process_request (documentation quoted below for easier reference).
By default Scrapy filters duplicated URLs, but I'm looking to do additional filtering of requests because I get duplicates of pages that have multiple distinct URLs linking to them. For example:
URL1 = "http://example.com/somePage.php?id=XYZ&otherParam=fluffyKittens"
URL2 = "http://example.com/somePage.php?id=XYZ&otherParam=scruffyPuppies"
However, these URLs will share a common element in the query string - shown above, it is the id. I'm thinking it would make sense to use the process_links callable of my spider to filter out duplicate requests.
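For what I have in mind, a minimal sketch might look like the following. The method and attribute names (`process_links`, `seen_ids`) are my own; `seen_ids` is a set I would initialise on the spider, and `links` is the list of Link objects the rule's link extractor produces:

```python
from urllib.parse import urlparse, parse_qs

def process_links(self, links):
    """Keep only the first link seen for each distinct ``id`` query value.

    Sketch of a spider method; ``self.seen_ids`` is a hypothetical set
    created in the spider's __init__. Links without an ``id`` parameter
    are passed through untouched.
    """
    unique = []
    for link in links:
        params = parse_qs(urlparse(link.url).query)
        page_id = params.get("id", [None])[0]
        if page_id is None:
            unique.append(link)
        elif page_id not in self.seen_ids:
            self.seen_ids.add(page_id)
            unique.append(link)
    return unique
```

The idea is that both URL1 and URL2 above carry `id=XYZ`, so only the first of the two would survive the filter.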
Questions:

1. Is process_links the right place for this filtering, or would process_request be better suited to this task?
2. Are there situations where process_request would be more applicable?
3. Are there other use cases for process_links or process_request? If so, can you provide an example of when process_links or process_request would be a better solution?

Documentation quote:
process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
No, process_links is your better option here, as you are just filtering URLs, and it will save the overhead of having to create the Request in process_request just to discard it.
process_request is useful if you want to massage the Request a little before you send it off, say if you want to add a meta argument or perhaps add or remove headers.
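As a rough sketch of that kind of massaging (the meta key and header name here are illustrative, not anything Scrapy requires; in recent Scrapy versions the callable receives both the request and the response that produced it, while older versions pass only the request):

```python
def process_request(self, request, response):
    """Sketch of a Rule process_request hook that tweaks each request.

    Records which page the request came from in a hypothetical
    ``source_url`` meta key and sets a custom header if absent.
    Returning the request lets it proceed; returning None would drop it.
    """
    request.meta["source_url"] = response.url
    request.headers.setdefault("X-Requested-With", "my-crawler")
    return request
```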
You don't need any middleware in your case, because the functionality you need is built directly into the Rule. If process_links were not built into the rules, then you would need to create your own middleware.