I am writing a Scrapy spider that takes as input many URLs and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method.
Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which is forbidden by Scrapy. How can I circumvent this?
I have thought about catching these requests in a custom middleware that would turn them into spurious Response objects, which I could then convert into Item objects in the request callback, but any cleaner solution would be welcome.
You could use a downloader middleware to do this job.
In start_requests(), you should always make a request for every URL, for example:
def start_requests(self):
    for url in all_urls:
        yield scrapy.Request(url)
Then, write a downloader middleware that intercepts the requests you can classify up front:
from scrapy.http import Response

class DirectReturn:
    def process_request(self, request, spider):
        # direct_return_url_set is assumed to be defined elsewhere
        if request.url in direct_return_url_set:
            # Mark the request so the callback knows no real download happened,
            # then short-circuit the downloader with an empty Response.
            request.meta['direct_return_url'] = True
            return Response(request.url, request=request)
        # Returning None lets Scrapy download the request normally.
        return None
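For the middleware to run, it also needs to be enabled in settings.py; the module path and priority below are assumptions to adapt to your project:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DirectReturn': 543,
}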
Then, in your parse method, just check whether the key direct_return_url is in response.meta. If it is, generate an item, set its url field from response.url, and yield it.
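A minimal sketch of that callback; CategoryItem, classify_by_url and classify_by_content are hypothetical stand-ins for your own item class and classifiers:

def parse(self, response):
    if response.meta.get('direct_return_url'):
        # This Response was fabricated by DirectReturn, so classify from the URL alone.
        yield CategoryItem(url=response.url, category=classify_by_url(response.url))
        return
    # Normal case: classify using the downloaded page content.
    yield CategoryItem(url=response.url, category=classify_by_content(response))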
I think using a spider middleware and overriding its process_start_requests() method would be a good start.
In your middleware, you can loop over all the start requests and use conditional statements to deal with the different types of URLs.
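A minimal sketch of that idea. Note that process_start_requests() must yield Request objects, so the middleware cannot emit items itself; instead it tags the directly-classifiable requests (is_classifiable_from_url is a hypothetical predicate) so a downloader middleware like DirectReturn above can skip the download:

class StartRequestsClassifier:
    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            if is_classifiable_from_url(request.url):
                # Flag the request; the DirectReturn middleware will
                # short-circuit the actual download for it.
                request.meta['direct_return_url'] = True
            yield request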