I am using scrapy to download some articles from a website as well as images within the articles.
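Some info about the scenario:
- I have to set download_delay to avoid <403> errors from the main domain.
- Images are downloaded with scrapy.contrib.pipeline.images.ImagesPipeline.
- Image downloads are slowed down by the download_delay setting.

How can I speed up the image download while I still have to limit the download speed from the main domain?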
There is no public interface for it (so my answer can become invalid in future Scrapy versions), but you can check the implementation of the built-in AutoThrottle extension.
It is a bit complicated, but in Scrapy 1.0 the idea is the following: there is a Downloader which handles all downloads. To decide how many requests to send in parallel and which delays to use, the Downloader uses "slots". By changing slot attributes (delay, concurrency) you can change the Downloader's behaviour. By default, there is one slot per domain (or per IP address if the CONCURRENT_REQUESTS_PER_IP option is set). You can also route requests to any other slot by setting a custom request.meta['download_slot'].
Default values for delay and concurrency for all slots are set using Scrapy settings or spider attributes like download_delay. But you can adjust them at run time in a Scrapy extension - this is what AutoThrottle is doing. To use different download delays for different requests you need to change the delay attribute in the appropriate slots.
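To make the idea more concrete, here is a rough sketch of such an extension, not a ready-to-use implementation. It relies on the same non-public attributes AutoThrottle reads in Scrapy 1.0 (crawler.engine.downloader.slots and request.meta['download_slot']), so it may break in other versions. The class name PerSlotDelay and the SLOT_DELAYS setting are made up for illustration; they are not part of Scrapy.

```python
class PerSlotDelay:
    """Sketch: override the download delay of selected Downloader slots."""

    def __init__(self, crawler, delays):
        self.crawler = crawler
        # Maps a slot key (by default the request hostname) to a delay in seconds.
        self.delays = delays

    @classmethod
    def from_crawler(cls, crawler):
        # Imported lazily so the class can be read and exercised without Scrapy.
        from scrapy import signals

        # SLOT_DELAYS is a hypothetical setting, e.g. {"img.example.com": 0}.
        ext = cls(crawler, crawler.settings.getdict("SLOT_DELAYS"))
        crawler.signals.connect(
            ext.response_downloaded, signal=signals.response_downloaded
        )
        return ext

    def response_downloaded(self, response, request, spider):
        # The Downloader stores the slot key it used in request.meta.
        key = request.meta.get("download_slot")
        # Non-public API: slots is a dict of slot key -> slot object.
        slot = self.crawler.engine.downloader.slots.get(key)
        if slot is not None and key in self.delays:
            slot.delay = self.delays[key]
```

The extension would be enabled through the EXTENSIONS setting. Because a slot only exists once a request for it has been queued, the first request to each domain still uses the default delay; the override applies from the first downloaded response onwards.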
As your requests are sent to different domains the task is simplified - the slots are already different, you only need to find them and change their delay values. If you want to use different delays for different parts of a single website you'd have to set custom slots using request.meta['download_slot'].
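For the single-website case, a small helper could pick the slot from the URL. This is only a sketch under assumptions: the "/images/" path prefix, the ":images" slot suffix and the function name with_download_slot are invented for illustration, and the code is written for Python 3.

```python
from urllib.parse import urlparse


def with_download_slot(meta, url, image_prefix="/images/"):
    """Return a copy of meta that routes image URLs to their own slot."""
    parsed = urlparse(url)
    new_meta = dict(meta or {})
    if parsed.path.startswith(image_prefix):
        # Any string works as a slot key; this one is just easy to recognise.
        new_meta["download_slot"] = (parsed.hostname or "") + ":images"
    return new_meta


# Intended use inside a spider (requires Scrapy):
#   yield scrapy.Request(url, meta=with_download_slot({}, url))
```

A custom slot is created with the same default delay and concurrency as any other slot, so its delay attribute still has to be changed separately.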
Sorry, I won't provide a ready-to-use example, but I hope this helps. Feel free to ask more questions if it is unclear where to go from here.
Also, it could be the case that just enabling the AutoThrottle extension is all you need, and there is no need to write a custom extension - try it first.
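Trying AutoThrottle only takes a few lines in settings.py. The setting names below are Scrapy's own; the values are illustrative, not recommendations.

```python
# settings.py (sketch)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0   # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0    # upper bound used when latency is high
AUTOTHROTTLE_DEBUG = True        # log throttling stats for every response
DOWNLOAD_DELAY = 2.0             # AutoThrottle never goes below this delay
```

Note that DOWNLOAD_DELAY acts as a lower bound for every slot, so if fast image downloads are the goal it may need to be small, with AutoThrottle left to slow down the main domain.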
Another, simpler option is to create 2 spiders and set different download delays for them: the first downloads pages and extracts/stores links to images, the second downloads images.