I'm very confused about the differences and interactions between `DOWNLOAD_DELAY` and `CONCURRENT_REQUESTS_PER_DOMAIN` in Scrapy.

Does the download delay affect the maximum number of concurrent requests per domain? For example, if I set a delay of 10 seconds but allow 8 concurrent requests per domain, will those requests not be fired concurrently but staggered according to the download delay, or will they be fired concurrently with the downloading of the responses staggered? Is there any reason `DOWNLOAD_DELAY` isn't called `REQUEST_DELAY`?
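For reference, here is a minimal `settings.py` sketch of the configuration I'm asking about (the values are the hypothetical ones from this question, not recommendations):

```python
# settings.py -- minimal sketch of the configuration in question;
# the values are hypothetical, not recommendations.

DOWNLOAD_DELAY = 10                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # upper bound on in-flight requests per domain

# Note: RANDOMIZE_DOWNLOAD_DELAY defaults to True, so each actual wait
# is a uniform random value between 0.5x and 1.5x DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
```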
For example, what would the back-of-the-envelope throughput calculation be in the following scenario:
- `start_urls` holds 100 URLs for a given domain
- `CONCURRENT_REQUESTS_PER_DOMAIN = 8`
- `DOWNLOAD_DELAY = 3`

How long would it take the associated spider to work through the `start_urls` queue?
From the downloader source code:

```python
conc = self.ip_concurrency if self.ip_concurrency else self.domain_concurrency
conc, delay = _get_concurrency_delay(conc, spider, self.settings)
```
So it seems the behaviour would be the same as described for `CONCURRENT_REQUESTS_PER_IP` in the settings documentation, which says:

> This setting also affects `DOWNLOAD_DELAY`: if `CONCURRENT_REQUESTS_PER_IP` is non-zero, download delay is enforced per IP, not per domain.
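The practical consequence of that slot logic is that when a non-zero delay is configured, the downloader issues at most one request per delay tick, no matter how many transfer slots are free. Here is a small self-contained simulation (my own sketch, not Scrapy code) of that scheduling rule:

```python
import heapq

def simulate(n_urls, delay, concurrency, response_time):
    """Simulate per-slot scheduling: a request is issued only when
    (a) a transfer slot is free and (b) at least `delay` seconds have
    passed since the previous request was issued."""
    now, last_issued = 0.0, -delay   # allow the first request immediately
    in_flight = []                   # min-heap of response completion times
    issued, peak = 0, 0
    while issued < n_urls or in_flight:
        can_issue = issued < n_urls and len(in_flight) < concurrency
        if can_issue and now - last_issued >= delay:
            heapq.heappush(in_flight, now + response_time)
            last_issued = now
            issued += 1
            peak = max(peak, len(in_flight))
        else:
            # jump to the next event: a response finishing or the delay expiring
            events = [in_flight[0]] if in_flight else []
            if can_issue:
                events.append(last_issued + delay)
            now = max(min(events), now)
            while in_flight and in_flight[0] <= now:
                heapq.heappop(in_flight)
    return now, peak

# 100 URLs, 3 s delay, 8 allowed concurrent requests, 1 s responses:
total, peak = simulate(100, delay=3, concurrency=8, response_time=1.0)
print(total, peak)   # 298.0 1 -- one request every 3 s, never more than 1 in flight
```

Because the assumed 1 s response time is shorter than the 3 s delay, concurrency never rises above 1; the allowance of 8 only matters when responses take longer than the delay.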
So I don't think you'll achieve much concurrency with a large `DOWNLOAD_DELAY`. I've run crawlers on a slow network with AutoThrottle enabled, and there were never more than 2-3 concurrent requests at a time.
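To answer the back-of-the-envelope question directly: since the delay is enforced between consecutive requests to the same domain, `DOWNLOAD_DELAY = 3` dominates and `CONCURRENT_REQUESTS_PER_DOMAIN = 8` barely matters. A rough estimate, assuming each response arrives in under 3 seconds (the response time is my assumption, not part of the question):

```python
n_urls = 100
download_delay = 3        # seconds between requests to the same domain
avg_response_time = 1.0   # assumed; anything shorter than the delay is irrelevant

# One request is issued roughly every `download_delay` seconds, so:
total = (n_urls - 1) * download_delay + avg_response_time
print(f"~{total:.0f} s, i.e. about {total / 60:.0f} minutes")   # ~298 s, about 5 minutes
```

With `RANDOMIZE_DOWNLOAD_DELAY` at its default of `True`, each individual wait is a uniform random value between 0.5x and 1.5x the configured delay, but the expected total comes out the same: roughly 300 seconds for 100 URLs.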