Scrapy DOWNLOAD_DELAY vs. CONCURRENT_REQUESTS_PER_DOMAIN

I'm very confused about the differences and interactions between DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN in Scrapy.

Does the download delay affect the maximum number of concurrent requests per domain? For example, if I set a delay of 10 seconds but allow 8 concurrent requests per domain, will those requests be fired concurrently, or staggered according to the download delay? Or will they be fired concurrently, with only the downloading of the responses staggered? Is there any reason DOWNLOAD_DELAY isn't called REQUEST_DELAY?

For example, what would the back-of-the-envelope throughput calculation be in the following scenario:

  • start_urls holds 100 URLs for a given domain
  • CONCURRENT_REQUESTS_PER_DOMAIN = 8
  • DOWNLOAD_DELAY = 3
  • assume the server takes 2 seconds to generate a response
  • assume we don't generate any more URLs than what's already in start_urls

How long would it take the associated spider to process this queue?

asked Jan 02 '15 by yangmillstheory

1 Answer

From the downloader source code:

# prefer the per-IP limit when CONCURRENT_REQUESTS_PER_IP is non-zero,
# otherwise fall back to the per-domain limit
conc = self.ip_concurrency if self.ip_concurrency else self.domain_concurrency
# one helper resolves both the concurrency cap and the delay for a slot
conc, delay = _get_concurrency_delay(conc, spider, self.settings)
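In context, that resolved (conc, delay) pair is stored on a per-slot object. As a rough mental model (a simplified sketch, not the actual Scrapy source), the slot gates each new request like this:

import random
import time

class Slot:
    """Simplified model of a Scrapy downloader slot (illustrative only)."""

    def __init__(self, concurrency, delay, randomize_delay):
        self.concurrency = concurrency   # resolved cap (per IP or per domain)
        self.delay = delay               # resolved DOWNLOAD_DELAY
        self.randomize_delay = randomize_delay
        self.lastseen = 0.0              # when the last request was fired

    def download_delay(self):
        # With RANDOMIZE_DOWNLOAD_DELAY (on by default) the wait is drawn
        # uniformly from 0.5x to 1.5x of DOWNLOAD_DELAY.
        if self.randomize_delay:
            return random.uniform(0.5 * self.delay, 1.5 * self.delay)
        return self.delay

    def may_fire(self, in_flight):
        # A new request fires only if the concurrency cap has room AND enough
        # time has passed since the previous request was *fired*. The delay
        # staggers request starts; it does not stagger response downloads.
        return (in_flight < self.concurrency
                and time.time() - self.lastseen >= self.download_delay())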

So it seems the behaviour would be the same as that described in the CONCURRENT_REQUESTS_PER_IP documentation, which says:

This setting also affects DOWNLOAD_DELAY: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.
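For reference, here is a minimal settings.py sketch putting the relevant knobs side by side (the values are the question's, purely illustrative):

# settings.py -- illustrative values, not recommendations
DOWNLOAD_DELAY = 3                  # seconds to wait between requests to the same slot
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap; ignored if the per-IP cap is non-zero
CONCURRENT_REQUESTS_PER_IP = 0      # non-zero makes both the cap and the delay per-IP
RANDOMIZE_DOWNLOAD_DELAY = True     # default; waits 0.5x-1.5x of DOWNLOAD_DELAY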

So I don't think you'll achieve much concurrency with a large DOWNLOAD_DELAY. I've run crawlers on a slow network with AutoThrottle enabled, and there were never more than 2-3 concurrent requests at a time.
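As for the back-of-the-envelope throughput question: with a 3-second delay between request starts and a 2-second response time, each response comes back before the next request fires, so the 8-request concurrency cap never kicks in and the delay dominates. A quick sketch of that arithmetic (ignoring RANDOMIZE_DOWNLOAD_DELAY and scheduler overhead):

# Back-of-envelope for the question's scenario, assuming
# RANDOMIZE_DOWNLOAD_DELAY is off and scheduling overhead is zero.
n_urls = 100
delay = 3.0          # DOWNLOAD_DELAY, seconds between request starts
response_time = 2.0  # assumed server latency per request

# Request i starts at t = i * delay. Since response_time < delay, each
# response arrives before the next request fires, so the crawl is
# effectively serial despite CONCURRENT_REQUESTS_PER_DOMAIN = 8.
total = (n_urls - 1) * delay + response_time
print(f"~{total:.0f} s (~{total / 60:.1f} min)")   # ~299 s (~5.0 min)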

answered Sep 28 '22 by pad