I'm very confused about the differences and interactions between `DOWNLOAD_DELAY` and `CONCURRENT_REQUESTS_PER_DOMAIN` in Scrapy.

Does the download delay affect the maximum number of concurrent requests per domain? For example, if I set a delay of 10 seconds but allow 8 concurrent requests per domain, will those requests not be fired concurrently but staggered according to the download delay, or will they be fired concurrently with the downloading of the responses staggered? Is there any reason `DOWNLOAD_DELAY` isn't called `REQUEST_DELAY`?
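For reference, here is a minimal `settings.py` sketch of the configuration I'm asking about (the values are the hypothetical ones from this question, not recommendations):

```python
# settings.py -- minimal sketch of the configuration in question;
# the values are hypothetical, not recommendations.

DOWNLOAD_DELAY = 10                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # upper bound on in-flight requests per domain

# Note: RANDOMIZE_DOWNLOAD_DELAY defaults to True, so each actual wait
# is a uniform random value between 0.5x and 1.5x DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True
```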
For example, what would the back-of-the-envelope throughput calculation be in the following scenario:
- `start_urls` holds 100 URLs for a given domain
- `CONCURRENT_REQUESTS_PER_DOMAIN = 8`
- `DOWNLOAD_DELAY = 3`

How long would it take the associated spider to work through the `start_urls` queue?
From the downloader source code:

```python
conc = self.ip_concurrency if self.ip_concurrency else self.domain_concurrency
conc, delay = _get_concurrency_delay(conc, spider, self.settings)
```
So it seems the behaviour would be the same as described for `CONCURRENT_REQUESTS_PER_IP` in the settings documentation, which says:

> This setting also affects `DOWNLOAD_DELAY`: if `CONCURRENT_REQUESTS_PER_IP` is non-zero, download delay is enforced per IP, not per domain.
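The practical consequence of that slot logic is that when a non-zero delay is configured, the downloader issues at most one request per delay tick, no matter how many transfer slots are free. Here is a small self-contained simulation (my own sketch, not Scrapy code) of that scheduling rule:

```python
import heapq

def simulate(n_urls, delay, concurrency, response_time):
    """Simulate per-slot scheduling: a request is issued only when
    (a) a transfer slot is free and (b) at least `delay` seconds have
    passed since the previous request was issued."""
    now, last_issued = 0.0, -delay   # allow the first request immediately
    in_flight = []                   # min-heap of response completion times
    issued, peak = 0, 0
    while issued < n_urls or in_flight:
        can_issue = issued < n_urls and len(in_flight) < concurrency
        if can_issue and now - last_issued >= delay:
            heapq.heappush(in_flight, now + response_time)
            last_issued = now
            issued += 1
            peak = max(peak, len(in_flight))
        else:
            # jump to the next event: a response finishing or the delay expiring
            events = [in_flight[0]] if in_flight else []
            if can_issue:
                events.append(last_issued + delay)
            now = max(min(events), now)
            while in_flight and in_flight[0] <= now:
                heapq.heappop(in_flight)
    return now, peak

# 100 URLs, 3 s delay, 8 allowed concurrent requests, 1 s responses:
total, peak = simulate(100, delay=3, concurrency=8, response_time=1.0)
print(total, peak)   # 298.0 1 -- one request every 3 s, never more than 1 in flight
```

Because the assumed 1 s response time is shorter than the 3 s delay, concurrency never rises above 1; the allowance of 8 only matters when responses take longer than the delay.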
So I don't think you'll achieve much concurrency with a large `DOWNLOAD_DELAY`. I've run crawlers on a slow network with AutoThrottle enabled, and there were never more than 2-3 concurrent requests at a time.
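To answer the back-of-the-envelope question directly: since the delay is enforced between consecutive requests to the same domain, `DOWNLOAD_DELAY = 3` dominates and `CONCURRENT_REQUESTS_PER_DOMAIN = 8` barely matters. A rough estimate, assuming each response arrives in under 3 seconds (the response time is my assumption, not part of the question):

```python
n_urls = 100
download_delay = 3        # seconds between requests to the same domain
avg_response_time = 1.0   # assumed; anything shorter than the delay is irrelevant

# One request is issued roughly every `download_delay` seconds, so:
total = (n_urls - 1) * download_delay + avg_response_time
print(f"~{total:.0f} s, i.e. about {total / 60:.0f} minutes")   # ~298 s, about 5 minutes
```

With `RANDOMIZE_DOWNLOAD_DELAY` at its default of `True`, each individual wait is a uniform random value between 0.5x and 1.5x the configured delay, but the expected total comes out the same: roughly 300 seconds for 100 URLs.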