 

Scrapy - different download_delay for different domain

Tags: python, scrapy

I am using scrapy to download some articles from a website as well as images within the articles.

Some info about the scenario:

  • Articles come from the main domain (jandan.net)
  • Images within the articles come from other websites. (e.g. tankr.net)
  • The main domain has an access frequency limit, so I have to set download_delay to avoid 403 errors
  • Images are downloaded by scrapy.contrib.pipeline.images.ImagesPipeline
  • The image downloads seem to be throttled by the same download_delay setting

How can I speed up the image download while I have to limit the download speed from main domain?

葛明洋 asked Jan 08 '23 02:01


1 Answer

There is no public interface for this (so my answer may become invalid in future Scrapy versions), but you can check the implementation of the built-in AutoThrottle extension.

It is a bit complicated, but in Scrapy 1.0 the idea is the following: there is a Downloader which handles all downloads. To decide how many requests to send in parallel and which delays to use, the Downloader uses "slots". By changing slot attributes (delay, concurrency) you can change the Downloader's behaviour. By default there is one slot per domain (or per IP address, if the CONCURRENT_REQUESTS_PER_IP option is set). You can also route requests to any other slot by setting a custom request.meta['download_slot'].
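As a minimal sketch of the routing idea: image requests can be tagged with a dedicated slot name so they no longer share the main domain's delay. The slot name "images" is arbitrary (any string creates or reuses a slot with that key), and the host check below is an assumption taken from the question.

```python
# Sketch: choose a download slot per request. Hosts are the ones
# mentioned in the question; adapt the check to your own image hosts.
from urllib.parse import urlparse

def slot_meta(url):
    """Return a request meta dict that routes image-host URLs to a
    dedicated 'images' slot; everything else keeps the default
    per-domain slot."""
    host = urlparse(url).netloc
    if host.endswith("tankr.net"):          # image host from the question
        return {"download_slot": "images"}  # separate, unthrottled slot
    return {}                               # default slot (per domain)

# In a spider callback this would be used roughly as (hedged sketch):
#   yield scrapy.Request(img_url, meta=slot_meta(img_url))
```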

Default values for delay and concurrency for all slots are set using Scrapy settings or spider attributes like download_delay. But you can adjust them at run time in a Scrapy extension; this is what AutoThrottle does. To use different download delays for different requests you need to change the delay attribute of the appropriate slots.

As your requests are sent to different domains, the task is simplified: the slots are already different, so you only need to find them and change their delay values. If you wanted different delays for different parts of a single website, you would have to set custom slots using request.meta['download_slot'].
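To make the "find the slots and change their delay" step concrete, here is a minimal sketch. Downloader.slots is a private dict mapping slot key (usually the domain) to slot objects with a .delay attribute (Scrapy 1.0 internals, so this may break in later versions); the per-domain delay values and the _FakeSlot stand-in below are made up for illustration.

```python
# Desired per-slot delays (values are assumptions, not from the question).
PER_SLOT_DELAY = {
    "jandan.net": 5.0,  # rate-limited main site: stay polite
    "tankr.net": 0.0,   # image host: no delay needed
}

def tune_slot_delays(slots, delays=PER_SLOT_DELAY):
    """Mutate the delay of every slot whose key appears in `delays`."""
    for key, slot in slots.items():
        if key in delays:
            slot.delay = delays[key]

class _FakeSlot:
    """Stand-in for Scrapy's private Slot class, for demonstration only."""
    def __init__(self, delay):
        self.delay = delay

# Demo with fake slots; inside a real Scrapy extension you would call
# tune_slot_delays(crawler.engine.downloader.slots) from a signal
# handler (e.g. response_received) instead.
demo_slots = {"jandan.net": _FakeSlot(1.0), "tankr.net": _FakeSlot(1.0)}
tune_slot_delays(demo_slots)
```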

Sorry, I won't provide a ready-to-use example, but I hope this helps. Feel free to ask more questions if it is unclear where to go from here.

Also, it could be that just enabling the AutoThrottle extension is all you need, with no custom extension to write; try it first.
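Enabling AutoThrottle is just a few lines in settings.py. The setting names below are documented Scrapy settings; the delay values are placeholders you would tune for your target site.

```python
# settings.py fragment: enable the built-in AutoThrottle extension.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0   # initial delay (placeholder value)
AUTOTHROTTLE_MAX_DELAY = 60.0    # cap on the adjusted delay
AUTOTHROTTLE_DEBUG = True        # log throttling stats for every response
```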

Another, simpler option is to create two spiders with different download delays: the first downloads pages and extracts/stores the links to images, the second downloads the images.
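The two-spider split can be sketched as below. The class names, delay values, and the handoff of image URLs are all assumptions; real spiders would subclass scrapy.Spider and pass the collected URLs through a file, queue, or database between the two runs.

```python
# Sketch of the two-spider approach (real versions subclass scrapy.Spider).

class ArticleSpider:
    """First run: slow, polite crawl of the rate-limited main site.
    Its parse() would extract article content and store image URLs."""
    name = "articles"
    custom_settings = {"DOWNLOAD_DELAY": 5.0}  # respect jandan.net's limit

class ImageSpider:
    """Second run: fast download of the image URLs collected by the
    first spider, with no delay and higher concurrency."""
    name = "images"
    custom_settings = {"DOWNLOAD_DELAY": 0.0,
                       "CONCURRENT_REQUESTS": 16}
```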

Mikhail Korobov answered Jan 10 '23 15:01