
Scrapy concurrency strategy

What is better method of scaling Scrapy?

  1. By running one Scrapy process and increasing the CONCURRENT_REQUESTS internal Scrapy setting
  2. By running multiple Scrapy processes, but still focusing on increasing the internal setting
  3. By increasing the number of Scrapy processes while keeping the internal setting at some constant value

If option 3 is best, what software should be used to launch multiple Scrapy processes?

And what is the best way to distribute Scrapy across multiple servers?

Asked Jul 11 '14 by Gill Bates

People also ask

Is Scrapy multithreaded?

Scrapy does not support multi-threading because it is built on Twisted, an asynchronous networking framework: concurrency comes from non-blocking I/O in a single thread, not from multiple threads.

What is concurrent request in Scrapy?

In the context of Scrapy, this means sending out "concurrent" requests instead of sending them one by one. In other words, the Scrapy spider sends some number X of simultaneous requests to the web server at the same time.
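
As a minimal sketch, the relevant knobs live in a project's settings.py; the values below are illustrative, not recommendations:

    # settings.py -- illustrative values, tune for your target site
    CONCURRENT_REQUESTS = 32             # total simultaneous requests (Scrapy default: 16)
    CONCURRENT_REQUESTS_PER_DOMAIN = 16  # cap per domain (default: 8)
    CONCURRENT_REQUESTS_PER_IP = 0       # 0 disables the per-IP cap (default)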

Is Scrapy asynchronous?

Event-driven networking. Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it is implemented using non-blocking (aka asynchronous) code for concurrency.

What is download delay in Scrapy?

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. DOWNLOAD_DELAY = 0.25 # 250 ms of delay.
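
For gentler crawling, DOWNLOAD_DELAY is often combined with delay randomization or the AutoThrottle extension. A sketch of both (values are illustrative):

    # settings.py -- illustrative throttling configuration
    DOWNLOAD_DELAY = 0.25                  # base 250 ms delay between requests to the same site
    RANDOMIZE_DOWNLOAD_DELAY = True        # default; actual wait is 0.5x-1.5x of DOWNLOAD_DELAY
    AUTOTHROTTLE_ENABLED = True            # adjust the delay dynamically from server latency
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests to aim for per server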


1 Answer

Scrapyd is a great tool for managing Scrapy processes. But the best answer I can give is that it depends. First, you need to figure out where your bottleneck is.
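
As an illustration of how Scrapyd is driven, a job can be scheduled over its HTTP API once the daemon is running and the project has been deployed with scrapyd-deploy. The project and spider names here are placeholders:

    # schedule a crawl on a running Scrapyd instance (default port 6800);
    # "myproject" and "myspider" are placeholder names
    import requests

    resp = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "myproject", "spider": "myspider"},
    )
    print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}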

If it is CPU-intensive parsing, you should use multiple processes. Scrapy is able to handle thousands of requests in parallel through Twisted's implementation of the Reactor pattern. But it runs as a single process with no multi-threading, so it will only ever utilize a single core.
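
One simple way to spread CPU-bound parsing across cores, assuming the URL space can be partitioned up front, is just to launch one crawl process per core. The shard/shards spider arguments below are hypothetical; the spider would have to use them to pick its slice of the work:

    # launch one "scrapy crawl" per CPU core; -a passes arguments to the spider,
    # and the shard/shards arguments are a hypothetical partitioning scheme
    import os
    import subprocess

    cores = os.cpu_count() or 2
    procs = [
        subprocess.Popen(
            ["scrapy", "crawl", "myspider", "-a", f"shard={i}", "-a", f"shards={cores}"]
        )
        for i in range(cores)
    ]
    for p in procs:
        p.wait()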

If it is just the number of requests that is limiting speed, tweak concurrent requests. First, test your internet speed to see how much bandwidth you have. Then open the network panel in your system monitor, run your spider, and compare the bandwidth you use against your maximum. Increase your concurrent requests until you stop seeing performance gains. The stopping point may be set by the site's capacity (though usually only for small sites), the site's anti-scraping/DDoS protections (assuming you don't have proxies or VPNs), your bandwidth, or another chokepoint in the system.

The last thing to know is that, while requests are handled asynchronously, items are not. If you have a lot of text and write everything locally, the writes will block requests; you will see lulls in the system monitor's network panel. You can tweak your concurrent items and maybe get smoother network usage, but it will still take the same amount of time. If you are writing to a database, consider a delayed insert (e.g. MySQL's INSERT DELAYED), a queue that runs an executemany once a threshold is reached, or both. Here is a pipeline someone wrote to handle all db writes async.

The last chokepoint could be memory. I have run into this issue on an AWS micro instance, though on a laptop it probably isn't an issue. If you don't need them, consider disabling the cache, cookies, and dupefilter (of course, they can be very helpful). Concurrent items and requests also take up memory.
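
The linked pipeline isn't reproduced here, but a minimal sketch of the idea, using Twisted's adbapi connection pool so database writes never block the reactor, could look like this. The table, columns, and connection parameters are hypothetical:

    # pipelines.py -- sketch of non-blocking DB writes via twisted.enterprise.adbapi
    from twisted.enterprise import adbapi

    class AsyncMySQLPipeline:
        def open_spider(self, spider):
            # connection parameters are placeholders
            self.dbpool = adbapi.ConnectionPool(
                "MySQLdb", host="localhost", db="scraping",
                user="scrapy", passwd="secret", charset="utf8mb4",
            )

        def close_spider(self, spider):
            self.dbpool.close()

        def process_item(self, item, spider):
            # runInteraction executes _insert in a pool thread,
            # so the reactor (and request handling) is never blocked
            d = self.dbpool.runInteraction(self._insert, item)
            d.addErrback(lambda failure: spider.logger.error(failure))
            return item

        def _insert(self, cursor, item):
            # "pages", "url" and "body" are hypothetical schema names
            cursor.execute(
                "INSERT INTO pages (url, body) VALUES (%s, %s)",
                (item["url"], item["body"]),
            )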

Answered Oct 22 '22 by Will Madaus