
About scrapy's concurrency model

Tags:

scrapy

I plan to use Scrapy in a more distributed way, and I'm not sure whether the spiders, pipelines, downloader, scheduler, and engine are all hosted in separate processes or threads. Could anyone share some information about this? Can we change the process/thread count for each component? I know there are two settings, CONCURRENT_REQUESTS and CONCURRENT_ITEMS; do they determine the number of concurrent threads for the downloader and the pipelines? And if I want to deploy the spiders, pipelines, and downloader on different machines, do I need to serialize the items/requests/responses? Thanks very much for your help!

Thanks, Edward.

asked Oct 07 '22 by user1441208

1 Answer

Scrapy is single-threaded. It achieves concurrent network requests through the reactor pattern (a single-threaded event loop), implemented on top of the Twisted framework. So CONCURRENT_REQUESTS and CONCURRENT_ITEMS cap how many requests and items are in flight at once within that one thread; they do not control thread counts.
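To illustrate how a single thread can serve many requests concurrently, here is a minimal sketch of the same event-loop idea using Python's stdlib asyncio (Scrapy itself uses Twisted's reactor, but the concurrency model is analogous; the URLs and delay are made up for the demo):

```python
import asyncio
import threading

async def fetch(url, delay):
    # Simulate a network request that completes after `delay` seconds.
    # While one "request" is waiting, the event loop runs the others --
    # all on a single thread, like Twisted's reactor in Scrapy.
    await asyncio.sleep(delay)
    return (url, threading.current_thread().name)

async def main():
    urls = [f"http://example.com/{i}" for i in range(5)]
    # Launch all "requests" concurrently; no extra threads are created.
    return await asyncio.gather(*(fetch(u, 0.01) for u in urls))

results = asyncio.run(main())
print(results)
```

Every coroutine reports the same thread name, showing that the concurrency comes from interleaving I/O waits, not from threading.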

People wanting to distribute Scrapy usually layer a messaging framework on top of it. Some use Redis (e.g. the scrapy-redis project), others try RabbitMQ.
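And yes, to move items or requests between machines you have to serialize them to plain data first. A minimal sketch of that idea, using JSON and an in-memory deque as a stand-in for a Redis list (a real deployment would use redis-py's `lpush`/`brpop`, as scrapy-redis does):

```python
import json
from collections import deque

# Stand-in for a shared Redis list; replace with a real Redis
# connection when the producer and consumer are on different machines.
queue = deque()

def enqueue_item(item: dict) -> None:
    # Items must be reduced to plain serializable data (here JSON)
    # before they can cross a process or machine boundary.
    queue.appendleft(json.dumps(item))

def dequeue_item() -> dict:
    return json.loads(queue.pop())

enqueue_item({"url": "http://example.com", "title": "Example"})
item = dequeue_item()
print(item["title"])
```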

Also have a look at Scrapyd, a service for deploying and running Scrapy spiders, which you can install on several machines.
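As for tuning concurrency within a single crawler, these are the relevant knobs in `settings.py` (the values below are illustrative, not recommendations):

```python
# settings.py -- caps on in-flight work within Scrapy's single thread
CONCURRENT_REQUESTS = 32            # max requests in flight across the crawler
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap on concurrent requests
CONCURRENT_ITEMS = 100              # items processed in parallel per response
```

Again, these limit how many Deferreds the reactor keeps active at once; no setting here spawns extra threads or processes.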

answered Oct 12 '22 by escitalopram