I plan to use Scrapy in a more distributed way, and I'm not sure whether the spiders, pipelines, downloader, scheduler, and engine are all hosted in separate processes or threads. Could anyone share some information about this? Can we change the process/thread count for each component? I know there are two settings, CONCURRENT_REQUESTS and CONCURRENT_ITEMS; do they determine the number of concurrent threads for the downloader and the pipelines? And if I want to deploy the spiders, pipelines, and downloader on different machines, I need to serialize the items/requests/responses, right? Thanks very much for your help!
Thanks, Edward.
Scrapy is single-threaded. It achieves concurrent network requests through the reactor pattern, using the Twisted framework. Because of that, CONCURRENT_REQUESTS and CONCURRENT_ITEMS do not control thread counts: they cap how many requests the downloader keeps in flight and how many items the pipelines process in parallel, all within one event loop.
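For illustration, here's a minimal settings.py sketch tuning those knobs (the values are arbitrary examples, not recommendations):

```python
# settings.py -- concurrency knobs for a Scrapy project.
# These limit concurrent *asynchronous* operations inside one
# single-threaded Twisted reactor; no extra threads are created.

# Max requests the downloader keeps in flight at once (default: 16).
CONCURRENT_REQUESTS = 32

# Max items processed in parallel by the item pipelines,
# per response (default: 100).
CONCURRENT_ITEMS = 200

# Related per-domain cap, often the real bottleneck (default: 8).
CONCURRENT_REQUESTS_PER_DOMAIN = 16
```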
People who want to distribute Scrapy usually introduce a messaging layer between machines. Some use Redis (for example, the scrapy-redis project keeps the scheduler queue and duplicate filter in Redis), others use RabbitMQ. Either way, requests and items that cross machine boundaries have to be serialized, as you guessed.
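As a sketch of the idea (not the scrapy-redis implementation), here is a hypothetical item pipeline that serializes each scraped item to JSON and pushes it onto a Redis list, so a consumer on another machine can pop items off and post-process them. It assumes a Redis server at localhost:6379 and the redis-py package; the key name items_queue is made up for the example:

```python
import json

import redis


class RedisExportPipeline:
    """Serialize each item to JSON and push it onto a Redis list.

    A worker on another machine can consume the queue (e.g. with
    BLPOP) and run the heavy post-processing there.
    """

    def open_spider(self, spider):
        # Assumed connection details; adjust for your deployment.
        self.client = redis.Redis(host="localhost", port=6379)

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item subclasses; items must be
        # JSON-serializable to cross the process/machine boundary.
        self.client.rpush("items_queue", json.dumps(dict(item)))
        return item
```

You would enable it by adding it to ITEM_PIPELINES in settings.py.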
Also have a look at Scrapyd, which lets you deploy your projects to a server and schedule spider runs over a simple HTTP JSON API.
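For instance, once a project is deployed, scheduling a crawl is a single POST to Scrapyd's schedule.json endpoint. A minimal sketch (the project and spider names are placeholders, and it assumes Scrapyd is running locally on its default port 6800):

```python
import json
from urllib import parse, request

# Placeholder project/spider names; replace with your own.
data = parse.urlencode({"project": "myproject", "spider": "myspider"}).encode()
resp = request.urlopen("http://localhost:6800/schedule.json", data=data)
print(json.load(resp))  # e.g. {"status": "ok", "jobid": "..."}
```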