
Is Scrapy compatible with multiprocessing?

I have been using Selenium for my scraping, but I want to move all the code to Scrapy. The only thing I'm not sure about is that I'm using multiprocessing (the Python library) to speed up my process. I have researched a lot but I still don't quite get it. I found "Multiprocessing of Scrapy Spiders in Parallel Processes", but it doesn't help me: it says this can be done with Twisted, and I haven't found an example yet.

Other forums say that Scrapy can work with multiprocessing.

Last thing: does the CONCURRENT_REQUESTS option (in Scrapy's settings) have any connection with multiprocessing?

AngelLB asked Apr 06 '26 10:04



1 Answer

The recommended way to work with Scrapy is to NOT use multiprocessing inside the running spiders.

A better alternative is to launch several Scrapy jobs, each with its own separate share of the inputs.
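As a rough sketch of that idea (not the only way to do it), the inputs can be split into chunks and each chunk handed to its own `scrapy crawl` process. The spider name `myspider` and the spider argument `urls` are hypothetical here; the spider would have to read that argument and split it on commas itself. Only the `scrapy crawl <name> -a NAME=VALUE` command-line form is taken from Scrapy.

```python
import subprocess


def split_inputs(items, n_jobs):
    """Split items into at most n_jobs round-robin chunks, dropping empty ones."""
    chunks = [items[i::n_jobs] for i in range(n_jobs)]
    return [chunk for chunk in chunks if chunk]


def build_commands(spider_name, chunks):
    """Build one `scrapy crawl` command line per chunk of URLs."""
    return [
        # -a passes a spider argument; "urls" is a hypothetical argument name.
        ["scrapy", "crawl", spider_name, "-a", "urls=" + ",".join(chunk)]
        for chunk in chunks
    ]


def run_jobs(commands):
    """Start every job as a separate OS process, then wait for all of them."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]  # list of exit codes


# Example usage (requires Scrapy and a project containing "myspider"):
# urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
# exit_codes = run_jobs(build_commands("myspider", split_inputs(urls, 2)))
```

Each job is an ordinary, independent Scrapy run, so nothing inside the spider needs to know about the other processes.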

Scrapy jobs are already very fast, IMO. Of course you can always go faster by tuning settings such as CONCURRENT_REQUESTS (which you mentioned), CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc. But the speed comes mainly from Scrapy being asynchronous: it doesn't wait for a request to complete before moving on to the remaining work (scheduling more requests, parsing responses, etc.).
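To illustrate what "asynchronous" buys you in a single process, here is a small standalone sketch. It uses the standard library's `asyncio` rather than Twisted (which Scrapy is built on), and `fake_request` just sleeps instead of hitting the network, so treat it as an analogy and not as how Scrapy is implemented.

```python
import asyncio
import time


async def fake_request(i, limiter):
    # The semaphore caps how many "requests" are in flight at once.
    async with limiter:
        await asyncio.sleep(0.1)  # stands in for waiting on the network
        return i


async def crawl(n_requests, max_in_flight):
    limiter = asyncio.Semaphore(max_in_flight)
    tasks = [fake_request(i, limiter) for i in range(n_requests)]
    # gather() runs all coroutines concurrently and keeps results in order.
    return await asyncio.gather(*tasks)


def timed_crawl(n_requests, max_in_flight):
    start = time.perf_counter()
    results = asyncio.run(crawl(n_requests, max_in_flight))
    return results, time.perf_counter() - start
```

With 10 simulated requests and at most 5 in flight, `timed_crawl(10, 5)` finishes in roughly two waits instead of ten, all in one process and one thread.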

CONCURRENT_REQUESTS has no connection with multiprocessing. It is a cap on how many requests Scrapy will have in flight at the same time; a cap is needed precisely because Scrapy is asynchronous and doesn't wait for one request to finish before issuing the next.
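For reference, these settings live in the project's `settings.py`. The values below are purely illustrative assumptions, not recommendations; sensible numbers depend on the target site and your hardware.

```python
# settings.py (illustrative values only)

# Upper bound on requests in flight across the whole crawl.
CONCURRENT_REQUESTS = 32

# Upper bound on requests in flight to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.25
```

None of these start extra processes; they only shape how one asynchronous Scrapy process paces its requests.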

eLRuLL answered Apr 09 '26 00:04
