So I have been using selenium to make my scraping. BUT I want to change all the code to Scrapy. The only thing I'm no sure about is that I'm using multiprocessing (python library) to speed up my process. I have researched a lot but I quite don't get it. I have found: Multiprocessing of Scrapy Spiders in Parallel Processes but it doesn't help me because it says that it can be done with Twisted but I haven't found an example yet.
In other forums it says that Scrapy can work with multiprocessing.
Last thing, in scrapy the option CONCURRENT_REQUESTS (settings) has some connection with multiprocessing?
The recommended way for working with scrapy is to NOT use multiprocessing inside the running spiders.
The better alternative would be to invoke several scrapy jobs with the respective separated inputs.
Scrapy jobs themselves are very fast IMO, of course, you can always go faster, special settings as you mentioned CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc. But this is basically because scrapy is asynchronous, meaning it won't wait for the requests to be completed to schedule and continue working on the remaining tasks (scheduling more requests, parsing responses, etc.)
The CONCURRENT_REQUESTS doesn't have a connection with multiprocessing. It is mostly a way to "limit" the speed of how many requests could be scheduled, because of being asynchronous.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With