How to run multiple spiders in the same process in Scrapy

I'm a beginner in Python & Scrapy. I've just created a Scrapy project with multiple spiders, but when I run "scrapy crawl ..." it only runs the first spider.

How can I run all spiders in the same process?

Thanks in advance.

asked by elhoucine on Dec 10 '25

2 Answers

Every spider has a name defined in its file with name="yourspidername". When you call it using scrapy crawl yourspidername, it will crawl only that spider. To run another spider you have to issue the command again: scrapy crawl yourotherspidername.

The other way is to mention all the spiders in the same command, like scrapy crawl yourspidername,yourotherspidername,... (this method is not supported in newer versions of Scrapy).
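For reference, the name that scrapy crawl matches is a class attribute on each spider. A minimal sketch (the spider class, name, and URL here are placeholders, not from the asker's project):

import scrapy

class FirstSpider(scrapy.Spider):
    name = "yourspidername"  # the identifier that `scrapy crawl yourspidername` looks up
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # extract the page title as a simple example item
        yield {"title": response.css("title::text").get()}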

answered by Abhishek on Dec 11 '25

Everyone, even the docs, suggests using the internal API to author a "run script" that controls the start and stop of multiple spiders. However, this comes with a lot of caveats unless you get it absolutely right (feed exports not working, the Twisted reactor either not stopping or stopping too soon, etc.).
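For context, that docs-style run script looks roughly like the sketch below (the project module and spider classes are hypothetical placeholders); this is the approach the caveats above apply to:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# hypothetical imports -- substitute your own spider classes
from myproject.spiders.first import FirstSpider
from myproject.spiders.second import SecondSpider

process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)   # schedule each spider on the same Twisted reactor
process.crawl(SecondSpider)
process.start()              # blocks until every scheduled crawl finishes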

In my opinion, we have a known-working and supported scrapy crawl x command, so a much easier way to handle this is to use GNU Parallel to parallelize.

After installing it, to run one Scrapy spider per core from the shell, assuming you want to run all the spiders in your project:

scrapy list | parallel --line-buffer scrapy crawl

If you only have one core, you can play around with the --jobs argument to GNU Parallel. For example, the following will run two Scrapy jobs per core:

scrapy list | parallel --jobs 200% --line-buffer scrapy crawl

answered by Darian Moody on Dec 11 '25

