
Running dozens of Scrapy spiders in a controlled manner

I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (e.g. this one and this other one), but they all seem to use the same recommendation (from the Scrapy docs): set up a CrawlerProcess, add the spiders to it, and hit start().

When I tried this method with all 325 of my spiders, though, it eventually locked up and failed because it attempted to open too many file descriptors on the system running it. I've tried a few things that haven't worked.

What is the recommended way to run a large number of spiders with Scrapy?

Edited to add: I understand I can scale up to multiple machines and pay for services to help coordinate (e.g. ScrapingHub), but I'd prefer to run this on one machine using some sort of process pool + queue so that only a small fixed number of spiders are ever running at the same time.
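To illustrate what I mean by "process pool + queue", here is a rough sketch of that approach, launching each spider as its own scrapy crawl subprocess (the pool size of 4 is just a placeholder):

import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # placeholder: only this many spiders run at once

def run_spider(name):
    # each call blocks until that spider's subprocess exits
    return subprocess.run(["scrapy", "crawl", name]).returncode

# "scrapy list" prints one spider name per line
spider_names = subprocess.run(
    ["scrapy", "list"], capture_output=True, text=True
).stdout.split()

with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    results = list(pool.map(run_spider, spider_names))

print("all crawls finished, failures:", sum(1 for r in results if r != 0))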

asked Jan 04 '18 by magneticMonster

People also ask

How do you run multiple spiders in Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders simultaneously in one process. We create an instance of CrawlerProcess with the project settings, and create a Crawler instance for a spider if we want that spider to have custom settings.
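A minimal sketch of that approach, assuming the spiders are defined in a regular Scrapy project (the spider names here are placeholders):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# CrawlerProcess picks up the project settings; spider names are placeholders
process = CrawlerProcess(get_project_settings())
process.crawl("spider_one")   # queue each spider by name (or pass the class)
process.crawl("spider_two")
process.start()               # blocks until all queued crawls finish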

How do you execute a Scrapy spider?

You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl. Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor. The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess.
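A minimal run-from-a-script sketch, with a throwaway spider defined inline (the site and selectors are placeholders):

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    # throwaway example spider; URL and selectors are placeholders
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"text": text}

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(QuotesSpider)
process.start()  # the Twisted reactor runs here and blocks until the crawl is done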

What is spider in web scraping?

Spider is a smart point-and-click web scraping tool. With Spider, you can turn websites into organized data and download it as JSON or a spreadsheet. There's no coding experience or configuration time involved: simply open the Chrome extension and start clicking.

How do I run multiple spiders per process in Scrapy?

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through its internal API (CrawlerRunner), either simultaneously or sequentially by chaining the deferreds; a sketch of the simultaneous version follows.
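The simultaneous version looks roughly like this (a sketch along the lines of the Scrapy docs; MySpider1 and MySpider2 stand in for your own spider classes):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from myproject.spiders import MySpider1, MySpider2  # placeholder import for your spiders

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()                    # deferred that fires when both crawls finish
d.addBoth(lambda _: reactor.stop())  # then stop the reactor
reactor.run()                        # blocks until all crawls are done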

What is a Scrapy Spider?

This is the simplest spider, and the one from which every other spider must inherit (including spiders that come bundled with Scrapy, as well as spiders that you write yourself). It doesn’t provide any special functionality.

Can I run multiple spiders at the same time?

Yes. Using the internal API you can run several spiders in the same process, either simultaneously or sequentially by chaining the deferreds (a sketch of the sequential variant follows). Bear in mind that different spiders can set different values for the same setting, but when they run in the same process it may be impossible, by design or because of some limitations, to use these different values.
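The sequential variant chains the deferreds so each spider starts only after the previous one finishes (again a sketch; the spider classes are placeholders):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from myproject.spiders import MySpider1, MySpider2  # placeholder import for your spiders

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)  # wait for the first spider to finish
    yield runner.crawl(MySpider2)  # then start the second
    reactor.stop()

crawl()
reactor.run()  # blocks until crawl() stops the reactor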

What is the use of scrapyd?

Scrapyd is an application that allows us to deploy Scrapy spiders on a server and run them remotely using a JSON API. Scrapyd allows you to: run Scrapy jobs, pause and cancel Scrapy jobs, manage Scrapy project/spider versions, and access Scrapy logs remotely.
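For illustration, scheduling a job and checking job state through that JSON API can look like this (a sketch; the project and spider names are placeholders, and Scrapyd is assumed to be running on its default port 6800):

import requests

BASE = "http://localhost:6800"  # default Scrapyd address (assumption)

# schedule one crawl job
resp = requests.post(f"{BASE}/schedule.json",
                     data={"project": "myproject", "spider": "myspider"})
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# list pending/running/finished jobs for the project
jobs = requests.get(f"{BASE}/listjobs.json", params={"project": "myproject"}).json()
print(len(jobs["running"]), "running,", len(jobs["pending"]), "pending")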


1 Answer

The simplest way to do this is to run them all from the command line. For example:

$ scrapy list | xargs -P 4 -n 1 scrapy crawl

This will run all your spiders, with up to four running in parallel at any time. You can then have the script send a notification once this command has completed.
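One way to wire up the notification, sketched in Python (the webhook URL is a placeholder; swap in email, Slack, or whatever you use):

import subprocess
import requests

# run the same pipeline as above and wait for it to finish
result = subprocess.run("scrapy list | xargs -P 4 -n 1 scrapy crawl", shell=True)

# placeholder notification endpoint
requests.post("https://example.com/notify",
              json={"text": f"crawls finished, exit code {result.returncode}"})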

A more robust option is to use scrapyd. This comes with an API, a minimal web interface, etc. It will also queue the crawls and only run a certain (configurable) number at once. You can interact with it via the API to start your spiders and send notifications once they are all complete.
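As a sketch of that workflow (the project name, poll interval, and notification are placeholders, and Scrapyd is assumed to be on its default port), you could schedule every spider and then poll until nothing is pending or running:

import time
import requests

BASE = "http://localhost:6800"  # default Scrapyd address (assumption)
PROJECT = "myproject"           # placeholder project name

# schedule every spider in the project
spiders = requests.get(f"{BASE}/listspiders.json",
                       params={"project": PROJECT}).json()["spiders"]
for name in spiders:
    requests.post(f"{BASE}/schedule.json", data={"project": PROJECT, "spider": name})

# poll until all jobs have left the pending/running queues
while True:
    jobs = requests.get(f"{BASE}/listjobs.json", params={"project": PROJECT}).json()
    if not jobs["pending"] and not jobs["running"]:
        break
    time.sleep(30)

print("all scrapyd jobs finished")  # send your real notification here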

Scrapy Cloud is a perfect fit for this [disclaimer: I work for Scrapinghub]. It lets you run only a certain number at once, gives you a queue of pending jobs (which you can modify, browse online, prioritize, etc.), and has a more complete API than scrapyd.

You shouldn't run all your spiders in a single process. It will probably be slower, can introduce unforeseen bugs, and you may hit resource limits (like you did). If you run them separately using any of the options above, just run enough to max out your hardware resources (usually CPU/network). If you still get problems with file descriptors at that point you should increase the limit.

answered Oct 21 '22 by Shane Evans