
Running dozens of Scrapy spiders in a controlled manner

I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (e.g. this one and this other one), but they all seem to use the same recommendation (from the Scrapy docs): set up a CrawlerProcess, add the spiders to it, and hit start().

When I tried this method with all 325 of my spiders, though, it eventually locked up and failed because it attempted to open too many file descriptors on the system running it. I've tried a few things that haven't worked.

What is the recommended way to run a large number of spiders with Scrapy?

Edited to add: I understand I can scale up to multiple machines and pay for services to help coordinate (e.g. ScrapingHub), but I'd prefer to run this on one machine using some sort of process pool + queue so that only a small fixed number of spiders are ever running at the same time.
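To illustrate what I mean by "process pool + queue", here is a rough sketch of that approach, launching each spider as its own scrapy crawl subprocess (the pool size of 4 is just a placeholder):

import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # placeholder: only this many spiders run at once

def run_spider(name):
    # each call blocks until that spider's subprocess exits
    return subprocess.run(["scrapy", "crawl", name]).returncode

# "scrapy list" prints one spider name per line
spider_names = subprocess.run(
    ["scrapy", "list"], capture_output=True, text=True
).stdout.split()

with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    results = list(pool.map(run_spider, spider_names))

print("all crawls finished, failures:", sum(1 for r in results if r != 0))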

asked Jan 04 '18 by magneticMonster

People also ask

How do you run multiple spiders in Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders simultaneously in one process. We create an instance of CrawlerProcess with the project settings, and create a Crawler instance for a spider if we want that spider to have custom settings.
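A minimal sketch of that approach, assuming the spiders are defined in a regular Scrapy project (the spider names here are placeholders):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# CrawlerProcess picks up the project settings; spider names are placeholders
process = CrawlerProcess(get_project_settings())
process.crawl("spider_one")   # queue each spider by name (or pass the class)
process.crawl("spider_two")
process.start()               # blocks until all queued crawls finish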

How do you execute a Scrapy spider?

You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl. Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor. The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess.
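A minimal run-from-a-script sketch, with a throwaway spider defined inline (the site and selectors are placeholders):

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    # throwaway example spider; URL and selectors are placeholders
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"text": text}

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(QuotesSpider)
process.start()  # the Twisted reactor runs here and blocks until the crawl is done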

What is spider in web scraping?

Spider is a smart point-and-click web scraping tool. With Spider, you can turn websites into organized data and download it as JSON or a spreadsheet. There's no coding experience or configuration time involved: simply open the Chrome extension and start clicking.

How do I run multiple spiders per process in Scrapy?

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through its internal API (CrawlerRunner), either simultaneously or sequentially by chaining the deferreds; a sketch of the simultaneous version follows.
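The simultaneous version looks roughly like this (a sketch along the lines of the Scrapy docs; MySpider1 and MySpider2 stand in for your own spider classes):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from myproject.spiders import MySpider1, MySpider2  # placeholder import for your spiders

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()                    # deferred that fires when both crawls finish
d.addBoth(lambda _: reactor.stop())  # then stop the reactor
reactor.run()                        # blocks until all crawls are done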

What is a Scrapy Spider?

This is the simplest spider, and the one from which every other spider must inherit (including spiders that come bundled with Scrapy, as well as spiders that you write yourself). It doesn’t provide any special functionality.

Can I run multiple spiders at the same time?

Yes. Using the internal API you can run several spiders in the same process, either simultaneously or sequentially by chaining the deferreds (a sketch of the sequential variant follows). Bear in mind that different spiders can set different values for the same setting, but when they run in the same process it may be impossible, by design or because of some limitations, to use these different values.
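The sequential variant chains the deferreds so each spider starts only after the previous one finishes (again a sketch; the spider classes are placeholders):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from myproject.spiders import MySpider1, MySpider2  # placeholder import for your spiders

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)  # wait for the first spider to finish
    yield runner.crawl(MySpider2)  # then start the second
    reactor.stop()

crawl()
reactor.run()  # blocks until crawl() stops the reactor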

What is the use of scrapyd?

Scrapyd is an application that allows us to deploy Scrapy spiders on a server and run them remotely using a JSON API. Scrapyd allows you to: run Scrapy jobs, pause and cancel Scrapy jobs, manage Scrapy project/spider versions, and access Scrapy logs remotely.
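For illustration, scheduling a job and checking job state through that JSON API can look like this (a sketch; the project and spider names are placeholders, and Scrapyd is assumed to be running on its default port 6800):

import requests

BASE = "http://localhost:6800"  # default Scrapyd address (assumption)

# schedule one crawl job
resp = requests.post(f"{BASE}/schedule.json",
                     data={"project": "myproject", "spider": "myspider"})
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# list pending/running/finished jobs for the project
jobs = requests.get(f"{BASE}/listjobs.json", params={"project": "myproject"}).json()
print(len(jobs["running"]), "running,", len(jobs["pending"]), "pending")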


1 Answer

The simplest way to do this is to run them all from the command line. For example:

$ scrapy list | xargs -P 4 -n 1 scrapy crawl

This will run all your spiders, with up to four running in parallel at any time. You can then have the script send a notification once this command has completed.
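One way to wire up the notification, sketched in Python (the webhook URL is a placeholder; swap in email, Slack, or whatever you use):

import subprocess
import requests

# run the same pipeline as above and wait for it to finish
result = subprocess.run("scrapy list | xargs -P 4 -n 1 scrapy crawl", shell=True)

# placeholder notification endpoint
requests.post("https://example.com/notify",
              json={"text": f"crawls finished, exit code {result.returncode}"})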

A more robust option is to use scrapyd. This comes with an API, a minimal web interface, etc. It will also queue the crawls and only run a certain (configurable) number at once. You can interact with it via the API to start your spiders and send notifications once they are all complete.
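As a sketch of that workflow (the project name, poll interval, and notification are placeholders, and Scrapyd is assumed to be on its default port), you could schedule every spider and then poll until nothing is pending or running:

import time
import requests

BASE = "http://localhost:6800"  # default Scrapyd address (assumption)
PROJECT = "myproject"           # placeholder project name

# schedule every spider in the project
spiders = requests.get(f"{BASE}/listspiders.json",
                       params={"project": PROJECT}).json()["spiders"]
for name in spiders:
    requests.post(f"{BASE}/schedule.json", data={"project": PROJECT, "spider": name})

# poll until all jobs have left the pending/running queues
while True:
    jobs = requests.get(f"{BASE}/listjobs.json", params={"project": PROJECT}).json()
    if not jobs["pending"] and not jobs["running"]:
        break
    time.sleep(30)

print("all scrapyd jobs finished")  # send your real notification here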

Scrapy Cloud is a perfect fit for this [disclaimer: I work for Scrapinghub]. It lets you run only a certain number at once, gives you a queue of pending jobs (which you can modify, browse online, prioritize, etc.), and has a more complete API than scrapyd.

You shouldn't run all your spiders in a single process. It will probably be slower, can introduce unforeseen bugs, and you may hit resource limits (like you did). If you run them separately using any of the options above, just run enough to max out your hardware resources (usually CPU/network). If you still get problems with file descriptors at that point you should increase the limit.

answered Oct 21 '22 by Shane Evans