Is there a way to run all of the spiders in a Scrapy project without using the Scrapy daemon? There used to be a way to run multiple spiders with scrapy crawl, but that syntax was removed and Scrapy's code changed quite a bit.
I tried creating my own command:
from scrapy.command import ScrapyCommand
from scrapy.utils.misc import load_object
from scrapy.conf import settings

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spman_cls = load_object(settings['SPIDER_MANAGER_CLASS'])
        spiders = spman_cls.from_settings(settings)

        for spider_name in spiders.list():
            spider = self.crawler.spiders.create(spider_name)
            self.crawler.crawl(spider)

        self.crawler.start()
But once a spider is registered with self.crawler.crawl(), I get assertion errors for all of the other spiders:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/home/blender/Projects/scrapers/store_crawler/store_crawler/commands/crawlall.py", line 22, in run
    self.crawler.crawl(spider)
  File "/usr/lib/python2.7/site-packages/scrapy/crawler.py", line 47, in crawl
    return self.engine.open_spider(spider, requests)
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 1214, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 1071, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python2.7/site-packages/scrapy/core/engine.py", line 215, in open_spider
    spider.name
exceptions.AssertionError: No free spider slots when opening 'spidername'
Is there any way to do this? I'd rather not start subclassing core Scrapy components just to run all of my spiders like this.
You can use the CrawlerProcess class to run multiple Scrapy spiders in a single process. Create a CrawlerProcess instance with the project settings; if a particular spider needs custom settings, create a separate Crawler instance for it. CrawlerProcess is also the key to running Scrapy from a plain Python script: it imports Twisted internally and provides the engine that drives the crawl.
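Here is a minimal sketch of that approach, assuming Scrapy 1.x or later and that the script (call it run_all.py; the name is just illustrative) lives inside the project so get_project_settings() can find your settings:

# run_all.py -- a sketch, not a drop-in solution
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)

# spider_loader.list() returns the names of every spider in the project;
# crawl() accepts either a spider class or a spider name.
for spider_name in process.spider_loader.list():
    process.crawl(spider_name)

process.start()  # blocks until all crawls are finished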
Why didn't you just use something like:
scrapy list|xargs -n 1 scrapy crawl
?
Here is an example that does not run inside a custom command, but runs the Reactor manually and creates a new Crawler for each spider:
from twisted.internet import reactor
from scrapy.crawler import Crawler
# scrapy.conf.settings singleton was deprecated last year
from scrapy.utils.project import get_project_settings
from scrapy import log

def setup_crawler(spider_name):
    crawler = Crawler(settings)
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()

log.start()
settings = get_project_settings()
crawler = Crawler(settings)
crawler.configure()

for spider_name in crawler.spiders.list():
    setup_crawler(spider_name)

reactor.run()
You will have to design some signal system to stop the reactor when all spiders are finished.
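One way to do that (just a sketch; the counter is illustrative, not part of Scrapy's API) is to connect a callback to the spider_closed signal and stop the reactor once the last spider has closed:

from twisted.internet import reactor
from scrapy import signals

running = {'count': 0}  # illustrative counter, not a Scrapy API

def spider_closed(spider):
    running['count'] -= 1
    if running['count'] == 0:
        reactor.stop()

# inside setup_crawler(), after creating the crawler:
#     crawler.signals.connect(spider_closed, signal=signals.spider_closed)
#     running['count'] += 1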
EDIT: And here is how you can run multiple spiders in a custom command:
from scrapy.command import ScrapyCommand
from scrapy.utils.project import get_project_settings
from scrapy.crawler import Crawler

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        settings = get_project_settings()

        for spider_name in self.crawler.spiders.list():
            crawler = Crawler(settings)
            crawler.configure()
            spider = crawler.spiders.create(spider_name)
            crawler.crawl(spider)
            crawler.start()

        self.crawler.start()
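For Scrapy to pick up a custom command like this, the project's COMMANDS_MODULE setting has to point at the package that contains it. Going by the path in the traceback above, the registration would look roughly like this:

# settings.py -- module path taken from the traceback above; adjust for your project
COMMANDS_MODULE = 'store_crawler.commands'

# the command is then run as:  scrapy crawlall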
The answer from @Steven Almeroth will fail in Scrapy 1.0; you should edit the script like this:
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

class Command(ScrapyCommand):
    requires_project = True
    excludes = ['spider1']

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        settings = get_project_settings()
        crawler_process = CrawlerProcess(settings)

        for spider_name in crawler_process.spider_loader.list():
            if spider_name in self.excludes:
                continue
            spider_cls = crawler_process.spider_loader.load(spider_name)
            crawler_process.crawl(spider_cls)

        crawler_process.start()