
Locally run all of the spiders in Scrapy

Is there a way to run all of the spiders in a Scrapy project without using the Scrapy daemon? There used to be a way to run multiple spiders with scrapy crawl, but that syntax was removed and Scrapy's code changed quite a bit.

I tried creating my own command:

from scrapy.command import ScrapyCommand
from scrapy.utils.misc import load_object
from scrapy.conf import settings

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spman_cls = load_object(settings['SPIDER_MANAGER_CLASS'])
        spiders = spman_cls.from_settings(settings)

        for spider_name in spiders.list():
            spider = self.crawler.spiders.create(spider_name)
            self.crawler.crawl(spider)

        self.crawler.start()

But once a spider is registered with self.crawler.crawl(), I get assertion errors for all of the other spiders:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/home/blender/Projects/scrapers/store_crawler/store_crawler/commands/crawlall.py", line 22, in run
    self.crawler.crawl(spider)
  File "/usr/lib/python2.7/site-packages/scrapy/crawler.py", line 47, in crawl
    return self.engine.open_spider(spider, requests)
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 1214, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 1071, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python2.7/site-packages/scrapy/core/engine.py", line 215, in open_spider
    spider.name
exceptions.AssertionError: No free spider slots when opening 'spidername'

Is there any way to do this? I'd rather not start subclassing core Scrapy components just to run all of my spiders like this.

asked Mar 22 '13 by Blender


3 Answers

Why not just use something like:

scrapy list|xargs -n 1 scrapy crawl

?
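If you go the shell route, `xargs` can also fan the crawls out across processes with `-P`. A small sketch of the pipeline, using `printf` to simulate `scrapy list` output and `echo` in place of the real `scrapy crawl` so the expansion is visible (`spider1`/`spider2` are hypothetical names):

```shell
# printf stands in for `scrapy list`; echo stands in for `scrapy crawl`.
# -n 1 passes exactly one spider name per invocation.
printf 'spider1\nspider2\n' | xargs -n 1 echo scrapy crawl
```

With a real project, drop the `echo` and add parallelism: `scrapy list | xargs -n 1 -P 4 scrapy crawl` runs up to four crawls at once, each in its own process.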

answered Oct 27 '22 by side2k


Here is an example that does not run inside a custom command, but runs the Reactor manually and creates a new Crawler for each spider:

from twisted.internet import reactor
from scrapy.crawler import Crawler
# the scrapy.conf.settings singleton is deprecated; use get_project_settings instead
from scrapy.utils.project import get_project_settings
from scrapy import log

def setup_crawler(spider_name):
    crawler = Crawler(settings)
    crawler.configure()
    spider = crawler.spiders.create(spider_name)
    crawler.crawl(spider)
    crawler.start()

log.start()
settings = get_project_settings()
crawler = Crawler(settings)
crawler.configure()

for spider_name in crawler.spiders.list():
    setup_crawler(spider_name)

reactor.run()

You will have to design some signal system to stop the reactor when all spiders are finished.
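One way to sketch that signal system: keep a countdown of running spiders and stop the reactor when it reaches zero. `ReactorStopper` below is a hypothetical helper, not a Scrapy API; the idea is to connect its `spider_closed` method to each crawler's `spider_closed` signal (via something like `crawler.signals.connect(stopper.spider_closed, signal=signals.spider_closed)`):

```python
class ReactorStopper:
    """Stops the reactor once every spider has reported spider_closed."""

    def __init__(self, reactor, n_spiders):
        self.reactor = reactor      # anything with a .stop() method, e.g. Twisted's reactor
        self.remaining = n_spiders  # number of spiders still running

    def spider_closed(self, *args, **kwargs):
        # Intended to be connected to each crawler's spider_closed signal.
        self.remaining -= 1
        if self.remaining == 0:
            self.reactor.stop()
```

The counter approach avoids subclassing any Scrapy component: the helper only needs to know how many crawlers were started and how to stop the reactor.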

EDIT: And here is how you can run multiple spiders in a custom command:

from scrapy.command import ScrapyCommand
from scrapy.utils.project import get_project_settings
from scrapy.crawler import Crawler

class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        settings = get_project_settings()

        for spider_name in self.crawler.spiders.list():
            crawler = Crawler(settings)
            crawler.configure()
            spider = crawler.spiders.create(spider_name)
            crawler.crawl(spider)
            crawler.start()

        self.crawler.start()

answered Oct 27 '22 by Steven Almeroth


@Steven Almeroth's answer will fail in Scrapy 1.0; you should edit the script like this:

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

class Command(ScrapyCommand):

    requires_project = True
    excludes = ['spider1']

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        settings = get_project_settings()
        crawler_process = CrawlerProcess(settings) 

        for spider_name in crawler_process.spider_loader.list():
            if spider_name in self.excludes:
                continue
            spider_cls = crawler_process.spider_loader.load(spider_name) 
            crawler_process.crawl(spider_cls)
        crawler_process.start()

answered Oct 27 '22 by Soarone