I'm using scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider in a project deployed to scrapyd using: <pre class="prettyprint"><code>curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2 </code></pre> But how do I schedule all spiders in a project at once? All help much appreciated!

Sorry, I know this is an old topic, but I've started learning scrapy recently and stumbled here, and I don't have enough rep yet to post a comment, so posting an answer. From the common scrapy practices you'll see that if you need to run multiple spiders at once, you'll have to start multiple scrapyd service instances and then distribute your Spider runs among those.

Run multiple scrapy spiders at once using scrapyd

Tags:

python

scrapy

screen-scraping

scrapyd

I'm using scrapy for a project where I want to scrape a number of sites - possibly hundreds - and I have to write a specific spider for each site. I can schedule one spider in a project deployed to scrapyd using:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2

But how do I schedule all spiders in a project at once?

All help much appreciated!

716

asked May 29 '12 14:05

user1009453

2 Answers

My solution for running 200+ spiders at once has been to create a custom command for the project. See http://doc.scrapy.org/en/latest/topics/commands.html#custom-project-commands for more information about implementing custom commands.

YOURPROJECTNAME/commands/allcrawl.py :

from scrapy.command import ScrapyCommand
import urllib
import urllib2
from scrapy import log

class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):
        url = 'http://localhost:6800/schedule.json'
        for s in self.crawler.spiders.list():
            values = {'project' : 'YOUR_PROJECT_NAME', 'spider' : s}
            data = urllib.urlencode(values)
            req = urllib2.Request(url, data)
            response = urllib2.urlopen(req)
            log.msg(response)

Make sure to include the following in your settings.py

COMMANDS_MODULE = 'YOURPROJECTNAME.commands'

Then from the command line (in your project directory) you can simply type

scrapy allcrawl

answered Oct 23 '22 00:10

dru

Sorry, I know this is an old topic, but I've started learning scrapy recently and stumbled here, and I don't have enough rep yet to post a comment, so posting an answer.

From the common scrapy practices you'll see that if you need to run multiple spiders at once, you'll have to start multiple scrapyd service instances and then distribute your Spider runs among those.

answered Oct 23 '22 01:10

Cajetan Rodrigues

Related questions
                            
                                Redirecting django manage.py output (in windows) to a text file
                            
                                how to open url in new tab from django?
                            
                                How to list all dlls loaded by a process with Python?
                            
                                Build a PyObject* from a C function?
                            
                                Python: if more than one of three things is true, return false
                            
                                python operator precedence of in and comparison
                            
                                Can I define a scope anywhere in Python?
                            
                                Using Amazon S3 with Heroku, Python, and Flask
                            
                                How to combine callLater and addCallback?
                            
                                Mac OS X, pip: specify compiler for packages containing C libraries
                            
                                os.walk() in reverse?
                            
                                How to get all child components of QWidget in pyside/pyqt/qt?
                            
                                Django and models with multiple foreign keys
                            
                                Python - create dictionary from list of dictionaries
                            
                                Remove element from tuple in a list
                            
                                How do I check if a user left the 'input' or 'raw_input' prompt empty?
                            
                                Python asks for older paths on mac after deleting duplicate python installation
                            
                                How to use can_add_related in Django Admin
                            
                                Change icon for a cx_Freeze script
                            
                                Difference between calling sys.exit() and throwing exception

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With