
Run Multiple Spiders Sequentially

class Myspider1:
    # do something...

class Myspider2:
    # do something...

The above is the structure of my spider.py file. I am trying to run Myspider1 first, and then run Myspider2 multiple times depending on some conditions. How could I do that? Any tips?

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Myspider1, arg.....)
    yield runner.crawl(Myspider2, arg.....)
    reactor.stop()
crawl()
reactor.run()

I am trying to use this approach, but I have no idea how to run it. Should I run some command on the command line (and if so, what commands?), or just run the Python file?

Thanks a lot!

asked Mar 20 '16 by Huazhe Yin

2 Answers

You need to use the Deferred object returned by process.crawl(), which lets you add a callback that fires when the crawl is finished.

Here is my code:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def start_sequentially(process: CrawlerProcess, crawlers: list):
    print('start crawler {}'.format(crawlers[0].__name__))
    deferred = process.crawl(crawlers[0])
    if len(crawlers) > 1:
        # chain the next crawl so it starts only after this one finishes
        deferred.addCallback(lambda _: start_sequentially(process, crawlers[1:]))

def main():
    # Crawler1 and Crawler2 are your spider classes
    crawlers = [Crawler1, Crawler2]
    process = CrawlerProcess(settings=get_project_settings())
    start_sequentially(process, crawlers)
    process.start()  # start the reactor; blocks until crawling is finished

if __name__ == '__main__':
    main()
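The key point is that process.crawl() returns a Twisted Deferred, so chaining the next call in addCallback() guarantees each spider starts only after the previous one has finished; process.start() then blocks until the whole chain is done.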
answered Oct 06 '22 by Qiang Zhang

Run the Python file. For example, test.py:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    name = "dmoz1"  # spiders should have unique names
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        print("first spider")

class MySpider2(scrapy.Spider):
    # Your second spider definition
    name = "dmoz2"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        print("second spider")

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished

Now run: python test.py > output.txt
You can see from output.txt that the spiders run sequentially.
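The question also asks how to run the second spider multiple times depending on some condition. Because crawl() is an inlineCallbacks generator, you can simply yield runner.crawl() inside a loop. A minimal sketch, reusing runner, MySpider1, and MySpider2 from test.py above; should_run_again() is a hypothetical placeholder for whatever condition you check:

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    # should_run_again() is a hypothetical stand-in for your own condition
    while should_run_again():
        # each iteration waits for the previous MySpider2 run to finish
        yield runner.crawl(MySpider2)
    reactor.stop()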

answered Oct 05 '22 by Kaladin