class MySpider1:
    # do something ...

class MySpider2:
    # do something ...

The above is the structure of my spider.py file, and I am trying to run MySpider1 first and then run MySpider2 multiple times, depending on some conditions. How could I do that? Any tips?
configure_logging()
runner = CrawlerRunner()

def crawl():
    yield runner.crawl(MySpider1, arg.....)
    yield runner.crawl(MySpider2, arg.....)

crawl()
reactor.run()
I am trying to use this approach, but I have no idea how to run it. Should I run a command on the command line (if so, which commands?), or just run the Python file? Thanks a lot!
You need to use the Deferred object returned by process.crawl(), which allows you to add a callback for when the crawl is finished. Here is my code:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def start_sequentially(process: CrawlerProcess, crawlers: list):
    print('start crawler {}'.format(crawlers[0].__name__))
    deferred = process.crawl(crawlers[0])
    if len(crawlers) > 1:
        # when the current crawl finishes, start the next one in the list
        deferred.addCallback(lambda _: start_sequentially(process, crawlers[1:]))

def main():
    crawlers = [Crawler1, Crawler2]  # your spider classes
    process = CrawlerProcess(settings=get_project_settings())
    start_sequentially(process, crawlers)
    process.start()  # blocks until every chained crawl has finished

if __name__ == '__main__':
    main()
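The same Deferred chaining also covers the second half of the question (running MySpider2 several times depending on some conditions). A minimal sketch, where should_run_again() is a hypothetical placeholder for whatever check you want to make between runs:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(settings=get_project_settings())

def run_spider2_again(_=None):
    # should_run_again() is a hypothetical hook; replace it with your own condition.
    if should_run_again():
        process.crawl(MySpider2).addCallback(run_spider2_again)

# Run MySpider1 once, then re-run MySpider2 for as long as the condition holds.
process.crawl(MySpider1).addCallback(run_spider2_again)
process.start()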
Run the Python file. For example, save the following as test.py:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    name = "dmoz1"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        print("first spider")

class MySpider2(scrapy.Spider):
    # Your second spider definition
    name = "dmoz2"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        print("second spider")

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
Now run python test.py > output.txt. You can observe from output.txt that your spiders run sequentially.
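This approach also adapts to the original question, where MySpider2 should run multiple times depending on some conditions: the inlineCallbacks generator can simply loop. A minimal sketch, where need_another_run() is a hypothetical placeholder for your own condition:

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    # need_another_run() is a hypothetical hook; replace it with your own check.
    while need_another_run():
        yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()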