I'm trying to make my Scrapy spider launch again if it closed because my internet connection went down (at night the connection drops for about 5 minutes). When the connection drops, the spider closes after 5 retries.
I'm trying to use this function inside my spider definition to restart the spider when it closes:
def handle_spider_closed(spider, reason):
    relaunch = False
    for key in spider.crawler.stats._stats.keys():
        if 'DNSLookupError' in key:
            relaunch = True
            break

    if relaunch:
        spider = mySpider()
        settings = get_project_settings()
        crawlerProcess = CrawlerProcess(settings)
        crawlerProcess.configure()
        crawlerProcess.crawl(spider)
        spider.crawler.queue.append_spider(another_spider)
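In case it matters, the handler is connected to the spider_closed signal so it runs when the crawl ends. A minimal sketch of that hookup (from_crawler is just one way to connect it):

import scrapy
from scrapy import signals

class mySpider(scrapy.Spider):
    name = "myspider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(mySpider, cls).from_crawler(crawler, *args, **kwargs)
        # handle_spider_closed(spider, reason) is called when the crawl finishes
        crawler.signals.connect(handle_spider_closed, signal=signals.spider_closed)
        return spider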
I tried a lot of things, like re-instantiating the spider, but I get an error along the lines of "Reactor is already running".
I also thought about executing the spider from a script and calling it again when it finishes, but that didn't work either, because the reactor is still in use.
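For reference, the script version looked roughly like this (a sketch; the restart condition is simplified), and the second call to start() is where the reactor problem shows up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
# mySpider is the spider class defined in my project

def run_spider():
    process = CrawlerProcess(get_project_settings())
    process.crawl(mySpider)
    process.start()  # blocks until the crawl finishes

run_spider()
# Running it a second time after the spider closes is what fails:
# the Twisted reactor cannot be restarted once it has stopped.
run_spider()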
Does anyone know a good and easy way to do this?
I found the solution to my issue! Instead of relaunching the spider, I handle the connection error inside the spider itself and retry the request, like this:
import time
import scrapy


class mySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["google.com"]
    start_urls = [
        "http://www.google.com",
    ]

    def handle_error(self, failure):
        self.log("Error Handle: %s" % failure.request)
        self.log("Sleeping 60 seconds")
        time.sleep(60)  # wait before retrying (note: this blocks the reactor while it waits)
        url = 'http://www.google.com'
        # Re-issue the request; dont_filter=True lets the duplicate through
        yield scrapy.Request(url, self.parse, errback=self.handle_error, dont_filter=True)

    def start_requests(self):
        url = 'http://www.google.com'
        yield scrapy.Request(url, self.parse, errback=self.handle_error)

    def parse(self, response):
        # normal parsing of the page goes here
        pass
dont_filter=True allows the spider to repeat a request, but only when it comes back through the error path. errback=self.handle_error makes the spider go through the custom handle_error function whenever a request fails.
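If you only want to retry on connection problems (like the DNSLookupError check in the question), the errback can also inspect the failure type before retrying. A sketch of a drop-in variant of the handle_error method above, assuming Twisted's connection-error classes cover the cases you care about:

import time
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

def handle_error(self, failure):
    if failure.check(DNSLookupError, TimeoutError, TCPTimedOutError):
        # Connection-level problem: wait, then re-issue the same request
        self.log("Connection error, retrying: %s" % failure.request)
        time.sleep(60)
        yield failure.request.replace(dont_filter=True)
    else:
        # Anything else is just logged and not retried
        self.log("Unhandled error: %s" % repr(failure))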