I'm trying to make my Scrapy spider launch again if it closed because my internet connection went down (at night the connection drops for about 5 minutes). When the connection drops, the spider closes after 5 retries.
I'm trying to use this function inside my spider definition to restart the spider when it closes:
def handle_spider_closed(spider, reason):
    relaunch = False
    for key in spider.crawler.stats._stats.keys():
        if 'DNSLookupError' in key:
            relaunch = True
            break

    if relaunch:
        spider = mySpider()
        settings = get_project_settings()
        crawlerProcess = CrawlerProcess(settings)
        crawlerProcess.configure()
        crawlerProcess.crawl(spider)
        spider.crawler.queue.append_spider(another_spider)
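In case it matters, the handler is connected to the spider_closed signal so it runs when the crawl ends. A minimal sketch of that hookup (from_crawler is just one way to connect it):

import scrapy
from scrapy import signals

class mySpider(scrapy.Spider):
    name = "myspider"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(mySpider, cls).from_crawler(crawler, *args, **kwargs)
        # handle_spider_closed(spider, reason) is called when the crawl finishes
        crawler.signals.connect(handle_spider_closed, signal=signals.spider_closed)
        return spider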
I tried a lot of things, like re-instantiating the spider, but I get an error along the lines of "Reactor is already running".
I also thought about executing the spider from a script and calling it again when it finishes, but that didn't work either, because the reactor is still in use.
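For reference, the script version looked roughly like this (a sketch; the restart condition is simplified), and the second call to start() is where the reactor problem shows up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
# mySpider is the spider class defined in my project

def run_spider():
    process = CrawlerProcess(get_project_settings())
    process.crawl(mySpider)
    process.start()  # blocks until the crawl finishes

run_spider()
# Running it a second time after the spider closes is what fails:
# the Twisted reactor cannot be restarted once it has stopped.
run_spider()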
Does anyone know a good and easy way to do this?
I found the solution to my issue! Instead of relaunching the spider, I handle the connection error inside the spider itself and retry the request, like this:
import time
import scrapy


class mySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["google.com"]
    start_urls = [
        "http://www.google.com",
    ]

    def handle_error(self, failure):
        self.log("Error Handle: %s" % failure.request)
        self.log("Sleeping 60 seconds")
        time.sleep(60)  # wait before retrying (note: this blocks the reactor while it waits)
        url = 'http://www.google.com'
        # Re-issue the request; dont_filter=True lets the duplicate through
        yield scrapy.Request(url, self.parse, errback=self.handle_error, dont_filter=True)

    def start_requests(self):
        url = 'http://www.google.com'
        yield scrapy.Request(url, self.parse, errback=self.handle_error)

    def parse(self, response):
        # normal parsing of the page goes here
        pass
dont_filter=True allows the spider to repeat a request, but only when it comes back through the error path. errback=self.handle_error makes the spider go through the custom handle_error function whenever a request fails.
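If you only want to retry on connection problems (like the DNSLookupError check in the question), the errback can also inspect the failure type before retrying. A sketch of a drop-in variant of the handle_error method above, assuming Twisted's connection-error classes cover the cases you care about:

import time
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

def handle_error(self, failure):
    if failure.check(DNSLookupError, TimeoutError, TCPTimedOutError):
        # Connection-level problem: wait, then re-issue the same request
        self.log("Connection error, retrying: %s" % failure.request)
        time.sleep(60)
        yield failure.request.replace(dont_filter=True)
    else:
        # Anything else is just logged and not retried
        self.log("Unhandled error: %s" % repr(failure))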