
Scrapy Spider: Restart the spider when it finishes

I'm trying to make my Scrapy spider launch again if the close reason is my internet connection going down (at night the internet drops for about 5 minutes). When the connection goes down, the spider closes after 5 tries.

I'm using this function inside my spider definition to try to restart the spider when it closes:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def handle_spider_closed(spider, reason):
    # Check the crawl stats for DNS lookup errors (i.e. the connection dropped)
    relaunch = False
    for key in spider.crawler.stats._stats.keys():
        if 'DNSLookupError' in key:
            relaunch = True
            break

    # If the spider closed because of DNS errors, try to launch it again
    if relaunch:
        spider = mySpider()
        settings = get_project_settings()
        crawlerProcess = CrawlerProcess(settings)
        crawlerProcess.configure()
        crawlerProcess.crawl(spider)
        spider.crawler.queue.append_spider(another_spider)

I tried a lot of things, like re-instantiating the spider, but got an error along the lines of Reactor is already running.

I also thought about executing the spider from a script and calling it again when it finishes, but that didn't work either, because the reactor is still in use (something like the sketch below).
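Roughly what I mean (a rough sketch, not my exact script; mySpider is my spider class):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(mySpider)
process.start()  # runs the crawl; the Twisted reactor stops when it finishes

# Launching it again in the same process fails, because the
# Twisted reactor cannot be restarted once it has been stopped:
process.crawl(mySpider)
process.start()  # raises twisted.internet.error.ReactorNotRestartable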

  • My intention is to restart the spider after it closes (it closes because it lost the internet connection)

Does anyone know a good and easy way to do this?

asked Mar 11 '15 by AlvaroAV

1 Answer

I found the solution to my issue! What was I trying to do?

  • Handle the spider when it fails or closes
  • Try to re-execute the spider when it closes

I managed it by handling the spider's errors like this:

import time

import scrapy


class mySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["google.com"]
    start_urls = [
        "http://www.google.com",
    ]

    def handle_error(self, failure):
        # Called when a request fails (e.g. DNSLookupError while the
        # connection is down): wait a minute, then retry the same URL.
        self.log("Error Handle: %s" % failure.request)
        self.log("Sleeping 60 seconds")
        time.sleep(60)  # note: this blocks the reactor while it sleeps
        url = 'http://www.google.com'
        yield scrapy.Request(url, self.parse, errback=self.handle_error, dont_filter=True)

    def start_requests(self):
        url = 'http://www.google.com'
        yield scrapy.Request(url, self.parse, errback=self.handle_error)

  • I used dont_filter=True so the spider is allowed to repeat a request, but only when it comes through the error handler.
  • errback=self.handle_error sends any failed request through the custom handle_error function.
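For completeness: the "Reactor is already running" problem from the question can also be avoided by chaining crawls from a script with CrawlerRunner and Twisted deferreds instead of CrawlerProcess. This is not what I ended up using, just a rough sketch of that alternative (mySpider is the spider class above, and crawl_until_no_dns_errors is a made-up helper name):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# mySpider would be imported from the project's spiders module
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_until_no_dns_errors():
    while True:
        crawler = runner.create_crawler(mySpider)
        yield runner.crawl(crawler)        # waits for this crawl to finish
        stats = crawler.stats.get_stats()  # same stats dict the question inspects
        if not any('DNSLookupError' in key for key in stats):
            break                          # clean run: stop relaunching
    reactor.stop()

crawl_until_no_dns_errors()
reactor.run()

Because the reactor keeps running between crawls, relaunching the spider this way does not hit the reactor errors described in the question.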
answered by AlvaroAV