Running code when Scrapy spider has finished crawling

Is there a way to get Scrapy to execute code once the crawl has completely finished, so I can deal with moving / cleaning the data? I'm sure it's trivial, but my Google-fu seems to have left me on this one.

asked Jun 28 '13 by Jonno

2 Answers

It all depends on how you're launching Scrapy.

If you're running from the command line with crawl or runspider, just wait for the process to finish. Be aware that a 0 exit code does not mean everything was crawled successfully.
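For example, a minimal sketch of that approach, assuming a spider named myspider and a hypothetical move_and_clean_data() helper:

import subprocess

# Run the spider through the Scrapy CLI and block until the process exits.
result = subprocess.run(["scrapy", "crawl", "myspider"])

# A zero exit code only means the process terminated normally,
# not that every page was crawled successfully.
if result.returncode == 0:
    move_and_clean_data()  # hypothetical post-crawl cleanup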

If you're using Scrapy as a library, you can put the code after the CrawlerProcess.start() call.

If you need to track the status reliably, the first thing to do is subscribe to the spider_closed signal and check its reason parameter. There is an example at the start of the signals documentation page; it expects you to modify the code of the spider.
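A rough sketch of that in-spider approach (the spider name and start URL are placeholders):

import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']  # placeholder

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        # normal parsing logic goes here
        yield {'url': response.url}

    def spider_closed(self, spider, reason):
        # reason is 'finished' for a normal shutdown; anything else
        # (e.g. 'shutdown') means the crawl was interrupted or stopped.
        if reason == 'finished':
            print('Crawl finished cleanly, safe to move/clean the data')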

To track all the spiders you have added when using Scrapy as a library:

import scrapy
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({})
process.crawl(MySpider)

def spider_ended(spider, reason):
    print('Spider ended:', spider.name, reason)

# Connect the handler to every crawler that was added via process.crawl().
for crawler in process.crawlers:
    crawler.signals.connect(spider_ended, signal=scrapy.signals.spider_closed)

process.start()

Check the reason: if it is not 'finished', something has interrupted the crawler.
The function will be called for each spider, so it may require some complex error handling if you have many. Also keep in mind that after receiving two keyboard interrupts, Scrapy begins an unclean shutdown and the function won't be called, but the code placed after process.start() will run anyway.

Alternatively, you can use the extensions mechanism to connect to these signals without touching the rest of the code base. The sample extension shows how to track this signal.

But all of this only detects a failure caused by an interruption. You also need to subscribe to the spider_error signal, which is fired when a Python exception is raised in a spider. On top of that, network errors have to be handled separately; see this question.
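A hedged sketch of wiring that up as well, on top of the snippet above (spider_error handlers receive the failure, the response and the spider):

from scrapy import signals

callback_errors = []

def spider_error(failure, response, spider):
    # Fired for every unhandled exception raised in a spider callback.
    callback_errors.append((spider.name, response.url, failure.getErrorMessage()))

# Connect this before process.start(), next to the spider_closed handler above.
for crawler in process.crawlers:
    crawler.signals.connect(spider_error, signal=signals.spider_error)

After process.start() returns you can inspect callback_errors to decide whether the scraped data is trustworthy.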

In the end I ditched the idea of tracking failures and just tracked success with a global variable that is checked after process.start() returns. In my case the marker of success was not finding a "next page" link. But mine was a linear scraper, so it was easy; your case may be different.
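As an illustration of that approach (the CSS selector, spider name and URL are placeholders for a hypothetical linear scraper):

import scrapy
from scrapy.crawler import CrawlerProcess

reached_last_page = False

class LinearSpider(scrapy.Spider):
    name = 'linear'
    start_urls = ['https://example.com/page/1']  # placeholder

    def parse(self, response):
        # item extraction would go here
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
        else:
            # No "next page" link: treat the crawl as having succeeded.
            global reached_last_page
            reached_last_page = True

process = CrawlerProcess({})
process.crawl(LinearSpider)
process.start()

if reached_last_page:
    print('Crawl reached the last page, moving/cleaning the data')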

answered Sep 28 '22 by user


You can write an extension that catches the spider_closed signal and executes your custom code.
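For instance, a rough sketch of such an extension (the class name and the module path in EXTENSIONS are whatever you choose in your project):

from scrapy import signals

class PostCrawlCleanup:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        # Runs when the spider closes; put the moving/cleaning logic here.
        spider.logger.info('Spider %s closed (%s), running cleanup', spider.name, reason)

and enable it in settings.py:

EXTENSIONS = {
    'myproject.extensions.PostCrawlCleanup': 500,
}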

answered Sep 28 '22 by Balthazar Rouberol