Is there a way to get Scrapy to execute code once the crawl has completely finished, to deal with moving / cleaning the data? I'm sure it is trivial, but my Google-fu seems to have left me for this issue.
It all depends on how you're launching Scrapy.
If you're running it from the command line with crawl or runspider, just wait for the process to finish. Beware that a zero exit code doesn't mean you've crawled everything successfully.
If you're using Scrapy as a library, you can put your code after the CrawlerProcess.start() call.
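For example, a minimal sketch of the library approach (MySpider and move_and_clean_data are placeholders for your own spider class and cleanup code):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={})
process.crawl(MySpider)
process.start()  # blocks until the crawl is finished

# Anything below runs only after the crawl has completed.
move_and_clean_data()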
If you need to reliably track the status, the first thing to do is track the spider_closed signal and check its reason parameter. There's an example at the start of the signals documentation page; it expects you to modify the code of the spider.
To track all spiders you have added when using Scrapy as a library:

import scrapy
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({})
process.crawl(MySpider)

def spider_ended(spider, reason):
    print('Spider ended:', spider.name, reason)

for crawler in process.crawlers:
    crawler.signals.connect(spider_ended, signal=scrapy.signals.spider_closed)

process.start()
Check the reason: if it is not 'finished', something has interrupted the crawler.
The function will be called for each spider, so it may require some complex error handling if you have many. Also keep in mind that after receiving two keyboard interrupts, Scrapy begins an unclean shutdown and the function won't be called, but the code placed after process.start() will run anyway.
Alternatively, you can use the extensions mechanism to connect to these signals without touching the rest of the code base. The sample extension in the Scrapy docs shows how to track this signal.
But all of this only detects a failure caused by an interruption. You also need to subscribe to the spider_error signal, which will be called in case of a Python exception in a spider. There is also network error handling to be done; see this question.
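As a sketch (the handler name is illustrative), it can be connected the same way as the spider_closed handler above:

from scrapy import signals

def spider_errored(failure, response, spider):
    # Called for every unhandled exception raised in a spider callback.
    print('Error in', spider.name, 'on', response.url, ':', failure.getErrorMessage())

for crawler in process.crawlers:
    crawler.signals.connect(spider_errored, signal=signals.spider_error)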
In the end I ditched the idea of tracking failures and just tracked success with a global variable that is checked after process.start() returns. In my case the moment of success was not finding the "next page" link, but I had a linear scraper, so it was easy; your case may be different.
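A rough sketch of that flag approach, assuming a linear spider whose last page simply has no "next" link (the selector and names are illustrative):

import scrapy
from scrapy.crawler import CrawlerProcess

crawl_succeeded = False

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://example.com/page/1']

    def parse(self, response):
        # ... yield your items here ...
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
        else:
            # No "next page" link: the linear crawl reached its end.
            global crawl_succeeded
            crawl_succeeded = True

process = CrawlerProcess({})
process.crawl(MySpider)
process.start()

if crawl_succeeded:
    print('Crawl reached the last page, safe to move / clean the data')
else:
    print('Crawl did not reach the last page')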
You can write an extension catching the spider_closed signal, which will execute your custom code.
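For example, a minimal extension sketch (the class name, module path, and priority value are illustrative):

from scrapy import signals

class CleanupExtension:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        if reason == 'finished':
            # Move / clean your data here.
            spider.logger.info('Crawl finished cleanly, running cleanup')

Enable it in your project's settings.py:

EXTENSIONS = {
    'myproject.extensions.CleanupExtension': 500,
}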