I have a Scrapy CrawlSpider with a very large list of URLs to crawl. I would like to be able to stop it, save its current state, and resume it later without having to start over. Is there a way to accomplish this within the Scrapy framework?
Just wanted to share that this feature is included in the latest Scrapy version, but the parameter name has changed. You should use it like this:
scrapy crawl thespider --set JOBDIR=run1
For more information, see the job directory documentation: http://doc.scrapy.org/en/latest/topics/jobs.html#job-directory
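For completeness, here is a minimal sketch of a spider you could run with that command. The spider name, domain, start URL, and callback are placeholders, not anything from the original post; the point is that persistence comes entirely from passing JOBDIR, with no changes to the spider code itself:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TheSpider(CrawlSpider):
    name = "thespider"                      # matches the crawl command above
    allowed_domains = ["example.com"]       # placeholder domain
    start_urls = ["https://example.com/"]   # placeholder start URL

    rules = (
        # Follow every link and parse each page. Queue persistence is
        # handled by Scrapy's scheduler once JOBDIR is set, not by
        # anything in the spider itself.
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

With this running under scrapy crawl thespider --set JOBDIR=run1, pressing Ctrl-C once lets the crawl shut down gracefully and write its pending-request queue and seen-request fingerprints into the run1 directory; rerunning the same command with the same JOBDIR resumes from where it left off. Note that pressing Ctrl-C a second time forces an immediate stop and can lose that state.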