Can someone explain to me how the pause/resume feature in Scrapy works? The version of Scrapy that I'm using is 0.24.5. The documentation does not provide much detail.
I have the following simple spider:
from scrapy import Spider
from scrapy.http import Request


class SampleSpider(Spider):
    name = 'sample'

    def start_requests(self):
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1054')
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1055')

    def parse(self, response):
        with open('responses.txt', 'a') as f:
            f.write(response.url + '\n')
I'm running it using:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from scrapyproject.spiders.sample_spider import SampleSpider

spider = SampleSpider()
settings = get_project_settings()
settings.set('JOBDIR', '/some/path/scrapy_cache')
settings.set('DOWNLOAD_DELAY', 10)
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
As you can see, I enabled the JOBDIR option so that I can save the state of my crawl.
I set the DOWNLOAD_DELAY to 10 seconds so that I can stop the spider before the requests are processed. I would have expected that the next time I run the spider, the requests would not be regenerated. That is not the case.
I see in my scrapy_cache folder a folder named requests.queue. However, that is always empty.
It looks like the requests.seen file is saving the issued requests (using SHA1 hashes), which is great. However, the next time I run the spider, the requests are regenerated and the (duplicate) SHA1 hashes are added to the file. I tracked this issue in the Scrapy code, and it looks like the RFPDupeFilter opens the requests.seen file with an 'a+' flag, so it will always discard the previous values in the file (at least that is the behavior on my Mac OS X).
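For reference, the SHA1 strings in requests.seen are request fingerprints; the stock RFPDupeFilter computes them with the request_fingerprint helper from scrapy.utils.request, so you can reproduce the values it writes. A minimal sketch (the example request is just the first URL from the spider above):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# Build the same request the spider issues and print its fingerprint:
# a 40-character SHA1 hex digest, one of which is written per line to requests.seen.
req = Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')
print(request_fingerprint(req))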
Finally, regarding spider state, I can see from the Scrapy code that the spider state is saved when the spider is closed and read back when it is opened. However, that is not very helpful if an exception occurs (e.g., the machine shuts down). Do I have to save the state periodically myself?
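The per-spider state the jobs documentation refers to is the spider's state attribute, a plain dict that Scrapy pickles into the JOBDIR when the spider closes cleanly and loads back when it is opened again. A minimal sketch of using it (the items_scraped key is just an illustrative name, not part of Scrapy):

from scrapy import Spider

class StatefulSpider(Spider):
    name = 'stateful'

    def parse(self, response):
        # self.state is an ordinary dict; with JOBDIR set, Scrapy saves it when
        # the spider closes cleanly and restores it on the next run.
        self.state['items_scraped'] = self.state.get('items_scraped', 0) + 1

As noted above, this only covers a clean shutdown; if the process dies unexpectedly, anything accumulated since the last clean close is lost, so crash-safe state would have to be written out by your own code.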
The main question I have here is: what's the common practice for using Scrapy when you expect the crawl to stop and resume multiple times (e.g., when crawling a very big website)?
To be able to pause and resume a Scrapy crawl, start the crawl with this command:
scrapy crawl somespider --set JOBDIR=crawl1
To stop the crawl, press Ctrl-C, but press it only once and wait for Scrapy to shut down; pressing Ctrl-C twice won't let it shut down properly.
You can then resume the crawl by running the same command again:
scrapy crawl somespider --set JOBDIR=crawl1
The version of Scrapy that I'm using is 1.1.0.
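If you start the crawl from a script instead of the scrapy command (as in the question), the equivalent on Scrapy 1.x is to put JOBDIR into the settings you pass to CrawlerProcess. A rough sketch, reusing the spider import path and cache path from the question:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrapyproject.spiders.sample_spider import SampleSpider

settings = get_project_settings()
settings.set('JOBDIR', '/some/path/scrapy_cache')  # reuse the same directory on every run to resume

process = CrawlerProcess(settings)
process.crawl(SampleSpider)
process.start()  # blocks; stop with a single Ctrl-C and rerun the script to resume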
You need to set the correct JOBDIR in settings.py:
JOBDIR = 'PROJECT_DIR'
After stopping the spider with Ctrl-C, you can run it again to continue scraping the rest.
It should work after that.
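One caveat from the Scrapy jobs documentation: the job directory must not be shared by different spiders, or by distinct runs of the same spider that you want treated as separate crawls, so a per-crawl value is the usual pattern (the path below is only an example name):

# settings.py
JOBDIR = 'crawls/somespider-1'  # use a fresh directory for each crawl you want to be resumable on its own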