 

How does Scrapy pause/resume work?

Tags:

scrapy

Can someone explain to me how the pause/resume feature in Scrapy works?

The version of scrapy that I'm using is 0.24.5

The documentation does not provide much detail.

I have the following simple spider:

from scrapy.http import Request
from scrapy.spider import Spider


class SampleSpider(Spider):
    name = 'sample'

    def start_requests(self):
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1054')
        yield Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1055')

    def parse(self, response):
        # append each crawled URL to a file so I can see which requests actually ran
        with open('responses.txt', 'a') as f:
            f.write(response.url + '\n')

I'm running it using:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from scrapyproject.spiders.sample_spider import SampleSpider

spider = SampleSpider()
settings = get_project_settings()
settings.set('JOBDIR', '/some/path/scrapy_cache')
settings.set('DOWNLOAD_DELAY', 10)
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

As you can see, I enabled the JOBDIR option so that I can save the state of my crawl.

I set DOWNLOAD_DELAY to 10 seconds so that I could stop the spider before all the requests were processed. I expected that the next time I ran the spider, the requests would not be regenerated. That is not the case.

I see in my scrapy_cache folder a folder named requests.queue. However, that is always empty.

It looks like the requests.seen file is saving the issued requests (as SHA1 hashes), which is great. However, the next time I run the spider, the requests are regenerated and the (duplicate) SHA1 hashes are added to the file. I traced this in the Scrapy code: the RFPDupeFilter opens the requests.seen file with the 'a+' flag, so it never reads back the previously stored fingerprints (at least that is the behavior on my Mac OS X).
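For reference, this is how I understand those hashes are produced; a quick check with one of my URLs (RFPDupeFilter uses request_fingerprint under the hood):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

req = Request(url='https://colostate.textbookrack.com/listingDetails?lst_id=1053')
# one 40-character hex SHA1 fingerprint per line in requests.seen
print(request_fingerprint(req))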

Finally, regarding spider state, I can see from the Scrapy code that the spider state is saved when the spider is closed and read back when it's opened. However, that is not very helpful if an exception occurs (e.g., the machine shuts down). Do I have to save the state periodically myself?
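Right now I'm experimenting with a small extension that pickles spider.state on a timer so a crash doesn't lose it, but I don't know if this is the intended approach. A rough sketch (class name, file name, and interval are my own choices; it has to be enabled via the EXTENSIONS setting):

import os
import pickle

from twisted.internet import task
from scrapy import signals


class PeriodicStateSaver(object):
    """Pickle spider.state into the job directory every `interval` seconds."""

    def __init__(self, jobdir, interval=60.0):
        self.jobdir = jobdir
        self.interval = interval
        self.loop = None
        self.spider = None

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.settings.get('JOBDIR'), interval=60.0)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        self.spider = spider
        self.loop = task.LoopingCall(self.save_state)
        self.loop.start(self.interval, now=False)

    def spider_closed(self, spider):
        if self.loop and self.loop.running:
            self.loop.stop()
        self.save_state()

    def save_state(self):
        if not self.jobdir:
            return
        # spider.state is the dict Scrapy normally persists only on a clean close;
        # flushing it here means a hard crash does not lose it
        state = getattr(self.spider, 'state', {})
        with open(os.path.join(self.jobdir, 'spider.state'), 'wb') as f:
            pickle.dump(state, f)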

The main question I have here is: What's the common practice to use Scrapy while expecting that the crawl will stop/resume multiple times (e.g., when crawling a very big website)?

Asked Mar 04 '15 by Abdul

2 Answers

To be able to pause and resume a Scrapy crawl, start it with:

scrapy crawl somespider --set JOBDIR=crawl1

To stop the crawl, press Ctrl-C once and wait for Scrapy to finish shutting down. If you press Ctrl-C a second time, it forces an unclean shutdown and it won't work properly.

You can then resume the crawl by running the same command again:

scrapy crawl somespider --set JOBDIR=crawl1
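
The --set flag is just the long form of -s, so this is equivalent (same hypothetical spider name as above):

scrapy crawl somespider -s JOBDIR=crawl1

Scrapy creates the directory if it does not exist and stores the pending requests (requests.queue) and the seen-request fingerprints (requests.seen) there, which is what makes resuming possible.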
Answered Nov 07 '22 by Maryam Homayouni

The version of scrapy that I'm using is 1.1.0

You need to set JOBDIR in settings.py:

JOBDIR = 'PROJECT_DIR'

After stopping the spider with Ctrl-C, you can run it again and it will continue scraping where it left off.

It should work after that.
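
If you prefer not to touch settings.py, a job directory can also be set per spider; a minimal sketch (class name and path are made up, requires Scrapy 1.0+):

import scrapy


class SampleSpider(scrapy.Spider):
    name = 'sample'
    # custom_settings overrides the project settings for this spider only
    custom_settings = {'JOBDIR': 'crawls/sample-1'}

    def parse(self, response):
        pass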

Answered Nov 07 '22 by Boseam