How to 'pause' a spider in Scrapy?

Question

I'm using Tor (through Privoxy) for a scraping project, and would like to write a Scrapy extension (cf. https://doc.scrapy.org/en/latest/topics/extensions.html) which requests a new identity (cf. https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor) each time a certain number of items has been scraped.

However, changing identity takes some time (a couple of seconds), during which I expect that nothing can be scraped. Therefore, I would like the extension to 'pause' the spider until the IP change has completed.

Is this possible? (I have read about solutions that use Ctrl+C and a JOBDIR, but this seems a bit drastic, as I only want to pause the spider and not stop the entire engine.)

mizhgun · Accepted Answer

The crawler engine has pause() and unpause() methods, so you can try something like this:

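class SomeExtension(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Build the extension and keep a reference to the crawler,
        # so that its engine can be reached later.
        o = cls(...)
        o.crawler = crawler
        return o

    def change_tor(self):
        # Stop the engine from scheduling new requests during the change.
        self.crawler.engine.pause()
        # some python code implements changing logic
        self.crawler.engine.unpause()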

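A fuller sketch of how this might be wired to the item count from the question is given below. It is only an illustration under some assumptions: the class name TorIdentityExtension, the helper renew_tor_identity, and the setting TOR_RENEW_AFTER_ITEMS are hypothetical names, and the control port 9051 is taken from the stem FAQ example rather than from the question. The identity change itself follows the stem FAQ (authenticate, then send Signal.NEWNYM).

```python
def renew_tor_identity(control_port=9051, password=None):
    """Ask Tor for a new identity over its control port (requires stem)."""
    # Imported here so the extension class can be exercised without stem.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)


class TorIdentityExtension:
    """Pause the engine and request a new Tor identity every N items."""

    def __init__(self, crawler, item_limit, renew=renew_tor_identity):
        self.crawler = crawler
        self.item_limit = item_limit
        self.renew = renew  # callable that performs the identity change
        self.items_seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        from scrapy import signals  # requires Scrapy

        # TOR_RENEW_AFTER_ITEMS is a hypothetical setting name.
        limit = crawler.settings.getint("TOR_RENEW_AFTER_ITEMS", 100)
        ext = cls(crawler, limit)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, spider):
        self.items_seen += 1
        if self.items_seen % self.item_limit == 0:
            self.crawler.engine.pause()
            try:
                self.renew()
            finally:
                # Always resume, even if the identity change failed.
                self.crawler.engine.unpause()
```

To try it, the class would be listed in the EXTENSIONS setting under whatever module path it lives in. Note that renew() is a blocking call here, so the Twisted reactor is stalled while it runs anyway; pause() and unpause() matter most if the identity change is moved to a non-blocking call and unpause() is invoked once it finishes.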
Tags:

python

scrapy

Kurt Peek
