I'm using Tor (through Privoxy) for a scraping project, and would like to write a Scrapy extension (cf. https://doc.scrapy.org/en/latest/topics/extensions.html) which requests a new identity (cf. https://stem.torproject.org/faq.html#how-do-i-request-a-new-identity-from-tor) whenever a certain number of items are scraped.
However, the changing of identity takes some time (a couple of seconds) during which I expect that nothing can be scraped. Therefore, I would like to make the extension 'pause' the spider until the IP change has been completed.
Is this possible? (I have read about solutions that use Ctrl+C and a JOBDIR setting, but that seems a bit drastic, as I only want to pause the spider, not stop the entire engine.)
The crawler engine has pause and unpause methods, so you can try something like this:
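    class SomeExtension(object):

        @classmethod
        def from_crawler(cls, crawler):
            # Pass whatever constructor arguments your extension needs.
            o = cls(...)
            # Keep a reference to the crawler so the engine can be reached later.
            o.crawler = crawler
            return o

        def change_tor(self):
            # Stop the engine from processing further requests.
            self.crawler.engine.pause()
            # some python code implements changing logic
            self.crawler.engine.unpause()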
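To tie this to the "every N items" requirement from the question, the extension could count `item_scraped` signals and call the pause/renew/unpause sequence when a threshold is reached. The following is only a sketch under some assumptions: the class name `TorRenewExtension`, the setting name `TOR_RENEW_EVERY`, and the helper `request_new_identity` are made up for illustration, and the control port 9051 with default authentication is assumed to match your Tor configuration. The NEWNYM call itself follows the Stem FAQ linked in the question.

```python
class TorRenewExtension:
    """Pause the engine every `threshold` scraped items and renew the Tor identity."""

    def __init__(self, crawler, threshold, renew_identity):
        self.crawler = crawler
        self.threshold = threshold
        # Callable that performs the identity change (injected so it can be swapped).
        self.renew_identity = renew_identity
        self.items_seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Imported here so the class can be exercised without Scrapy installed.
        from scrapy import signals

        # TOR_RENEW_EVERY is a hypothetical setting name, not a built-in Scrapy setting.
        threshold = crawler.settings.getint("TOR_RENEW_EVERY", 100)
        ext = cls(crawler, threshold, request_new_identity)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, spider):
        self.items_seen += 1
        if self.items_seen % self.threshold != 0:
            return
        self.crawler.engine.pause()
        try:
            self.renew_identity()
        finally:
            # Always resume, even if the identity change raised an error.
            self.crawler.engine.unpause()


def request_new_identity(control_port=9051):
    """Ask Tor for a new identity via Stem (assumes the control port is enabled)."""
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)
```

The extension would then be enabled through the `EXTENSIONS` setting, using whatever module path you place it under. Note that `request_new_identity` is a blocking call, so the whole process waits while it runs; if you also need to wait a few seconds for the new circuit, that wait would have to happen before `unpause()` is called.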