
Scrapy: non-blocking pause


I have a problem: I need to pause the execution of one function for a while without stopping the crawl as a whole. In other words, I need a non-blocking pause.

It looks like this:

    from scrapy import Request, Spider


    class ScrapySpider(Spider):
        name = 'live_function'

        def start_requests(self):
            yield Request('some url', callback=self.non_stop_function)

        def non_stop_function(self, response):
            for url in ['url1', 'url2', 'url3', 'more urls']:
                yield Request(url, callback=self.second_parse_function)

            # Here I need something that pauses only this function,
            # like time.sleep(10) but without blocking everything else

            yield Request('some url', callback=self.non_stop_function)  # Call itself

        def second_parse_function(self, response):
            pass

non_stop_function needs to pause for a while, but the pause should not block the rest of the crawl.

If I insert time.sleep(), it stops the whole crawler, which I don't want. Is it possible to pause a single function using Twisted or something else?

Reason: I need a non-blocking function that parses a page of the website every n seconds. It collects URLs from that page and then sleeps for 10 seconds. Requests for the URLs it has already collected should keep being processed, but the main function needs to sleep.

UPDATE:

Thanks to TkTech and viach. One answer helped me understand how to create a pending Request, and the other how to activate it. The two answers complement each other, and together they gave me a working non-blocking pause for Scrapy:

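    from scrapy import Request
    from twisted.internet import reactor
    from twisted.internet.defer import Deferred

    # Method of the spider class
    def call_after_pause(self, response):
        # Fire the Deferred with the follow-up Request after 10 seconds,
        # without blocking the reactor in the meantime.
        d = Deferred()
        reactor.callLater(10.0, d.callback, Request(
            'https://example.com/',
            callback=self.non_stop_function,
            dont_filter=True))
        return d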

I then use this function as the callback for my request:

    yield Request('https://example.com/', callback=self.call_after_pause, dont_filter=True)

JRazor asked May 02 '16 14:05



2 Answers

The Request object has a callback parameter; try using it for this purpose. That is, create a Deferred that wraps self.second_parse_function and then pauses.

Here is a rough, untested example; the changed lines are marked.

    from scrapy import Request, Spider
    from twisted.internet.defer import Deferred

    # NOTE: `pause` is not defined in this answer; it is assumed to be a
    # helper, written elsewhere, that waits for the given number of seconds.


    class ScrapySpider(Spider):
        name = 'live_function'

        def start_requests(self):
            yield Request('some url', callback=self.non_stop_function)

        def non_stop_function(self, response):

            parse_and_pause = Deferred()  # changed
            parse_and_pause.addCallback(self.second_parse_function)  # changed
            parse_and_pause.addCallback(pause, seconds=10)  # changed

            for url in ['url1', 'url2', 'url3', 'more urls']:
                yield Request(url, callback=parse_and_pause)  # changed

            yield Request('some url', callback=self.non_stop_function)  # Call itself

        def second_parse_function(self, response):
            pass

If this approach works for you, you can write a function that constructs such a Deferred. It could be implemented like this:

    def get_perform_and_pause_deferred(seconds, fn, *args, **kwargs):
        d = Deferred()
        d.addCallback(fn, *args, **kwargs)
        d.addCallback(pause, seconds=seconds)
        return d

And here is a possible usage:

    class ScrapySpider(Spider):
        name = 'live_function'

        def start_requests(self):
            yield Request('some url', callback=self.non_stop_function)

        def non_stop_function(self, response):
            for url in ['url1', 'url2', 'url3', 'more urls']:
                # changed
                yield Request(url, callback=get_perform_and_pause_deferred(10, self.second_parse_function))

            yield Request('some url', callback=self.non_stop_function)  # Call itself

        def second_parse_function(self, response):
            pass

Viach Kakovskyi answered Sep 22 '22 09:09



If you're attempting to use this for rate limiting, you probably just want to use DOWNLOAD_DELAY instead.
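
For the rate-limiting case, a minimal sketch of a project settings.py could look like the following. The value 10 is only an assumption taken from the 10-second pause in the question; RANDOMIZE_DOWNLOAD_DELAY is switched off here only so that the delay is not randomized around that value.

```python
# settings.py (sketch; the values are assumptions, adjust to your needs)

# Seconds to wait between consecutive requests to the same website.
DOWNLOAD_DELAY = 10

# By default Scrapy randomizes the delay around DOWNLOAD_DELAY;
# disable that to get a fixed delay.
RANDOMIZE_DOWNLOAD_DELAY = False
```

Note that this throttles every request to the site, so it is not a replacement for pausing a single callback.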

Scrapy is just a framework on top of Twisted. For the most part, you can treat it the same as any other Twisted app. Instead of calling sleep, just return the next request to make and tell Twisted to wait a bit. For example:

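    from scrapy import Request
    from twisted.internet import reactor, defer

    def non_stop_function(self, response):
        # Return a Deferred that fires with the next Request after 10 seconds,
        # so the reactor stays free in the meantime.
        d = defer.Deferred()
        reactor.callLater(10.0, d.callback, Request(
            'some url',
            callback=self.non_stop_function
        ))
        return d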

TkTech answered Sep 22 '22 09:09
