 

Scrapy: How to manually insert a request from a spider_idle event callback?

Tags:

python

scrapy

I've created a spider, and have linked a method to the spider_idle event.

How do I add a request manually? I can't simply return it from parse -- parse isn't running in this case, because all known URLs have already been crawled. I have a method that generates new requests, and I'd like to run it from the spider_idle callback to add the request(s) it creates.

# Imports assumed for old-style Scrapy (circa 2013)
from scrapy import signals
from scrapy.exceptions import DontCloseSpider
from scrapy.spider import BaseSpider
from scrapy.xlib.pydispatch import dispatcher

class FooSpider(BaseSpider):
    name = 'foo'

    def __init__(self):
        dispatcher.connect(self.dont_close_me, signals.spider_idle)

    def dont_close_me(self, spider):
        if spider != self:
            return
        # The engine instance will allow me to schedule requests, but
        # how do I get the engine object?
        engine = unknown_get_engine()
        engine.schedule(self.create_request())

        # afterward, ensure we stay alive by raising DontCloseSpider
        raise DontCloseSpider("..I prefer live spiders.")

UPDATE: I've determined that I probably need the ExecutionEngine object, but I don't know exactly how to get that from a spider, though it is available from a Crawler instance.

UPDATE 2: Thanks. It turns out crawler is attached as a property by the superclass, so I can just use self.crawler with no additional effort. >.>

Asked Jun 06 '13 by Mr. B

People also ask

How do you use Scrapy request?

Scrapy crawls websites using Request and Response objects. Request objects are typically generated in the spider and passed across the system until they reach the downloader, which executes the request; the resulting Response object then travels back to the spider's callback.

How do I make a Scrapy request?

Making a request is straightforward in Scrapy. To build one, you need the URL of the webpage you want to extract data from, plus a callback function, which is invoked when the response to that request arrives.
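
A minimal sketch of that request/callback cycle; the spider name and URLs here are illustrative, not from the original post:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # parse() is the default callback for responses to start_urls
        for href in response.css('a::attr(href)').getall():
            # Each new Request names the callback that will receive
            # the Response once it has been downloaded
            yield scrapy.Request(response.urljoin(href), callback=self.parse_page)

    def parse_page(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}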

How do you pass meta in Scrapy?

Essentially, I had to connect to the database, fetch each url and product_id, then scrape the URL while passing along its product id. All of this had to happen in start_requests, because that is the function Scrapy invokes to generate the initial requests, and it has to return (or yield) Request objects.
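
A hedged sketch of that pattern; the database is replaced with a hard-coded list, and all names and URLs are illustrative:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        # Stand-in for rows fetched from a database
        rows = [('https://example.com/p/1', 1),
                ('https://example.com/p/2', 2)]
        for url, product_id in rows:
            # meta carries per-request data through to the callback
            yield scrapy.Request(url, callback=self.parse_product,
                                 meta={'product_id': product_id})

    def parse_product(self, response):
        # The value round-trips on the response object
        yield {'product_id': response.meta['product_id'],
               'title': response.css('title::text').get()}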

How do you add a header on Scrapy?

You need to set the user agent, which Scrapy allows you to do directly:

import scrapy

class QuotesSpider(scrapy.Spider):
    # ...
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0...'
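
For per-request headers, rather than a spider-wide user agent, scrapy.Request also accepts a headers dict. A brief sketch with illustrative values:

import scrapy

class HeaderSpider(scrapy.Spider):
    name = 'headers'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com',
            # Headers set here apply to this request only
            headers={'Referer': 'https://example.com/home'},
            callback=self.parse,
        )

    def parse(self, response):
        yield {'status': response.status}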


1 Answer

# Old-style Scrapy imports (circa 2013), as in the question
from scrapy import signals
from scrapy.exceptions import DontCloseSpider
from scrapy.spider import BaseSpider
from scrapy.xlib.pydispatch import dispatcher

class FooSpider(BaseSpider):
    name = 'foo'

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.dont_close_me, signals.spider_idle)

    def dont_close_me(self, spider):
        if spider != self:
            return

        # The crawler is attached to the spider, and its engine can
        # schedule new requests directly
        self.crawler.engine.crawl(self.create_request(), spider)

        raise DontCloseSpider("..I prefer live spiders.")

Update 2016:

import scrapy

class FooSpider(BaseSpider):
    yet = False  # guard so the idle handler only schedules once

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        from_crawler = super(FooSpider, cls).from_crawler
        spider = from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
        return spider

    def idle(self):
        if not self.yet:
            # Schedule a fresh request through the running engine
            self.crawler.engine.crawl(self.create_request(), self)
            self.yet = True
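
For newer Scrapy releases (2.x), my understanding is that the separate spider argument to engine.crawl() has been deprecated in favor of passing only the request. A hedged sketch of the same idea under that assumption; create_request() is still the assumed helper from the question:

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class FooSpider(scrapy.Spider):
    name = 'foo'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Connect the idle handler through the crawler's signal manager
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self):
        # Schedule a new request; raising DontCloseSpider keeps the
        # spider alive even if nothing was scheduled this time
        self.crawler.engine.crawl(self.create_request())
        raise DontCloseSpider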

Answered Nov 16 '22 by Steven Almeroth