I've created a spider and connected a method to the spider_idle signal.
How do I add a request manually? I can't just return the request from parse -- parse is not running at this point, since all known URLs have already been parsed. I have a method that generates new requests, and I would like to run it from the spider_idle callback to add the request(s) it creates.
class FooSpider(BaseSpider):
    name = 'foo'

    def __init__(self):
        dispatcher.connect(self.dont_close_me, signals.spider_idle)

    def dont_close_me(self, spider):
        if spider != self:
            return
        # The engine instance will allow me to schedule requests, but
        # how do I get the engine object?
        engine = unknown_get_engine()
        engine.schedule(self.create_request())
        # afterward, ensure we stay alive by raising DontCloseSpider
        raise DontCloseSpider("..I prefer live spiders.")
UPDATE: I've determined that I probably need the ExecutionEngine object, but I don't know exactly how to get that from a spider, though it is available from a Crawler instance.
UPDATE 2: Thanks. crawler is attached as a property of the superclass, so I can just use self.crawler with no additional effort.
Scrapy crawls websites using Request and Response objects. Request objects travel through the system: the engine schedules each request, the downloader fetches it, and the resulting Response object is handed back to the spider's callback.
Making a request is straightforward in Scrapy. To build a request you need the URL of the page you want to extract data from, plus a callback function; the callback is invoked when a response to that request arrives.
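A minimal sketch of that pattern (the site and parsing logic here are placeholders, not part of the question):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        # A Request is just a URL plus a callback; the callback runs
        # once the response for that URL has been downloaded.
        yield scrapy.Request('https://quotes.toscrape.com/',
                             callback=self.parse)

    def parse(self, response):
        # 'response' wraps the downloaded page for the request above.
        self.logger.info('Got %s', response.url)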
Essentially, I had to connect to the database, get the url and product_id, then scrape the URL while passing along its product id. All of this had to be done in start_requests, because that is the method Scrapy invokes to produce the initial requests. This method has to return (or yield) Request objects.
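A sketch of that approach, assuming a hypothetical SQLite database products.db with a table of (url, product_id) rows:

import sqlite3
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        # Hypothetical database holding the URLs to scrape.
        conn = sqlite3.connect('products.db')
        for url, product_id in conn.execute(
                'SELECT url, product_id FROM products'):
            # Carry the product id along with the request via meta,
            # so the callback can tie the page back to its record.
            yield scrapy.Request(url, callback=self.parse_product,
                                 meta={'product_id': product_id})
        conn.close()

    def parse_product(self, response):
        product_id = response.meta['product_id']
        # ... extract fields and yield an item tagged with product_id ...
        self.logger.info('Scraped product %s', product_id)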
You need to set the user agent, which Scrapy allows you to do directly:

import scrapy

class QuotesSpider(scrapy.Spider):
    # ...
    user_agent = ('Mozilla/5.0 (Windows NT 6.3; WOW64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0')
from scrapy.spider import BaseSpider
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class FooSpider(BaseSpider):
    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.dont_close_me, signals.spider_idle)

    def dont_close_me(self, spider):
        # spider_idle fires for every spider in the process; only act on ours.
        if spider != self:
            return
        # The running engine is available via the crawler attached to the spider.
        self.crawler.engine.crawl(self.create_request(), spider)
        raise DontCloseSpider("..I prefer live spiders.")
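create_request is never shown in the question; a minimal hypothetical version only needs to return a Request object, along these lines:

from scrapy.http import Request

# Hypothetical helper, written as a method on FooSpider.
def create_request(self):
    # Build the next request from wherever the application gets URLs
    # (a queue, a database, an API, ...). dont_filter=True bypasses
    # the duplicate filter, which matters if the URL may already have
    # been seen during the crawl.
    return Request('http://example.com/next', callback=self.parse,
                   dont_filter=True)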
Update 2016:
import scrapy

# BaseSpider in the original answer; by 2016 the class is scrapy.Spider.
class FooSpider(scrapy.Spider):
    yet = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(FooSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Connect through the crawler's signal manager rather than
        # the deprecated global dispatcher.
        crawler.signals.connect(spider.idle,
                                signal=scrapy.signals.spider_idle)
        return spider

    def idle(self):
        # Schedule exactly one extra request the first time the spider goes idle.
        if not self.yet:
            self.crawler.engine.crawl(self.create_request(), self)
            self.yet = True
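One caveat, per the current Scrapy docs: scheduling requests from a spider_idle handler is not guaranteed to keep the spider open by itself, because the scheduler may reject them (for example as duplicates), in which case the spider simply goes idle again. Raising DontCloseSpider, as in the earlier snippets, is the reliable way to force it to stay alive.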