
Scrapy: How to yield an item after the spider_closed signal?

Tags:

scrapy

I want to yield an item only when the crawling is finished. I am trying to do it via

def spider_closed(self, spider):
    item = EtsyItem()
    item['total_sales'] = 1111111
    yield item

But it does not yield anything, though the function is called. How do I yield an item after the scraping is over?

asked Aug 08 '18 by Billy Jhon

2 Answers

Depending on what you want to do, there might be a veeeery hacky solution for this.

Instead of spider_closed, you may want to consider the spider_idle signal, which is fired before spider_closed. One difference between idle and closed is that spider_idle still allows you to schedule requests, which can then use a callback or errback to yield the desired item.

Inside spider class:

from scrapy import Request, signals  # at module level

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider

# ...

def yield_item(self, failure):
    # The errback receives a Failure; we ignore it and yield the final item.
    yield MyItem(name='myname')

def spider_idle(self, spider):
    # Schedule one last request once the spider has run out of work.
    # The request is expected to fail, so the errback yields the item.
    req = Request('https://fakewebsite123.xyz',
                  callback=lambda response: None,
                  errback=self.yield_item)
    self.crawler.engine.crawl(req, spider)

However, this comes with several side effects, such as the final request raising a DNSLookupError, so I discourage anyone from using it in production. I just want to show what is possible.

answered Nov 20 '22 by nichoio


Oof, I'm afraid spider_closed is used for tearing down. I suppose you could do it by attaching some custom logic to an item pipeline to post-process your items.
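For example, here is a minimal pipeline sketch of that idea; the pipeline name and the 'sales' field are hypothetical, and note that close_spider cannot yield new items either, so the aggregated value has to be logged or persisted by the pipeline itself:

class TotalSalesPipeline:
    # Hypothetical pipeline; adjust the field name to match your items.
    def open_spider(self, spider):
        self.total_sales = 0

    def process_item(self, item, spider):
        # Accumulate whatever you need from each scraped item.
        self.total_sales += item.get('sales', 0)
        return item

    def close_spider(self, spider):
        # Runs once after the crawl finishes; log or persist the result here.
        spider.logger.info('total_sales: %s', self.total_sales)

You would then enable it via the ITEM_PIPELINES setting in settings.py.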

answered Nov 20 '22 by Kevin He