I'm currently building a web app meant to display the data collected by a Scrapy spider. The user makes a request, the spider crawls a website, then returns the data to the app so it can be displayed. I'd like to retrieve the data directly from the scraper, without relying on an intermediary .csv or .json file. Something like:
from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider
url = 'www.example.com'
spider = MySpider()
crawler = CrawlerProcess()
crawler.crawl(spider, start_urls=[url])
crawler.start()
data = crawler.data # this bit
You can pass a variable to the spider and store the data in it as an attribute. Of course, you need to add that attribute in the __init__ method of your spider class.
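A minimal sketch of what that spider might look like (the data keyword argument and the parse body here are illustrative assumptions, not the asker's actual spider):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, data=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The caller passes in a list; the spider appends scraped items to it.
        self.data = data if data is not None else []

    def parse(self, response):
        # Illustrative extraction only; replace with your own parsing logic.
        item = {'title': response.css('title::text').get()}
        self.data.append(item)
        yield item

The driver code then looks like this: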
from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider

url = 'https://www.example.com'
data = []
crawler = CrawlerProcess()
# Note: crawl() takes the spider class, not an instance; extra keyword
# arguments are forwarded to the spider's __init__.
crawler.crawl(MySpider, start_urls=[url], data=data)
crawler.start()
print(data)
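This works because CrawlerProcess.crawl() forwards its extra keyword arguments to the spider's __init__, and the whole crawl runs in the current process, so the spider appends to the very same list object you created.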
This is probably too late, but it may help others: you can pass a callback function to the spider and call that function to return your data, like so:
The dummy spider that we are going to use:
from scrapy import Spider

class Trial(Spider):
    name = 'trial'
    start_urls = []

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # The crawler below passes the callback in through the 'args' dict.
        self.output_callback = kwargs.get('args').get('callback')

    def parse(self, response):
        pass

    def close(self, spider, reason):
        # Runs when the spider closes; hand the collected output back.
        self.output_callback(['Hi, This is the output.'])
A custom class with the callback:
from scrapy.crawler import CrawlerProcess
from scrapyapp.spiders.trial_spider import Trial

class CustomCrawler:

    def __init__(self):
        self.output = None
        self.process = CrawlerProcess(settings={'LOG_ENABLED': False})

    def yield_output(self, data):
        # Receives the spider's output via the callback.
        self.output = data

    def crawl(self, cls):
        self.process.crawl(cls, args={'callback': self.yield_output})
        self.process.start()

def crawl_static(cls):
    # Convenience wrapper: run one crawl and return its output.
    crawler = CustomCrawler()
    crawler.crawl(cls)
    return crawler.output
Then you can do:
out = crawl_static(Trial)
print(out)
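Alternatively, you can collect items without a custom callback by hooking Scrapy's built-in item_scraped signal. A minimal sketch, assuming a spider that actually yields items (QuoteSpider below is purely illustrative; Trial above yields nothing):

import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess

class QuoteSpider(scrapy.Spider):
    # Hypothetical example spider; substitute your own.
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

items = []

def collect(item, response, spider):
    # item_scraped fires once for every item the spider yields.
    items.append(item)

process = CrawlerProcess(settings={'LOG_ENABLED': False})
crawler = process.create_crawler(QuoteSpider)
crawler.signals.connect(collect, signal=signals.item_scraped)
process.crawl(crawler)
process.start()
print(items)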