I am using a script file to run a spider within a Scrapy project, and the spider logs the crawler output/results. But I want to use the spider output/results in some function in that script file. I do not want to save the output/results to any file or DB. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())
d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()
def spider_output(output):
    # do something with that output
    pass
How can I get the spider output in the 'spider_output' method? Is it possible to get the output/results?
This is an old question, but for future reference: if you are working with Python 3.6+, I recommend using scrapyscript, which allows you to run your spiders and get the results in a super simple way:
from scrapyscript import Job, Processor
from scrapy.spiders import Spider
from scrapy import Request
import json

# Define a Scrapy Spider, which can accept *args or **kwargs
# https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
class PythonSpider(Spider):
    name = 'myspider'

    def start_requests(self):
        yield Request(self.url)

    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        return {'url': response.request.url, 'title': title}

# Create jobs for each instance. *args and **kwargs supplied here will
# be passed to the spider constructor at runtime
githubJob = Job(PythonSpider, url='http://www.github.com')
pythonJob = Job(PythonSpider, url='http://www.python.org')

# Create a Processor, optionally passing in a Scrapy Settings object.
processor = Processor(settings=None)

# Start the reactor, and block until all spiders complete.
data = processor.run([githubJob, pythonJob])

# Print the consolidated results
print(json.dumps(data, indent=4))
[
    {
        "title": [
            "Welcome to Python.org"
        ],
        "url": "https://www.python.org/"
    },
    {
        "title": [
            "The world's leading software development platform \u00b7 GitHub",
            "1clr-code-hosting"
        ],
        "url": "https://github.com/"
    }
]
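Since processor.run() returns the scraped items as a plain list of dicts, you can pass them straight into your own function; a minimal sketch, using the field names produced by the spider above:

# Work with the results directly in the script
for item in data:
    print(item['url'], '->', item['title'])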
Here is a solution that collects all output/results in a list:
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.signalmanager import dispatcher

def spider_results():
    results = []

    # called for every item the spider yields
    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    # MySpider is your spider class; import it from your project
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
    return results

if __name__ == '__main__':
    print(spider_results())
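If you prefer not to use the module-level dispatcher (which goes through the older pydispatch API), a variant of the same idea is to connect the handler to the crawler's own signal manager. This is a sketch under the assumption that MySpider is your spider class, passed in as spider_cls:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def spider_results(spider_cls):
    results = []

    # item_scraped handlers receive the item, the response and the spider
    def collect_item(item, response, spider):
        results.append(item)

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks until the crawl is finished
    return results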
AFAIK there is no way to do this, since crawl():
Returns a deferred that is fired when the crawling is finished.
And the crawler doesn't store results anywhere other than outputting them to the logger.
However, returning output would conflict with the whole asynchronous nature and structure of Scrapy, so saving to a file and then reading it is the preferred approach here.
You can simply devise a pipeline that saves your items to a file and read that file in your spider_output. You will receive your results, since reactor.run() blocks your script until the output file is complete anyway.
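A minimal sketch of that file-based approach, assuming a JSON-lines file named items.jl and a pipeline class name chosen for illustration (enable it via ITEM_PIPELINES in your settings):

# pipelines.py
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write each scraped item as one JSON line
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

# in the script, after reactor.run() has returned:
def spider_output():
    with open('items.jl') as f:
        return [json.loads(line) for line in f]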