Running Scrapy from a script with file output

Tags: python, scrapy

I'm currently using Scrapy with the following command line arguments:

scrapy crawl my_spider -o data.json

However, I'd prefer to 'save' this command in a Python script. Following https://doc.scrapy.org/en/latest/topics/practices.html, I have the following script:

import scrapy
from scrapy.crawler import CrawlerProcess

from apkmirror_scraper.spiders.sitemap_spider import ApkmirrorSitemapSpider

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(ApkmirrorSitemapSpider)
process.start() # the script will block here until the crawling is finished

However, it is unclear to me from the documentation what the equivalent of the -o data.json command line argument should be within the script. How can I make the script generate a JSON file?


asked Apr 18 '17 09:04

Kurt Peek

1 Answer

You need to add the FEED_FORMAT and FEED_URI settings to the settings dict you pass to CrawlerProcess:

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'data.json'
})
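
On Scrapy 2.1 and later, FEED_FORMAT and FEED_URI are deprecated in favor of a single FEEDS setting, which maps each output URI to its export options. Here is a minimal sketch of the full script using that setting. The helper names build_feed_settings and run are made up for illustration and are not part of Scrapy; the spider import is the one from the question.

```python
def build_feed_settings(output_path, feed_format="json"):
    """Build a settings dict roughly equivalent to `-o <output_path>`.

    Hypothetical helper for this example, not a Scrapy API.
    """
    return {
        "USER_AGENT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        # FEEDS (Scrapy 2.1+): output URI -> export options for that feed.
        "FEEDS": {
            str(output_path): {"format": feed_format},
        },
    }


def run(spider_cls, output_path="data.json"):
    """Run one spider and export its items to output_path."""
    # Imported lazily so build_feed_settings works without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(build_feed_settings(output_path))
    process.crawl(spider_cls)
    process.start()  # blocks here until the crawl is finished


if __name__ == "__main__":
    # Spider module path taken from the question.
    from apkmirror_scraper.spiders.sitemap_spider import ApkmirrorSitemapSpider

    run(ApkmirrorSitemapSpider)
```

If the script is run repeatedly against the same file, it may be worth checking how your Scrapy version handles an existing output file. Newer releases accept an "overwrite" key in the per-feed options, but that depends on the installed version, so treat it as something to verify in the documentation.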

answered Nov 01 '22 13:11

vold
