Running Scrapy from a script with file output

Tags: python, scrapy

I'm currently using Scrapy with the following command line arguments:

scrapy crawl my_spider -o data.json

However, I'd prefer to 'save' this command in a Python script. Following https://doc.scrapy.org/en/latest/topics/practices.html, I have the following script:

import scrapy
from scrapy.crawler import CrawlerProcess

from apkmirror_scraper.spiders.sitemap_spider import ApkmirrorSitemapSpider

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(ApkmirrorSitemapSpider)
process.start() # the script will block here until the crawling is finished

However, it is unclear to me from the documentation what the equivalent of the -o data.json command line argument should be within the script. How can I make the script generate a JSON file?

Asked by Kurt Peek, Apr 18 '17 09:04



1 Answer

You need to add the FEED_FORMAT and FEED_URI settings to the dictionary you pass to CrawlerProcess:

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'data.json'
})
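
On Scrapy 2.1 and later, FEED_FORMAT and FEED_URI are deprecated in favour of a single FEEDS setting, which maps each output URI to its options. The sketch below shows one way the same script might look with FEEDS. The helper names build_settings and run_spider are hypothetical, introduced only for illustration, and the spider class is assumed to be the ApkmirrorSitemapSpider from the question.

```python
def build_settings(output_path="data.json", feed_format="json"):
    """Build a settings dict that mirrors `scrapy crawl my_spider -o data.json`.

    Hypothetical helper, not part of Scrapy.
    """
    return {
        "USER_AGENT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        # FEEDS maps each output URI to its per-feed options (Scrapy 2.1+).
        "FEEDS": {output_path: {"format": feed_format}},
    }


def run_spider(spider_cls, output_path="data.json"):
    """Run one spider to completion and export its items to output_path.

    Hypothetical helper, not part of Scrapy.
    """
    # Imported here so build_settings() can be used without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(build_settings(output_path))
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl is finished
```

With this in place, importing ApkmirrorSitemapSpider as in the question and calling run_spider(ApkmirrorSitemapSpider) should write the scraped items to data.json. Depending on the Scrapy version and feed options, an existing output file may be appended to rather than replaced, so it can be worth deleting stale output before a run.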
Answered by vold, Nov 01 '22 13:11