I'm currently using Scrapy with the following command line arguments:
scrapy crawl my_spider -o data.json
However, I'd prefer to 'save' this command in a Python script. Following https://doc.scrapy.org/en/latest/topics/practices.html, I have the following script:
import scrapy
from scrapy.crawler import CrawlerProcess
from apkmirror_scraper.spiders.sitemap_spider import ApkmirrorSitemapSpider
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(ApkmirrorSitemapSpider)
process.start() # the script will block here until the crawling is finished
However, it is unclear to me from the documentation what the equivalent of the -o data.json
command line argument should be within the script. How can I make the script generate a JSON file?
Basic ScriptThe key to running scrapy in a python script is the CrawlerProcess class. This is a class of the Crawler module. It provides the engine to run scrapy within a python script. Within the CrawlerProcess class code, python's twisted framework is imported.
Default Structure Scrapy Project cfg - Deploy the configuration file project_name/ - Name of the project _init_.py items.py - It is project's items file pipelines.py - It is project's pipelines file settings.py - It is project's settings file spiders - It is the spiders directory _init_.py spider_name.py . . .
CrawlerProcess . This class will start a Twisted reactor for you, configuring the logging and setting shutdown handlers. This class is the one used by all Scrapy commands. Here's an example showing how to run a single spider with it. import scrapy from scrapy.crawler import CrawlerProcess class MySpider(scrapy.
You need to add the FEED_FORMAT
and FEED_URI
to your CrawlerProcess
:
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'FEED_FORMAT': 'json',
'FEED_URI': 'data.json'
})
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With