ENVIRONMENT: Windows 7, Python 3.6.5, Scrapy 1.5.1
PROBLEM DESCRIPTION:
I have a Scrapy project called project_github which contains three spiders: spider1, spider2, and spider3. Each spider scrapes data from a different website specific to that spider.
I want to automatically export a JSON file named NameOfSpider_TodaysDate.json whenever a particular spider is executed, so that from the command line I can run:
scrapy crawl spider1
and get spider1_181115.json.
Currently I am using item exporters in settings.py with the following code:
import datetime
FEED_URI = 'spider1_' + datetime.datetime.today().strftime('%y%m%d') + '.json'
FEED_FORMAT = 'json'
FEED_EXPORTERS = {'json': 'scrapy.exporters.JsonItemExporter'}
FEED_EXPORT_ENCODING = 'utf-8'
Obviously this code always writes spider1_TodaysDate.json regardless of which spider runs... Any suggestions?
The CrawlerProcess class runs multiple Scrapy spiders in a single process simultaneously. Create a CrawlerProcess instance with the project settings; if a spider needs its own custom settings, create a Crawler instance for that spider.
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. This method, like any other Request callback, must return an iterable of Request and/or item objects.
start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it.
One of the most frequently required features when implementing scrapers is being able to store the scraped data properly and, quite often, that means generating an “export file” with the scraped data (commonly called “export feed”) to be consumed by other systems.
The way to do this is by defining custom_settings as a class attribute on the specific spider we are writing the item exporter for. Spider settings override project settings.
So, for spider1:
import datetime
import scrapy

class spider1(scrapy.Spider):
    name = "spider1"
    allowed_domains = []
    custom_settings = {
        'FEED_URI': 'spider1_' + datetime.datetime.today().strftime('%y%m%d') + '.json',
        'FEED_FORMAT': 'json',
        'FEED_EXPORTERS': {
            'json': 'scrapy.exporters.JsonItemExporter',
        },
        'FEED_EXPORT_ENCODING': 'utf-8',
    }
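If repeating custom_settings in every spider feels redundant, Scrapy's feed URI placeholders can keep the filename per-spider from a single project-wide setting: FEED_URI supports %(name)s, which Scrapy substitutes with the running spider's name when the feed is opened. A sketch of what settings.py could contain (the date is evaluated when settings load, which matches the original code's behavior):

```python
# settings.py -- sketch using Scrapy's %(name)s feed URI placeholder
import datetime

today = datetime.datetime.today().strftime('%y%m%d')

# %(name)s is filled in by Scrapy at crawl time, so each spider gets its
# own file: spider1_181115.json, spider2_181115.json, ...
FEED_URI = '%(name)s_' + today + '.json'
FEED_FORMAT = 'json'
FEED_EXPORT_ENCODING = 'utf-8'
```

With this in place, scrapy crawl spider2 would write spider2_TodaysDate.json without any per-spider custom_settings.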