
Scrapy - Use feed exporter for a particular spider (and not others) in a project

ENVIRONMENT: Windows 7, Python 3.6.5, Scrapy 1.5.1

PROBLEM DESCRIPTION:

I have a Scrapy project called project_github, which contains 3 spiders: spider1, spider2 and spider3. Each of these spiders scrapes data from a different website specific to that spider.

I am trying to automatically export a JSON file when a particular spider is executed, with the format: NameOfSpider_TodaysDate.json, so that from the command line I can:

Execute the command scrapy crawl spider1 and get spider1_181115.json

Currently I am using the feed export settings in settings.py with the following code:

import datetime
FEED_URI = 'spider1_' + datetime.datetime.today().strftime('%y%m%d') + '.json'
FEED_FORMAT = 'json'
FEED_EXPORTERS = {'json': 'scrapy.exporters.JsonItemExporter'}
FEED_EXPORT_ENCODING = 'utf-8'

Obviously this code always writes spider1_TodaysDate.json regardless of the spider used... Any suggestions?

Asked Nov 15 '18 by johnnydoe

People also ask

How do you use multiple spiders in Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders in a process simultaneously. We need to create an instance of CrawlerProcess with the project settings. We need to create an instance of Crawler for the spider if we want to have custom settings for the Spider.
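For illustration, a minimal sketch of running two spiders with CrawlerProcess; the spider classes and URLs below are hypothetical placeholders, and the script is assumed to run inside a Scrapy project so get_project_settings() can find settings.py:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class Spider1(scrapy.Spider):          # hypothetical placeholder spider
    name = "spider1"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

class Spider2(scrapy.Spider):          # hypothetical placeholder spider
    name = "spider2"
    start_urls = ["https://example.org"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

# schedule both spiders, then start the reactor; start() blocks until both finish
process = CrawlerProcess(get_project_settings())
process.crawl(Spider1)
process.crawl(Spider2)
process.start()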

What does parse function do in Scrapy?

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Requests callbacks have the same requirements as the Spider class. This method, as well as any other Request callback, must return an iterable of Request and/or item objects.
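As a rough sketch of that contract, a parse callback can yield both items and follow-up requests; the spider name, site and CSS selectors below are only examples:

import scrapy

class QuotesSpider(scrapy.Spider):     # hypothetical example spider
    name = "quotes_example"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # yield scraped items...
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # ...and/or yield more requests to follow
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)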

What is Start_urls in Scrapy?

start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for it, as in the sketch below.
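A minimal sketch of such a CrawlSpider, assuming a hypothetical site where pagination links match /page/:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RecursiveSpider(CrawlSpider):    # hypothetical example spider
    name = "recursive_example"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com"]   # crawling starts from these links

    rules = (
        # follow pagination links and pass each page to parse_page
        Rule(LinkExtractor(allow=r"/page/"), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {"url": response.url}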

What is a feed export?

One of the most frequently required features when implementing scrapers is being able to store the scraped data properly and, quite often, that means generating an “export file” with the scraped data (commonly called “export feed”) to be consumed by other systems.


1 Answer

The way to do this is to define custom_settings as a class attribute on the specific spider we are writing the feed exporter for. Per-spider settings override project-wide settings.

So, for spider1:

import datetime

import scrapy


class spider1(scrapy.Spider):
    name = "spider1"
    allowed_domains = []

    # these override the project-wide settings from settings.py for this spider only
    custom_settings = {
        'FEED_URI': 'spider1_' + datetime.datetime.today().strftime('%y%m%d') + '.json',
        'FEED_FORMAT': 'json',
        'FEED_EXPORTERS': {
            'json': 'scrapy.exporters.JsonItemExporter',
        },
        'FEED_EXPORT_ENCODING': 'utf-8',
    }
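As a possible alternative, Scrapy's feed storage URIs accept printf-style parameters such as %(name)s (the spider name) and %(time)s (a timestamp), so a single project-wide setting can already produce per-spider file names; note that %(time)s is a full timestamp rather than the exact TodaysDate format asked for:

# settings.py -- project-wide sketch using storage URI parameters
FEED_URI = '%(name)s_%(time)s.json'
FEED_FORMAT = 'json'
FEED_EXPORT_ENCODING = 'utf-8'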
Answered Sep 19 '22 by johnnydoe