 

Export scraping data in multiple formats using scrapy

I'm scraping a website to export the data into a semantic format (n3). However, I also want to perform some data analysis on that data, so having it in a csv format is more convenient.

To get the data in both formats I can do

scrapy crawl spider -t n3 -o data.n3
scrapy crawl spider -t csv -o data.csv

However, this scrapes the data twice, which I cannot afford with large amounts of data.

Is there a way to export the same scraped data into multiple formats? (without downloading the data more than once)

It would be useful to have an intermediate representation of the scraped data that could be exported into different formats, but it seems there is no way to do this with scrapy.
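The intermediate-representation idea can be sketched outside Scrapy with just the standard library: hold the scraped items once in memory, then serialize that same collection into each target format. The `items`, `export_csv`, and `export_json` names below are illustrative, not part of any Scrapy API.

```python
import csv
import json

# Hypothetical scraped items held once in memory as the
# intermediate representation (a list of dicts).
items = [
    {"name": "alpha", "price": 10},
    {"name": "beta", "price": 20},
]

def export_csv(items, path):
    # Write the dicts as CSV rows, using the first item's keys as the header.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
        writer.writeheader()
        writer.writerows(items)

def export_json(items, path):
    # Dump the same list as a JSON array.
    with open(path, "w") as f:
        json.dump(items, f)

# Each exporter consumes the same in-memory data; nothing is scraped twice.
export_csv(items, "data.csv")
export_json(items, "data.json")
```

The answer below applies the same principle inside Scrapy itself, feeding each item to several exporters from a single crawl.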

asked Jun 24 '15 16:06 by kiril

1 Answer

From what I understand after exploring the source code and the documentation, the -t option refers to the FEED_FORMAT setting, which cannot have multiple values. Also, the built-in FeedExporter extension (source) works with a single exporter only.

Consider making a feature request at the Scrapy issue tracker.

As a workaround, define an item pipeline and drive multiple exporters yourself. For example, here is how to export into both CSV and JSON formats:

from collections import defaultdict

from scrapy import signals
from scrapy.exporters import JsonItemExporter, CsvItemExporter


class MyExportPipeline(object):
    def __init__(self):
        self.files = defaultdict(list)

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        csv_file = open('%s_products.csv' % spider.name, 'w+b')
        json_file = open('%s_products.json' % spider.name, 'w+b')

        self.files[spider].append(csv_file)
        self.files[spider].append(json_file)

        self.exporters = [
            JsonItemExporter(json_file),
            CsvItemExporter(csv_file)
        ]

        for exporter in self.exporters:
            exporter.start_exporting()

    def spider_closed(self, spider):
        for exporter in self.exporters:
            exporter.finish_exporting()

        files = self.files.pop(spider)
        for file in files:
            file.close()

    def process_item(self, item, spider):
        for exporter in self.exporters:
            exporter.export_item(item)
        return item
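
For the pipeline to run, it has to be registered in the project's settings.py. The module path myproject.pipelines below is a placeholder for wherever you put the class:

```python
# settings.py -- 'myproject.pipelines' is a placeholder module path
ITEM_PIPELINES = {
    'myproject.pipelines.MyExportPipeline': 300,
}
```

The number (300) is the pipeline's priority; lower-numbered pipelines run first, so pick a value relative to any other pipelines you have.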
answered Nov 07 '22 00:11 by alecxe