I'm scraping a website and exporting the data into a semantic format (n3). However, I also want to perform some data analysis on that data, for which having it in CSV is more convenient.
To get the data in both formats I can run
scrapy crawl spider -t n3 -o data.n3
scrapy crawl spider -t csv -o data.csv
However, this scrapes the data twice, which I can't afford with large amounts of data.
Is there a way to export the same scraped data into multiple formats, without downloading the data more than once?
It would be interesting to have an intermediate representation of the scraped data that could then be exported into different formats, but it seems there is no way to do this with Scrapy.
From what I understand after exploring the source code and the documentation, the -t option refers to the FEED_FORMAT setting, which cannot have multiple values. Also, the built-in FeedExporter extension (source) works with a single exporter only.
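For illustration, this is roughly what the -t/-o flags translate to in settings.py (a minimal sketch; exporting n3 would additionally require a custom exporter registered via FEED_EXPORTERS, which is not shown here):

    # settings.py -- sketch of the legacy feed settings the -t/-o flags map onto;
    # FEED_FORMAT only accepts a single value, hence one crawl per output format
    FEED_FORMAT = 'csv'
    FEED_URI = 'data.csv'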
You might actually consider making a feature request at the Scrapy issue tracker.
As a workaround, you can define an item pipeline that writes each item through multiple exporters. For example, here is how to export into both CSV and JSON formats:
from collections import defaultdict

from scrapy import signals
from scrapy.exporters import JsonItemExporter, CsvItemExporter


class MyExportPipeline(object):
    def __init__(self):
        self.files = defaultdict(list)

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # open one file per output format
        csv_file = open('%s_products.csv' % spider.name, 'w+b')
        json_file = open('%s_products.json' % spider.name, 'w+b')
        self.files[spider].append(csv_file)
        self.files[spider].append(json_file)
        self.exporters = [
            JsonItemExporter(json_file),
            CsvItemExporter(csv_file),
        ]
        for exporter in self.exporters:
            exporter.start_exporting()

    def spider_closed(self, spider):
        for exporter in self.exporters:
            exporter.finish_exporting()
        files = self.files.pop(spider)
        for file in files:
            file.close()

    def process_item(self, item, spider):
        # feed every scraped item to each exporter
        for exporter in self.exporters:
            exporter.export_item(item)
        return item
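Then enable the pipeline in your project settings. A minimal sketch (the module path 'myproject.pipelines' and the priority 300 are placeholders; adjust them to your project):

    # settings.py -- hook the pipeline into the item processing chain
    ITEM_PIPELINES = {
        'myproject.pipelines.MyExportPipeline': 300,
    }

With this in place a single crawl writes both output files, so you can run the spider without any -t/-o flags. For the n3 case from the question, you would swap one of the exporters above for your own n3 exporter class.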