I have made a simple Scrapy spider that I use from the command line to export my data into the CSV format, but the order of the data seem random. How can I order the CSV fields in my output?
I use the following command line to get CSV data:
scrapy crawl somwehere -o items.csv -t csv
According to this Scrapy documentation, I should be able to use the fields_to_export
attribute of the BaseItemExporter
class to control the order. But I am clueless how to use this as I have not found any simple example to follow.
Please Note: This question is very similar to THIS one. However, that question is over 2 years old and doesn't address the many recent changes to Scrapy and neither provides a satisfactory answer, as it requires hacking one or both of:
to address some previous issues, that seem to have already been resolved...
Many thanks in advance.
To use such exporter you need to create your own Item pipeline that will process your spider output. Assuming that you have simple case and you want to have all spider output in one file this is pipeline you should use (pipelines.py
):
from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter
class CSVPipeline(object):
def __init__(self):
self.files = {}
@classmethod
def from_crawler(cls, crawler):
pipeline = cls()
crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
return pipeline
def spider_opened(self, spider):
file = open('%s_items.csv' % spider.name, 'w+b')
self.files[spider] = file
self.exporter = CsvItemExporter(file)
self.exporter.fields_to_export = [list with Names of fields to export - order is important]
self.exporter.start_exporting()
def spider_closed(self, spider):
self.exporter.finish_exporting()
file = self.files.pop(spider)
file.close()
def process_item(self, item, spider):
self.exporter.export_item(item)
return item
Of course you need to remember to add this pipeline in your configuration file (settings.py
):
ITEM_PIPELINES = {'myproject.pipelines.CSVPipeline': 300 }
You can now specify settings in the spider itself. https://doc.scrapy.org/en/latest/topics/settings.html#settings-per-spider
To set the field order for exported feeds, set FEED_EXPORT_FIELDS
.
https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-fields
The spider below dumps all links on a website (written against Scrapy 1.4.0):
import scrapy
from scrapy.http import HtmlResponse
class DumplinksSpider(scrapy.Spider):
name = 'dumplinks'
allowed_domains = ['www.example.com']
start_urls = ['http://www.example.com/']
custom_settings = {
# specifies exported fields and order
'FEED_EXPORT_FIELDS': ["page", "page_ix", "text", "url"],
}
def parse(self, response):
if not isinstance(response, HtmlResponse):
return
a_selectors = response.xpath('//a')
for i, a_selector in enumerate(a_selectors):
text = a_selector.xpath('normalize-space(text())').extract_first()
url = a_selector.xpath('@href').extract_first()
yield {
'page_ix': i + 1,
'page': response.url,
'text': text,
'url': url,
}
yield response.follow(url, callback=self.parse) # see allowed_domains
Run with this command:
scrapy crawl dumplinks --loglevel=INFO -o links.csv
Fields in links.csv
are ordered as specified by FEED_EXPORT_FIELDS
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With