
Scrapy: CSV output without header

Tags: python, scrapy

When I use the command scrapy crawl <project> -o <filename.csv>, I get the output of my Item dictionary with headers. This is good. However, I would like scrapy to omit headers if the file already exists. Is scrapy capable of doing this or do I need to implement that functionality?

drum asked Feb 09 '23 05:02


2 Answers

EDIT (2022.03.09):

This answer was written in 2015 and shows a solution for older versions of Scrapy.

In newer Scrapy (2.1+) you can use the other answer, which sets 'include_headers_line': False


CsvItemExporter has an include_headers_line option (True by default), but I don't know how to set it directly. http://doc.scrapy.org/en/latest/topics/exporters.html#csvitemexporter

But you can create your own exporter with include_headers_line=False in a file exporters.py (in the same folder as settings.py and items.py):

from scrapy.exporters import CsvItemExporter


class HeadlessCsvItemExporter(CsvItemExporter):

    def __init__(self, *args, **kwargs):
        kwargs['include_headers_line'] = False
        super(HeadlessCsvItemExporter, self).__init__(*args, **kwargs)

Then you have to set this exporter in settings.py

FEED_EXPORTERS = {
    'csv': 'your_project_name.exporters.HeadlessCsvItemExporter',
}

Now Scrapy should write the CSV file without headers:

scrapy crawl <project> -o <filename.csv>

Or you can set

FEED_EXPORTERS = {
    'headless': 'your_project_name.exporters.HeadlessCsvItemExporter',
}

and get CSV without headers only when you use -t headless:

scrapy crawl <project> -o <filename.csv> -t headless

P.S. Don't forget to use your project name in place of your_project_name in settings.py.


EDIT:

Now the exporter skips the headers only if the file is not empty (that is, if file.tell() > 0):

from scrapy.exporters import CsvItemExporter


class HeadlessCsvItemExporter(CsvItemExporter):

    def __init__(self, *args, **kwargs):

        # args[0] is the (already opened) file handle
        # if the file is not empty, skip the headers
        if args[0].tell() > 0:
            kwargs['include_headers_line'] = False

        super(HeadlessCsvItemExporter, self).__init__(*args, **kwargs)
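
The tell() check can be tried without Scrapy at all. Here is a minimal sketch using only the standard csv module. The function append_rows and the column names are made up for illustration and are not part of Scrapy. It assumes the file is opened in append mode, where the position starts at the end of the file, so tell() == 0 means the file is new or empty.

```python
import csv


def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV file, writing the header only if the file is empty."""
    # newline="" lets the csv module control the line endings itself
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        # In append mode the position starts at the end of the file,
        # so a position of 0 means the file is new or empty.
        if handle.tell() == 0:
            writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical data: running the script twice appends rows but keeps one header.
    append_rows("items.csv", [{"name": "a", "price": "1"}], ["name", "price"])
```

Calling append_rows a second time on the same path adds rows only, which is the same behaviour the exporter above aims for.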
furas answered Feb 12 '23 10:02


The following settings.py worked for me.

FEEDS = {
    '<filename.csv>': {
        'format': 'csv',
        'item_export_kwargs': {
           'include_headers_line': False,
        },
    }
}
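
Since settings.py is ordinary Python, the flag could be computed instead of hard-coded, which is closer to what the question asks for (no header when the file already has content). This is only a rough sketch: build_feeds and items.csv are hypothetical names, and it assumes the feed is a local file that Scrapy appends to rather than overwrites.

```python
from pathlib import Path


def build_feeds(path):
    """Return a FEEDS dict that drops the CSV header if `path` already has content."""
    target = Path(path)
    # A missing or zero-byte file still needs the header line
    has_data = target.exists() and target.stat().st_size > 0
    return {
        path: {
            "format": "csv",
            "item_export_kwargs": {
                "include_headers_line": not has_data,
            },
        }
    }


# In settings.py (the file name is an assumption for illustration)
FEEDS = build_feeds("items.csv")
```

The check runs once when the settings are loaded, so it reflects the state of the file before the crawl starts.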
JJ41 answered Feb 12 '23 10:02