
How to handle csv file with duplicate fieldnames when reading with csv.DictReader?

Tags: python, csv

I am working with a poorly-formed CSV file; it has duplicate fieldnames.

csv.DictReader keeps only one value per fieldname: when two columns share a name, the contents of the later column overwrite the contents of the earlier one. I need the contents of both columns.

I can't assign DictReader.fieldnames by hand either: there are about one hundred columns, and the number of columns differs from file to file, e.g.:

product, price1, price2, price1,...,price100
car, 100, 300, 200,...,350

output: {'product':'car', 'price1': 200, 'price2':300}

I need: {'product':'car', 'price1': 100, 'price2':300, 'price3': 200}
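
The overwriting can be reproduced with a small in-memory sample. This is only a sketch: the two-line CSV below is made up to mirror the header above, and `io.StringIO` stands in for the real file.

```python
import csv
import io

# Made-up sample mirroring the header above; 'price1' appears twice.
sample = io.StringIO(
    "product,price1,price2,price1\n"
    "car,100,300,200\n"
)

rows = list(csv.DictReader(sample))
# The later 'price1' column wins; the earlier value (100) is lost.
print(rows[0])
```

Note that the csv module returns every value as a string, so the prices come back as '200' and '300' rather than integers.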

How can I do this?

Eugene Alkhouski asked Aug 02 '15 11:08




1 Answer

Don't use a DictReader() in this case. Stick to a regular reader instead.

You can always map to a dictionary based on a re-mapped list of fieldnames:

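import csv

# Python 3: open in text mode with newline='' (mode 'rb' applies to Python 2 only)
with open(filename, newline='') as csvfile:
    reader = csv.reader(csvfile)
    # The first row is the header; remap() makes its names unique
    fieldnames = remap(next(reader))
    for row in reader:
        row = dict(zip(fieldnames, row))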

where the remap() function could either renumber your numbered columns or append extra information if column names are duplicated.

Re-numbering could be as easy as:

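from itertools import count

def remap(fieldnames):
    # Renumber every column whose name starts with 'price' as price1, price2, ...
    # in order of appearance; all other names pass through unchanged.
    price_count = count(1)
    return ['price{}'.format(next(price_count)) if f.startswith('price') else f
            for f in fieldnames]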
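
The other option, appending extra information to duplicated names, could be sketched roughly as follows. The helper name `remap_duplicates`, the `_2`, `_3` suffix scheme, and the in-memory sample are all assumptions made for illustration; they are not part of the csv module.

```python
import csv
import io
from collections import Counter

def remap_duplicates(fieldnames):
    """Hypothetical helper: suffix repeated names with _2, _3, ..."""
    seen = Counter()
    result = []
    for name in fieldnames:
        seen[name] += 1
        # First occurrence keeps its name; later ones get a numeric suffix
        result.append(name if seen[name] == 1
                      else '{}_{}'.format(name, seen[name]))
    return result

# Made-up sample standing in for the real file
sample = io.StringIO(
    "product,price1,price2,price1\n"
    "car,100,300,200\n"
)
reader = csv.reader(sample)
fieldnames = remap_duplicates(next(reader))
rows = [dict(zip(fieldnames, row)) for row in reader]
print(rows[0])
# {'product': 'car', 'price1': '100', 'price2': '300', 'price1_2': '200'}
```

This keeps the original names recognisable, but it does not guard against a file that already contains a column literally named `price1_2`. If the real header has spaces after the commas, passing `skipinitialspace=True` to `csv.reader` strips them before the names are compared.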
Martijn Pieters answered Oct 30 '22 12:10
