How do I merge two CSV files based on a field and keep the same number of attributes on each record?

Question

I am attempting to merge two CSV files based on a specific field in each file.

file1.csv

id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"

file2.csv

id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False

This is the code I am using:

import csv
from collections import OrderedDict

with open('file2.csv','r') as f2:
    reader = csv.reader(f2)
    fields2 = next(reader,None) # Skip headers
    dict2 = {row[0]: row[1:] for row in reader}

with open('file1.csv','r') as f1:
    reader = csv.reader(f1)
    fields1 = next(reader,None) # Skip headers
    dict1 = OrderedDict((row[0], row[1:]) for row in reader)

result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result.setdefault(key, []).extend(value)

with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for key, value in result.iteritems():
        w.writerow([key] + value)

I get output like this, which merges appropriately, but does not have the same number of attributes for all rows:

1,True,7,Purple
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure

file2 will not have a record for every id in file1. For ids that are missing from file2, I'd like the merged file to contain empty fields in place of the file2 attributes. For example, id 1 would look like this:

1,True,7,Purple,,,

How can I add the empty fields to records that don't have data in file2 so that all of my records in the merged CSV have the same number of attributes?

DSM · Accepted Answer

If we're not using pandas, I'd refactor to something like

import csv
from collections import OrderedDict

filenames = "file1.csv", "file2.csv"
data = OrderedDict()  # id -> merged row, in first-seen order
fieldnames = []
for filename in filenames:
    with open(filename, "rb") as fp: # python 2
        reader = csv.DictReader(fp)
        fieldnames.extend(reader.fieldnames)
        for row in reader:
            # merge this row into whatever we already have for its id
            data.setdefault(row["id"], {}).update(row)

# drop duplicate column names (the repeated "id"), keeping order
fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged.csv", "wb") as fp:
    writer = csv.writer(fp)
    writer.writerow(fieldnames)
    for row in data.itervalues():
        # row.get(field, '') pads columns this id has no value for
        writer.writerow([row.get(field, '') for field in fieldnames])

which gives

id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope,,,
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure,,,
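
The code above targets Python 2 (binary file modes and `itervalues`). A rough Python 3 port might look like the sketch below. The function name `merge_csv` and its parameters are hypothetical, chosen only for illustration, and it assumes Python 3.7+ so that a plain `dict` keeps insertion order. It also swaps the manual padding for `csv.DictWriter`, whose `restval` argument supplies the value written for any column a row lacks.

```python
import csv


def merge_csv(filenames, out_path, key="id"):
    """Merge CSV files on `key`, padding missing fields with empty strings.

    Hypothetical helper: the name and signature are illustrative only.
    """
    data = {}        # key value -> merged row (dicts keep insertion order in 3.7+)
    fieldnames = []  # union of all header names, in first-seen order
    for filename in filenames:
        # newline="" lets the csv module handle line endings itself
        with open(filename, newline="") as fp:
            reader = csv.DictReader(fp)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                data.setdefault(row[key], {}).update(row)

    with open(out_path, "w", newline="") as fp:
        # restval="" is written for every column a row has no value for
        writer = csv.DictWriter(fp, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(data.values())
```

Calling `merge_csv(["file1.csv", "file2.csv"], "merged.csv")` should then write the same rows as shown above, with ids that appear only in one file padded with empty fields.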

For comparison, the pandas equivalent would be something like

import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
merged = df1.merge(df2, on="id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)

which is much simpler to my eyes, and means you can spend more time dealing with your data and less time reinventing wheels.
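
One thing to watch with the pandas route: `read_csv` infers column types by default, so values can be written back in a different form than they were read (for example, an integer sitting in a mostly-float column such as attr2 would come out with a trailing `.0`). If the goal is to carry the text through unchanged, a sketch along these lines may help. The helper name `merge_as_text` is hypothetical, and `how="left"` is an assumption based on the question's statement that file1 holds every id; pass `how="outer"` to mirror the snippet above.

```python
import pandas as pd


def merge_as_text(left, right, key="id", how="left"):
    """Merge two CSVs on `key`, treating every value as plain text.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # dtype=str keeps each value as the text found in the file;
    # keep_default_na=False stops strings such as "NA" being read as missing.
    df1 = pd.read_csv(left, dtype=str, keep_default_na=False)
    df2 = pd.read_csv(right, dtype=str, keep_default_na=False)
    # Rows with no match in `right` get NaN, which fillna turns into "".
    return df1.merge(df2, on=key, how=how).fillna("")
```

Usage would be `merge_as_text("file1.csv", "file2.csv").to_csv("merged.csv", index=False)`.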

Tags: python, merge, csv

Andy
