I am attempting to merge two CSV files based on a specific field in each file.
file1.csv
id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"
file2.csv
id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False
This is the code I am using:
import csv
from collections import OrderedDict
with open('file2.csv','r') as f2:
reader = csv.reader(f2)
fields2 = next(reader,None) # Skip headers
dict2 = {row[0]: row[1:] for row in reader}
with open('file1.csv','r') as f1:
reader = csv.reader(f1)
fields1 = next(reader,None) # Skip headers
dict1 = OrderedDict((row[0], row[1:]) for row in reader)
result = OrderedDict()
for d in (dict1, dict2):
for key, value in d.iteritems():
result.setdefault(key, []).extend(value)
with open('merged.csv', 'wb') as f:
w = csv.writer(f)
for key, value in result.iteritems():
w.writerow([key] + value)
I get output like this, which merges appropriately, but does not have the same number of attributes for all rows:
1,True,7,Purple
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure
file2
will not have a record for every id
in file1
. I'd like the output to have empty fields from file2
in the merged file. For example, id
1 would look like this:
1,True,7,Purple,,,
How can I add the empty fields to records that don't have data in file2
so that all of my records in the merged CSV have the same number of attributes?
If we're not using pandas
, I'd refactor to something like
import csv
from collections import OrderedDict
filenames = "file1.csv", "file2.csv"
data = OrderedDict()
fieldnames = []
for filename in filenames:
with open(filename, "rb") as fp: # python 2
reader = csv.DictReader(fp)
fieldnames.extend(reader.fieldnames)
for row in reader:
data.setdefault(row["id"], {}).update(row)
fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged.csv", "wb") as fp:
writer = csv.writer(fp)
writer.writerow(fieldnames)
for row in data.itervalues():
writer.writerow([row.get(field, '') for field in fieldnames])
which gives
id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope,,,
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure,,,
For comparison, the pandas
equivalent would be something like
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
merged = df1.merge(df2, on="id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)
which is much simpler to my eyes, and means you can spend more time dealing with your data and less time reinventing wheels.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With