I have a bunch of CSV files (only two in the example below). Each CSV file has 6 columns. I want to go into each CSV file, copy the first two columns and add them as new columns to an existing CSV file.
Thus far I have:
import csv

f = open('combined.csv')
data = [item for item in csv.reader(f)]
f.close()

for x in range(1, 3):  # example has 2 csv files, this will be automated
    n = 0
    while n < 2:
        f = open(str(x) + ".csv")
        new_column = [item[n] for item in csv.reader(f)]
        f.close()
        new_data = []
        for i, item in enumerate(data):
            try:
                item.append(new_column[i])
            except IndexError:
                item.append("")
            new_data.append(item)
        f = open('combined.csv', 'w')
        csv.writer(f).writerows(new_data)
        f.close()
        n = n + 1
This works; it is not pretty, but it works. However, I have three minor annoyances:

1. I open each CSV file twice (once for each column), which is hardly elegant.
2. When I print the combined.csv file, it prints an empty row following each row.
3. I have to provide a combined.csv file that has at least as many rows in it as the largest file I may have. Since I do not really know what that number may be, that kinda sucks.
As always, any help is much appreciated!!
As requested: 1.csv looks like (mock data)
1,a
2,b
3,c
4,d
2.csv looks like
5,e
6,f
7,g
8,h
9,i
the combined.csv file should look like
1,a,5,e
2,b,6,f
3,c,7,g
4,d,8,h
,,9,i
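The empty rows mentioned in the second annoyance are almost always the csv module's newline handling on Windows: under Python 2 the output file must be opened in binary mode ('wb'), and under Python 3 with newline=''. A minimal Python 3 sketch (using a throwaway out.csv, not the real combined.csv):

```python
import csv

# Opening with newline='' stops csv.writer from emitting \r\r\n on Windows,
# which is what shows up as a blank row after every record.
with open('out.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['1', 'a'], ['2', 'b']])

with open('out.csv', newline='') as f:
    rows = list(csv.reader(f))

print(rows)  # [['1', 'a'], ['2', 'b']] -- no blank rows
```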
import csv
import itertools as IT

filenames = ['1.csv', '2.csv']
handles = [open(filename, 'r', newline='') for filename in filenames]
readers = [csv.reader(f, delimiter=',') for f in handles]

with open('combined.csv', 'w', newline='') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n')
    for rows in IT.zip_longest(*readers, fillvalue=[''] * 2):
        combined_row = []
        for row in rows:
            row = row[:2]  # select the columns you want
            if len(row) == 2:
                combined_row.extend(row)
            else:
                combined_row.extend([''] * 2)  # pad with two empty columns
        writer.writerow(combined_row)

for f in handles:
    f.close()

The line for rows in IT.zip_longest(*readers, fillvalue=['']*2): can be understood with an example:

In [1]: import itertools as IT

In [2]: readers = [(1, 2, 3), ('a', 'b', 'c', 'd'), (10, 20, 30, 40)]

In [3]: list(IT.zip_longest(readers[0], readers[1], readers[2]))
Out[3]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

As you can see, IT.zip_longest behaves very much like zip, except that it does not stop until the longest iterable is consumed. By default it fills in missing items with None.

Now what happens if there were more than 3 items in readers? We would want to write

list(IT.zip_longest(readers[0], readers[1], readers[2], ...))

but that's laborious, and if we did not know len(readers) in advance, we wouldn't even be able to replace the ellipsis (...) with something explicit.

Python has a solution for this: the star (aka argument unpacking) syntax:

In [4]: list(IT.zip_longest(*readers))
Out[4]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

Notice that the result Out[4] is identical to the result Out[3]. The *readers tells Python to unpack the items in readers and send them along as individual arguments to IT.zip_longest. This is how Python allows us to send an arbitrary number of arguments to a function.
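The star also appears on the defining side: a parameter written *args collects any number of positional arguments into a tuple. A minimal sketch (the total helper here is just an illustration, not part of the code above):

```python
def total(*args):
    # args is a tuple of whatever positional arguments were passed
    return sum(args)

nums = [1, 2, 3, 4]
print(total(1, 2))    # two explicit arguments -> 3
print(total(*nums))   # the list unpacked into four arguments -> 10
```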
These days it seems almost obligatory for someone to give a pandas-based solution to any data processing problem in Python. So here's mine:
import pandas as pd

to_merge = ['{}.csv'.format(i) for i in range(4)]
dfs = []
for filename in to_merge:
    # read the csv, making sure the first two columns are str
    df = pd.read_csv(filename, header=None, converters={0: str, 1: str})
    # throw away all but the first two columns
    df = df.iloc[:, :2]
    # change the column names so they won't collide during concatenation
    df.columns = [filename + str(cname) for cname in df.columns]
    dfs.append(df)

# concatenate them horizontally
merged = pd.concat(dfs, axis=1)
# write it out
merged.to_csv("merged.csv", header=False, index=False)
which for the files
~/coding/pand/merge$ cat 0.csv
0,a,6,5,3,7
~/coding/pand/merge$ cat 1.csv
1,b,7,6,7,0
2,c,0,1,8,7
3,d,6,8,4,5
4,e,8,4,2,4
~/coding/pand/merge$ cat 2.csv
5,f,6,2,9,1
6,g,0,3,2,7
7,h,6,5,1,9
~/coding/pand/merge$ cat 3.csv
8,i,9,1,7,1
9,j,0,9,3,9
gives
In [21]: !cat merged.csv
0,a,1,b,5,f,8,i
,,2,c,6,g,9,j
,,3,d,7,h,,
,,4,e,,,,
In [22]: pd.read_csv("merged.csv", header=None)
Out[22]:
0 1 2 3 4 5 6 7
0 0 a 1 b 5 f 8 i
1 NaN NaN 2 c 6 g 9 j
2 NaN NaN 3 d 7 h NaN NaN
3 NaN NaN 4 e NaN NaN NaN NaN
which I think is the right alignment.
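One caveat: when the merged file is read back, the padding cells come in as NaN, as Out[22] above shows. If empty strings are preferred, read_csv can be told not to convert blanks with keep_default_na=False; a small sketch using inline mock data (io.StringIO stands in for merged.csv):

```python
import io

import pandas as pd

# Two mock rows of a merged file where the first input ran out of rows.
csv_text = "0,a,1,b\n,,2,c\n"

# keep_default_na=False leaves blank fields as '' instead of NaN.
df = pd.read_csv(io.StringIO(csv_text), header=None, keep_default_na=False)
print(df.iloc[1, 0] == "")  # True
```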