I have a bunch of CSV files (only two in the example below). Each CSV file has 6 columns. I want to go into each CSV file, copy the first two columns and add them as new columns to an existing CSV file.
Thus far I have:
import csv

f = open('combined.csv')
data = [item for item in csv.reader(f)]
f.close()

for x in range(1, 3):  # example has 2 csv files, this will be automated
    n = 0
    while n < 2:
        f = open(str(x) + ".csv")
        new_column = [item[n] for item in csv.reader(f)]
        f.close()
        new_data = []
        for i, item in enumerate(data):
            try:
                item.append(new_column[i])
            except IndexError:
                item.append("")
            new_data.append(item)
        f = open('combined.csv', 'w')
        csv.writer(f).writerows(new_data)
        f.close()
        n = n + 1
This works; it is not pretty, but it works. However, I have three minor annoyances:

1. I open each CSV file twice (once for each column), which is hardly elegant.
2. When I print the combined.csv file, it prints an empty row following each row.
3. I have to provide a combined.csv file that has at least as many rows in it as the largest file I may have. Since I do not really know what that number may be, that kinda sucks.
As always, any help is much appreciated!!
As requested: 1.csv looks like (mock data)
1,a
2,b
3,c
4,d
2.csv looks like
5,e
6,f
7,g
8,h
9,i
the combined.csv file should look like
1,a,5,e
2,b,6,f
3,c,7,g
4,d,8,h
,,9,i
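The empty rows mentioned in the second annoyance are almost always the csv module's newline handling on Windows: under Python 2 the output file must be opened in binary mode ('wb'), and under Python 3 with newline=''. A minimal Python 3 sketch (using a throwaway out.csv, not the real combined.csv):

```python
import csv

# Opening with newline='' stops csv.writer from emitting \r\r\n on Windows,
# which is what shows up as a blank row after every record.
with open('out.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['1', 'a'], ['2', 'b']])

with open('out.csv', newline='') as f:
    rows = list(csv.reader(f))

print(rows)  # [['1', 'a'], ['2', 'b']] -- no blank rows
```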
import csv
import itertools as IT

filenames = ['1.csv', '2.csv']
handles = [open(filename, 'r', newline='') for filename in filenames]
readers = [csv.reader(f, delimiter=',') for f in handles]

with open('combined.csv', 'w', newline='') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n')
    for rows in IT.zip_longest(*readers, fillvalue=[''] * 2):
        combined_row = []
        for row in rows:
            row = row[:2]  # select the columns you want
            if len(row) == 2:
                combined_row.extend(row)
            else:
                combined_row.extend([''] * 2)  # pad with two empty columns
        writer.writerow(combined_row)

for f in handles:
    f.close()

The line for rows in IT.zip_longest(*readers, fillvalue=['']*2): can be understood with an example:

In [1]: import itertools as IT

In [2]: readers = [(1, 2, 3), ('a', 'b', 'c', 'd'), (10, 20, 30, 40)]

In [3]: list(IT.zip_longest(readers[0], readers[1], readers[2]))
Out[3]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

As you can see, IT.zip_longest behaves very much like zip, except that it does not stop until the longest iterable is consumed. By default it fills in missing items with None.

Now what happens if there were more than 3 items in readers? We would want to write

list(IT.zip_longest(readers[0], readers[1], readers[2], ...))

but that's laborious, and if we did not know len(readers) in advance, we wouldn't even be able to replace the ellipsis (...) with something explicit.

Python has a solution for this: the star (aka argument unpacking) syntax:

In [4]: list(IT.zip_longest(*readers))
Out[4]: [(1, 'a', 10), (2, 'b', 20), (3, 'c', 30), (None, 'd', 40)]

Notice that the result Out[4] is identical to the result Out[3]. The *readers tells Python to unpack the items in readers and send them along as individual arguments to IT.zip_longest. This is how Python allows us to send an arbitrary number of arguments to a function.
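The star also appears on the defining side: a parameter written *args collects any number of positional arguments into a tuple. A minimal sketch (the total helper here is just an illustration, not part of the code above):

```python
def total(*args):
    # args is a tuple of whatever positional arguments were passed
    return sum(args)

nums = [1, 2, 3, 4]
print(total(1, 2))    # two explicit arguments -> 3
print(total(*nums))   # the list unpacked into four arguments -> 10
```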
These days it seems almost obligatory for someone to give a pandas-based solution to any data processing problem in Python. So here's mine:
import pandas as pd

to_merge = ['{}.csv'.format(i) for i in range(4)]
dfs = []
for filename in to_merge:
    # read the csv, making sure the first two columns are str
    df = pd.read_csv(filename, header=None, converters={0: str, 1: str})
    # throw away all but the first two columns
    df = df.iloc[:, :2]
    # change the column names so they won't collide during concatenation
    df.columns = [filename + str(cname) for cname in df.columns]
    dfs.append(df)

# concatenate them horizontally
merged = pd.concat(dfs, axis=1)
# write it out
merged.to_csv("merged.csv", header=False, index=False)
which for the files
~/coding/pand/merge$ cat 0.csv
0,a,6,5,3,7
~/coding/pand/merge$ cat 1.csv
1,b,7,6,7,0
2,c,0,1,8,7
3,d,6,8,4,5
4,e,8,4,2,4
~/coding/pand/merge$ cat 2.csv
5,f,6,2,9,1
6,g,0,3,2,7
7,h,6,5,1,9
~/coding/pand/merge$ cat 3.csv
8,i,9,1,7,1
9,j,0,9,3,9
gives
In [21]: !cat merged.csv
0,a,1,b,5,f,8,i
,,2,c,6,g,9,j
,,3,d,7,h,,
,,4,e,,,,
In [22]: pd.read_csv("merged.csv", header=None)
Out[22]:
0 1 2 3 4 5 6 7
0 0 a 1 b 5 f 8 i
1 NaN NaN 2 c 6 g 9 j
2 NaN NaN 3 d 7 h NaN NaN
3 NaN NaN 4 e NaN NaN NaN NaN
which I think is the right alignment.
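One caveat: when the merged file is read back, the padding cells come in as NaN, as Out[22] above shows. If empty strings are preferred, read_csv can be told not to convert blanks with keep_default_na=False; a small sketch using inline mock data (io.StringIO stands in for merged.csv):

```python
import io

import pandas as pd

# Two mock rows of a merged file where the first input ran out of rows.
csv_text = "0,a,1,b\n,,2,c\n"

# keep_default_na=False leaves blank fields as '' instead of NaN.
df = pd.read_csv(io.StringIO(csv_text), header=None, keep_default_na=False)
print(df.iloc[1, 0] == "")  # True
```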