I currently have two sets of data files that look like this:
File 1:
test1 ba ab cd dh gf
test2 fa ab cd dh gf
test3 rt ty er wq ee
test4 er rt sf sd sa
and in file 2:
test1 123 344 123
test1 234 567 787
test1 221 344 566
test3 456 121 677
I would like to combine the files based on matching rows in the first column (so that the "tests" match up), like so:
test1 ba ab cd dh gf 123 344 123
test1 ba ab cd dh gf 234 567 787
test1 ba ab cd dh gf 221 344 566
test3 rt ty er wq ee 456 121 677
I have this code:
def combineFiles(file1, file2, outfile):
    def read_file(file):
        data = {}
        for line in csv.reader(file):
            data[line[0]] = line[1:]
        return data
    with open(file1, 'r') as f1, open(file2, 'r') as f2:
        data1 = read_file(f1)
        data2 = read_file(f2)
        with open(outfile, 'w') as out:
            wtr = csv.writer(out)
            for key in data1.keys():
                try:
                    wtr.writerow(((key), ','.join(data1[key]), ','.join(data2[key])))
                except KeyError:
                    pass
However, the output ends up looking like this:
test1 ba ab cd dh gf 123 344 123
test3 er rt sf sd sa 456 121 677
Can anyone help me with how to make the output so that test1 can be printed all three times?
Much Appreciated
We can join columns from two DataFrames using the merge() function, which is similar to a SQL join. You specify the type of join you want with the how parameter.
To merge two pandas DataFrames on a common column, pass that column's name as the on parameter.
With how='inner' (the default), only keys present in both DataFrames appear in the result, even when the two DataFrames have different numbers of rows. With how='left', every row of the left DataFrame (df1) is kept, and the columns from df2 are filled in wherever the key matches.
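To make the difference between the join types concrete, here is a minimal sketch. The frames and the column names "a" and "x" are made up for illustration; only the "testnum" keys mirror the data in the question.

```python
import pandas as pd

# Hypothetical frames; the keys mirror the first column of the two files.
left = pd.DataFrame({"testnum": ["test1", "test2", "test3", "test4"],
                     "a": ["ba", "fa", "rt", "er"]})
right = pd.DataFrame({"testnum": ["test1", "test1", "test1", "test3"],
                      "x": [123, 234, 221, 456]})

# Inner join: only keys found in both frames; test1 appears three times.
inner = pd.merge(left, right, on="testnum", how="inner")

# Left join: every row of `left` is kept; test2 and test4 get NaN in "x".
left_join = pd.merge(left, right, on="testnum", how="left")

print(inner)
print(left_join)
```

The inner join is the one that matches the output asked for in the question, since test2 and test4 have no partner in the second file.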
You might want to give the Pandas library a try; it makes cases like this easy:
>>> import pandas as pd
>>> pd.merge(df1, df2, on='testnum', how='inner')
   testnum  1_x  2_x  3_x   4   5  1_y  2_y  3_y
0    test1   ba   ab   cd  dh  gf  123  344  123
1    test1   ba   ab   cd  dh  gf  234  567  787
2    test1   ba   ab   cd  dh  gf  221  344  566
3    test3   rt   ty   er  wq  ee  456  121  677
This assumes the test column is named "testnum".
>>> df1
   testnum   1   2   3   4   5
0    test1  ba  ab  cd  dh  gf
1    test2  fa  ab  cd  dh  gf
2    test3  rt  ty  er  wq  ee
3    test4  er  rt  sf  sd  sa
>>> df2
   testnum    1    2    3
0    test1  123  344  123
1    test1  234  567  787
2    test1  221  344  566
3    test3  456  121  677
You'd read these in with pd.read_csv().
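A rough end-to-end sketch of that, assuming the files are space-separated and have no header row. The io.StringIO objects stand in for the real files, and the column names are chosen here to match the frames shown above; with real data you would pass the file paths instead.

```python
import io
import pandas as pd

# Stand-ins for the two files; in practice pass the file paths to read_csv.
file1 = io.StringIO("test1 ba ab cd dh gf\ntest2 fa ab cd dh gf\n"
                    "test3 rt ty er wq ee\ntest4 er rt sf sd sa\n")
file2 = io.StringIO("test1 123 344 123\ntest1 234 567 787\n"
                    "test1 221 344 566\ntest3 456 121 677\n")

# No header row in the data, so supply the column names explicitly.
df1 = pd.read_csv(file1, sep=" ", header=None,
                  names=["testnum", "1", "2", "3", "4", "5"])
df2 = pd.read_csv(file2, sep=" ", header=None,
                  names=["testnum", "1", "2", "3"])

merged = pd.merge(df1, df2, on="testnum", how="inner")

# Write the result space-separated, without the index or a header row.
out = io.StringIO()
merged.to_csv(out, sep=" ", header=False, index=False)
print(out.getvalue())
```

Replacing `out` with an output path in to_csv() would write the combined rows straight to disk.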
While I would recommend Brad Solomon's approach as it's pretty succinct, you just need a small change in your code.
Since your second file is the one that has the "final say", you just need to create a dictionary for the first file. Then you can write the output file as you read from the second file, fetching values from the data1 dictionary as you go:
with open(file1, 'r') as f1, open(file2, 'r') as f2:
    # read_file must split on the same delimiter as the data (here, a space)
    data1 = read_file(f1)
    with open(outfile, 'w', newline='') as out:
        wtr = csv.writer(out, delimiter=' ')
        for line in csv.reader(f2, delimiter=' '):
            # only write if there is a corresponding line in file1
            if line[0] in data1:
                # key first, then the file1 fields, then the file2 fields
                wtr.writerow(line[:1] + data1[line[0]] + line[1:])
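For completeness, here is a self-contained sketch of the same idea that can be run as-is. The function name combine and the in-memory io.StringIO "files" are placeholders for illustration; it assumes space-separated input with the key in the first column.

```python
import csv
import io

def combine(f1, f2, out, delimiter=" "):
    # Map each key in the first file to the rest of its row.
    data1 = {row[0]: row[1:]
             for row in csv.reader(f1, delimiter=delimiter) if row}
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    # Stream the second file so every repeated key produces its own row.
    for row in csv.reader(f2, delimiter=delimiter):
        if row and row[0] in data1:
            writer.writerow(row[:1] + data1[row[0]] + row[1:])

f1 = io.StringIO("test1 ba ab cd dh gf\ntest2 fa ab cd dh gf\n"
                 "test3 rt ty er wq ee\ntest4 er rt sf sd sa\n")
f2 = io.StringIO("test1 123 344 123\ntest1 234 567 787\n"
                 "test1 221 344 566\ntest3 456 121 677\n")
out = io.StringIO()
combine(f1, f2, out)
print(out.getvalue())
```

Because only the first file is held in a dictionary, duplicate keys in the second file are no longer overwritten, which is what dropped the extra test1 rows in the original code.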