
Merging data based on matching first column in Python

I currently have two sets of data files that look like this:

File 1:

test1 ba ab cd dh gf
test2 fa ab cd dh gf
test3 rt ty er wq ee
test4 er rt sf sd sa

and in file 2:

test1 123 344 123
test1 234 567 787
test1 221 344 566
test3 456 121 677

I would like to combine the files based on matching values in the first column (so that the "tests" line up), like so:

test1 ba ab cd dh gf 123 344 123
test1 ba ab cd dh gf 234 567 787
test1 ba ab cd dh gf 221 344 566
test3 rt ty er wq ee 456 121 677

I have this code:

import csv

def combineFiles(file1, file2, outfile):
    def read_file(file):
        # Map each row's first column to the remaining columns
        data = {}
        for line in csv.reader(file):
            data[line[0]] = line[1:]
        return data

    with open(file1, 'r') as f1, open(file2, 'r') as f2:
        data1 = read_file(f1)
        data2 = read_file(f2)
        with open(outfile, 'w') as out:
            wtr = csv.writer(out)
            for key in data1.keys():
                try:
                    wtr.writerow(((key), ','.join(data1[key]), ','.join(data2[key])))
                except KeyError:
                    pass

However, the output ends up looking like this:

test1 ba ab cd dh gf 123 344 123
test3 er rt sf sd sa 456 121 677

How can I change this so that test1 appears in the output all three times?

Much appreciated.

Rk_23 asked Nov 14 '18 02:11



2 Answers

You might want to give the Pandas library a try; it makes cases like this easy:

>>> import pandas as pd
>>> pd.merge(df1, df2, on='testnum', how='inner')
  testnum 1_x 2_x 3_x   4   5  1_y  2_y  3_y
0   test1  ba  ab  cd  dh  gf  123  344  123
1   test1  ba  ab  cd  dh  gf  234  567  787
2   test1  ba  ab  cd  dh  gf  221  344  566
3   test3  rt  ty  er  wq  ee  456  121  677

This assumes the first column is named "testnum" in both DataFrames:

>>> df1
  testnum   1   2   3   4   5
0   test1  ba  ab  cd  dh  gf
1   test2  fa  ab  cd  dh  gf
2   test3  rt  ty  er  wq  ee
3   test4  er  rt  sf  sd  sa

>>> df2
  testnum    1    2    3
0   test1  123  344  123
1   test1  234  567  787
2   test1  221  344  566
3   test3  456  121  677

You'd read these in with pd.read_csv(), passing sep=' ' because the files are space-delimited rather than comma-delimited.
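For reference, here is a minimal sketch of how df1 and df2 might be built and merged. The sample data is embedded with io.StringIO so the snippet runs on its own; with real files you would pass the file paths instead. The column names ("testnum", "1", "2", ...) are assumptions chosen to match the output shown above, since the original files have no header row.

```python
import io

import pandas as pd

# Sample data from the question, embedded so the sketch is self-contained.
file1_text = (
    "test1 ba ab cd dh gf\n"
    "test2 fa ab cd dh gf\n"
    "test3 rt ty er wq ee\n"
    "test4 er rt sf sd sa\n"
)
file2_text = (
    "test1 123 344 123\n"
    "test1 234 567 787\n"
    "test1 221 344 566\n"
    "test3 456 121 677\n"
)

# header=None because the files have no header row; names are assumed labels.
df1 = pd.read_csv(io.StringIO(file1_text), sep=" ", header=None,
                  names=["testnum", "1", "2", "3", "4", "5"])
df2 = pd.read_csv(io.StringIO(file2_text), sep=" ", header=None,
                  names=["testnum", "1", "2", "3"])

# An inner join keeps only keys present in both frames and repeats the
# df1 row once for every matching df2 row.
merged = pd.merge(df1, df2, on="testnum", how="inner")

# Render the result in the same space-delimited layout as the inputs.
output = merged.to_csv(sep=" ", header=False, index=False)
print(output)
```

Passing a file path as the first argument to to_csv() would write the combined rows to disk instead of returning them as a string.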

Brad Solomon answered Oct 10 '22 09:10


While I would recommend Brad Solomon's approach as it's pretty succinct, you just need a small change in your code.

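The rows go missing because read_file stores file 2 in a dictionary keyed on the first column, so each later test1 row overwrites the previous one. Since your second file is the one that has the "final say", you only need a dictionary for the first file. Then you can write the output file as you read from the second file, fetching values from the data1 dictionary as you go: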

with open(file1, 'r') as f1, open(file2, 'r') as f2:
    # read_file must also pass delimiter=' ' to csv.reader for space-delimited input
    data1 = read_file(f1)
    with open(outfile, 'w') as out:
        wtr = csv.writer(out, delimiter=' ')
        for line in csv.reader(f2, delimiter=' '):
            # only write if there is a corresponding line in file1
            if line[0] in data1:
                # key first, then the file1 columns, then the file2 columns
                wtr.writerow(line[:1] + data1[line[0]] + line[1:])
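
Putting the pieces together, a self-contained version of this approach might look like the sketch below. The name combine_files is a hypothetical, snake_case stand-in for the asker's combineFiles, and the code assumes both files are space-delimited and that each key appears at most once in file 1.

```python
import csv


def combine_files(file1, file2, outfile):
    """Join each row of file2 to the file1 row sharing its first column."""
    # Index file1 by its first column. Keys are assumed unique in file1;
    # if one repeats, the last occurrence wins.
    with open(file1, newline='') as f1:
        lookup = {row[0]: row[1:]
                  for row in csv.reader(f1, delimiter=' ') if row}

    # Stream file2 so that repeated keys (e.g. test1) each produce a row.
    with open(file2, newline='') as f2, \
            open(outfile, 'w', newline='') as out:
        writer = csv.writer(out, delimiter=' ')
        for row in csv.reader(f2, delimiter=' '):
            # Skip blank lines and keys that file1 does not contain.
            if row and row[0] in lookup:
                writer.writerow(row[:1] + lookup[row[0]] + row[1:])
```

Because only file 1 is held in memory, file 2 can be arbitrarily long, and the output follows the row order of file 2.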
slider answered Oct 10 '22 10:10