Merging data based on matching first column in Python

I currently have two sets of data files that look like this:

File 1:

test1 ba ab cd dh gf
test2 fa ab cd dh gf
test3 rt ty er wq ee
test4 er rt sf sd sa

and in file 2:

test1 123 344 123
test1 234 567 787
test1 221 344 566
test3 456 121 677

I would like to combine the files based on matching values in the first column (so that the "tests" match up), like so:

test1 ba ab cd dh gf 123 344 123
test1 ba ab cd dh gf 234 567 787
test1 ba ab cd dh gf 221 344 566
test3 rt ty er wq ee 456 121 677

I have this code:

def combineFiles(file1,file2,outfile):
      def read_file(file):
         data = {}
         for line in csv.reader(file):
            data[line[0]] = line[1:]
         return data
      with open(file1, 'r') as f1, open(file2, 'r') as f2:
         data1 = read_file(f1)
         data2 = read_file(f2)
         with open(outfile, 'w') as out:
            wtr= csv.writer(out)
            for key in data1.keys():
               try:
                  wtr.writerow(((key), ','.join(data1[key]), ','.join(data2[key])))
               except KeyError:
                  pass

However, the output ends up looking like this:

test1 ba ab cd dh gf 123 344 123
test3 er rt sf sd sa 456 121 677

Can anyone help me with how to make the output so that test1 can be printed all three times?

Much appreciated.

You might want to give the Pandas library a try; it makes cases like this easy:

>>> import pandas as pd
>>> pd.merge(df1, df2, on='testnum', how='inner')
  testnum 1_x 2_x 3_x   4   5  1_y  2_y  3_y
0   test1  ba  ab  cd  dh  gf  123  344  123
1   test1  ba  ab  cd  dh  gf  234  567  787
2   test1  ba  ab  cd  dh  gf  221  344  566
3   test3  rt  ty  er  wq  ee  456  121  677

This assumes the first (test) column is named "testnum" in both DataFrames, which look like this:

>>> df1
  testnum   1   2   3   4   5
0   test1  ba  ab  cd  dh  gf
1   test2  fa  ab  cd  dh  gf
2   test3  rt  ty  er  wq  ee
3   test4  er  rt  sf  sd  sa

>>> df2
  testnum    1    2    3
0   test1  123  344  123
1   test1  234  567  787
2   test1  221  344  566
3   test3  456  121  677

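You'd read these in with pd.read_csv(), passing a whitespace separator and explicit column names, since the files are space-delimited and have no header row.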
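As a rough sketch of the whole round trip, assuming the files are whitespace-separated with no header row: the column names below (`a` to `e`, `x` to `z`) are made up for illustration and chosen so they do not collide, which avoids the `_x`/`_y` suffixes. `io.StringIO` stands in for the real files; in practice you would pass the file paths instead.

```python
import io

import pandas as pd

# Stand-ins for the two files from the question (replace with file paths).
file1 = io.StringIO(
    "test1 ba ab cd dh gf\n"
    "test2 fa ab cd dh gf\n"
    "test3 rt ty er wq ee\n"
    "test4 er rt sf sd sa\n"
)
file2 = io.StringIO(
    "test1 123 344 123\n"
    "test1 234 567 787\n"
    "test1 221 344 566\n"
    "test3 456 121 677\n"
)

# No header row and whitespace-separated fields; the column names are assumptions.
df1 = pd.read_csv(file1, sep=r"\s+", header=None,
                  names=["testnum", "a", "b", "c", "d", "e"])
df2 = pd.read_csv(file2, sep=r"\s+", header=None,
                  names=["testnum", "x", "y", "z"])

# Inner join: keep only keys present in both frames, one row per match.
merged = pd.merge(df1, df2, on="testnum", how="inner")

# Render in the same space-separated layout as the inputs.
output_text = merged.to_csv(sep=" ", header=False, index=False)
print(output_text)
```

To write straight to disk, pass a path as the first argument to `to_csv` instead of capturing the returned string.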

While I would recommend Brad Solomon's approach as it's pretty succinct, you just need a small change in your code.

Since your second file is the one that has the "final say", you just need to create a dictionary for the first file. Then you can write the output file as you read from the second file, fetching values from the data1 dictionary as you go:

# read_file must also pass delimiter=' ' to csv.reader, since the files are space-separated
with open(file1, 'r') as f1, open(file2, 'r') as f2:
    data1 = read_file(f1)
    with open(outfile, 'w') as out:
        wtr = csv.writer(out, delimiter=' ')
        for line in csv.reader(f2, delimiter=' '):
            # only write if there is a corresponding line in file1
            if line[0] in data1:
                # key first, then the file1 fields, then the file2 fields
                wtr.writerow(line[:1] + data1[line[0]] + line[1:])
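Put together as one function, the approach might look like the sketch below. It assumes space-separated input and that keys are unique in the first file; the name `combine_files` and the blank-line guard are additions of this sketch, not part of the original code.

```python
import csv


def combine_files(file1, file2, outfile):
    """Append each file2 row's fields to the file1 row with the same first column."""
    # Index file1 by its first column (keys are assumed unique in file1).
    with open(file1, newline="") as f1:
        data1 = {row[0]: row[1:]
                 for row in csv.reader(f1, delimiter=" ") if row}

    # Stream file2 so that every repeated key yields its own output row.
    with open(file2, newline="") as f2, \
            open(outfile, "w", newline="") as out:
        writer = csv.writer(out, delimiter=" ")
        for row in csv.reader(f2, delimiter=" "):
            # Skip blank lines and keys that file1 does not contain.
            if row and row[0] in data1:
                writer.writerow([row[0]] + data1[row[0]] + row[1:])
```

The original version lost the repeated test1 rows because both files were loaded into dictionaries, where a later row with the same key overwrites the earlier one. Iterating over the second file avoids that.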

Tags: python, python-3.x, python-2.7

Asked by Rk_23. 2 Answers, by Brad Solomon and slider.