Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging multiple pandas dataframes without duplicating columns

I have a large class of students separated into sections with each student havinga unique ID. I have the entire roster stored in a dataframe. I also have multiple dataframes representing the grades from a particular section of students on a particular assignment. I would like to merge all of this information into a single dataframe that represents my gradebook. For example:

import pandas as pd

# Initialize roster
data = [['ab10', 'Ann Big'], ['ca9', 'Carl Ahn'], ['jb19', 'John Brown'], ['cf25', 'Carol Fox']]
roster = pd.DataFrame(data, columns = ['ID', 'Name'])

# Initialize the section grades
data = [['ab10', 95], ['ca9', 72]]
grades0 = pd.DataFrame(data, columns = ['ID', 'Exp1'])

data = [['ab10', 83], ['ca9', 97]]
grades1 = pd.DataFrame(data, columns = ['ID', 'Exp2'])

data = [['jb19', 61], ['cf25', 95]]
grades2 = pd.DataFrame(data, columns = ['ID', 'Exp1'])

# Now merge the section grades with the roster to generate final gradebook
roster = roster.merge(grades0, on = 'ID', how = 'outer')
roster = roster.merge(grades1, on = 'ID', how = 'outer')
roster = roster.merge(grades2, on = 'ID', how = 'outer')

print(roster)

This code generates the following:

     ID        Name  Exp1_x  Exp2  Exp1_y
0  ab10     Ann Big    95.0  83.0     NaN
1   ca9    Carl Ahn    72.0  97.0     NaN
2  jb19  John Brown     NaN   NaN    61.0
3  cf25   Carol Fox     NaN   NaN    95.0

I don't want the duplicated Exp1 columns with the suffixes _x and _y. Instead I want:

     ID        Name    Exp1  Exp2
0  ab10     Ann Big    95.0  83.0 
1   ca9    Carl Ahn    72.0  97.0
2  jb19  John Brown    61.0   NaN
3  cf25   Carol Fox    95.0   NaN

There should be no duplicated data between the grade dataframes (but it would be good practice to raise an error were an overlap to exist).

like image 820
Melissa Avatar asked Apr 02 '26 11:04

Melissa


1 Answers

reduce with combine_first

As there are is no duplication between the grades dataframes, we can therefore reduce with combine_first to combine all the the dataframes together

from functools import reduce

reduce(pd.DataFrame.combine_first, 
      [g.set_index('ID') for g in (roster, grades0, grades1, grades2)])

      Exp1  Exp2        Name
ID                          
ab10  95.0  83.0     Ann Big
ca9   72.0  97.0    Carl Ahn
cf25  95.0   NaN   Carol Fox
jb19  61.0   NaN  John Brown
like image 78
Shubham Sharma Avatar answered Apr 08 '26 07:04

Shubham Sharma



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!