Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merge DataFrames in Python without duplicating columns

I am trying to merge several DataFrames based on a common column. This will be done in a loop and the original DataFrame may not have all of the columns so an outer merge will be necessary. However when I do this over several different DataFrames columns duplicate with suffix _x and _y. I am looking for one DataFrame where the data is filled in and columns are added only if they did not previously exists.

df1=pd.DataFrame({'Company Name':['A','B','C','D'],'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
      Company Name  Data1  Data2
0            A      1     13
1            B     34     54
2            C     23   5354
3            D     66    443

A second DataFrame with additional information for some of the companies:

pd.DataFrame({'Company Name':['A','B'],'Address':  ['str1', 'str2'], 'Phone': ['str1a', 'str2a']})

  Company Name Address  Phone
0            A    str1  str1a
1            B    str2  str2a

If I wanted to combine these two it will successfully merge into one using on=Column:

df1=pd.merge(df1,df2, on='Company Name', how='outer')

  Company Name  Data1  Data2 Address  Phone
0            A      1     13    str1  str1a
1            B     34     54    str2  str2a
2            C     23   5354     NaN    NaN
3            D     66    443     NaN    NaN

However if I were to do this same command again in a loop, or if I were to merge with another DataFrame with other company information I end up getting duplicate columns similar to the following:

df1=pd.merge(df1,pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']}), on='Company Name', how='outer')
  Company Name  Data1  Data2 Address_x Phone_x Address_y Phone_y
0            A      1     13      str1   str1a       NaN     NaN
1            B     34     54      str2   str2a       NaN     NaN
2            C     23   5354       NaN     NaN      str3   str3a
3            D     66    443       NaN     NaN       NaN     NaN

When what I really want is one DataFrame with the same columns, just populating any missing data.

  Company Name  Data1  Data2 Address  Phone
0            A      1     13    str1  str1a
1            B     34     54    str2  str2a
2            C     23   5354    str3  str3a
3            D     66    443     NaN    NaN

Thanks in advance. I have reviewed the previous questions asked here on duplicate columns as well as a review of the Pandas documentation with out any progress.

like image 223
Epic_Test Avatar asked Nov 07 '22 22:11

Epic_Test


1 Answers

As you look for merging one dataframe at the time in a loop for, here is a way you can do it, that the new dataframe has new company name or not, new column or not:

df1 = pd.DataFrame({'Company Name':['A','B','C','D'],
                    'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
list_dfo = [pd.DataFrame({'Company Name':['A','B'],
                          'Address':  ['str1', 'str2'], 'Phone': ['str1a', 'str2a']}),
            pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']})]

for df_other in list_dfo:
    df1 = pd.merge(df1,df_other,how='outer').groupby('Company Name').first().reset_index()
    # and other code

At the end in this example:

print(df1)
 Company Name  Data1   Data2 Address  Phone
0            A    1.0    13.0    str1  str1a
1            B   34.0    54.0    str2  str2a
2            C   23.0  5354.0    str3  str3a
3            D   66.0   443.0     NaN    NaN

Instead of first, you can use last, which would keep the last valid value and not the first in each column per group, it depends on what data you need, the one from df1 or the one from df_other if available. In the example above, it does not change anything, but in the following case you will see:

#company A has a new address
df4 = pd.DataFrame({'Company Name':['A'],'Address':['new_str1']})

#first keep the value from df1
print(pd.merge(df1,df4,how='outer').groupby('Company Name')
        .first().reset_index())
Out[21]: 
  Company Name  Data1   Data2 Address  Phone
0            A    1.0    13.0    str1  str1a   #address is str1 from df1
1            B   34.0    54.0    str2  str2a
2            C   23.0  5354.0    str3  str3a
3            D   66.0   443.0     NaN    NaN

#while last keep the value from df4
print (pd.merge(df1,df4,how='outer').groupby('Company Name')
         .last().reset_index())
Out[22]: 
  Company Name  Data1   Data2   Address  Phone
0            A    1.0    13.0  new_str1  str1a   #address is new_str1 from df4
1            B   34.0    54.0      str2  str2a
2            C   23.0  5354.0      str3  str3a
3            D   66.0   443.0       NaN    NaN
like image 98
Ben.T Avatar answered Nov 14 '22 21:11

Ben.T