I am trying to merge several DataFrames based on a common column. This will be done in a loop and the original DataFrame may not have all of the columns so an outer merge will be necessary. However when I do this over several different DataFrames columns duplicate with suffix _x and _y. I am looking for one DataFrame where the data is filled in and columns are added only if they did not previously exists.
df1=pd.DataFrame({'Company Name':['A','B','C','D'],'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
Company Name Data1 Data2
0 A 1 13
1 B 34 54
2 C 23 5354
3 D 66 443
A second DataFrame with additional information for some of the companies:
pd.DataFrame({'Company Name':['A','B'],'Address': ['str1', 'str2'], 'Phone': ['str1a', 'str2a']})
Company Name Address Phone
0 A str1 str1a
1 B str2 str2a
If I wanted to combine these two it will successfully merge into one using on=Column:
df1=pd.merge(df1,df2, on='Company Name', how='outer')
Company Name Data1 Data2 Address Phone
0 A 1 13 str1 str1a
1 B 34 54 str2 str2a
2 C 23 5354 NaN NaN
3 D 66 443 NaN NaN
However if I were to do this same command again in a loop, or if I were to merge with another DataFrame with other company information I end up getting duplicate columns similar to the following:
df1=pd.merge(df1,pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']}), on='Company Name', how='outer')
Company Name Data1 Data2 Address_x Phone_x Address_y Phone_y
0 A 1 13 str1 str1a NaN NaN
1 B 34 54 str2 str2a NaN NaN
2 C 23 5354 NaN NaN str3 str3a
3 D 66 443 NaN NaN NaN NaN
When what I really want is one DataFrame with the same columns, just populating any missing data.
Company Name Data1 Data2 Address Phone
0 A 1 13 str1 str1a
1 B 34 54 str2 str2a
2 C 23 5354 str3 str3a
3 D 66 443 NaN NaN
Thanks in advance. I have reviewed the previous questions asked here on duplicate columns as well as a review of the Pandas documentation with out any progress.
As you look for merging one dataframe at the time in a loop for, here is a way you can do it, that the new dataframe has new company name or not, new column or not:
df1 = pd.DataFrame({'Company Name':['A','B','C','D'],
'Data1':[1,34,23,66],'Data2':[13,54,5354,443]})
list_dfo = [pd.DataFrame({'Company Name':['A','B'],
'Address': ['str1', 'str2'], 'Phone': ['str1a', 'str2a']}),
pd.DataFrame({'Company Name':['C'],'Address':['str3'],'Phone':['str3a']})]
for df_other in list_dfo:
df1 = pd.merge(df1,df_other,how='outer').groupby('Company Name').first().reset_index()
# and other code
At the end in this example:
print(df1)
Company Name Data1 Data2 Address Phone
0 A 1.0 13.0 str1 str1a
1 B 34.0 54.0 str2 str2a
2 C 23.0 5354.0 str3 str3a
3 D 66.0 443.0 NaN NaN
Instead of first
, you can use last
, which would keep the last valid value and not the first in each column per group, it depends on what data you need, the one from df1
or the one from df_other
if available. In the example above, it does not change anything, but in the following case you will see:
#company A has a new address
df4 = pd.DataFrame({'Company Name':['A'],'Address':['new_str1']})
#first keep the value from df1
print(pd.merge(df1,df4,how='outer').groupby('Company Name')
.first().reset_index())
Out[21]:
Company Name Data1 Data2 Address Phone
0 A 1.0 13.0 str1 str1a #address is str1 from df1
1 B 34.0 54.0 str2 str2a
2 C 23.0 5354.0 str3 str3a
3 D 66.0 443.0 NaN NaN
#while last keep the value from df4
print (pd.merge(df1,df4,how='outer').groupby('Company Name')
.last().reset_index())
Out[22]:
Company Name Data1 Data2 Address Phone
0 A 1.0 13.0 new_str1 str1a #address is new_str1 from df4
1 B 34.0 54.0 str2 str2a
2 C 23.0 5354.0 str3 str3a
3 D 66.0 443.0 NaN NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With