Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Union of two pandas DataFrames

Say I have two data frames:

df1:

  A
0 a
1 b

df2:

  A
0 a
1 c

I want the result to be the union of the two frames with an extra column showing the source data frame that the row belongs to. In case of duplicates, duplicates should be removed and the respective extra column should show both sources:

  A  B
0 a  df1, df2
1 b  df1
2 c  df2

I can get the concatenated data frame (df3) without duplicates as follows:

import pandas as pd
df3=pd.concat([df1,df2],ignore_index=True).drop_duplicates().reset_index(drop=True)

I can't think of/find a method to have control over what element goes where. How can I add the extra column?

Thank you very much for any tips.

like image 676
Leon Rai Avatar asked Jan 22 '19 19:01

Leon Rai


2 Answers

Merge with an indicator argument, and remap the result:

m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}

result = df1.merge(df2, on=['A'], how='outer', indicator='B')
result['B'] = result['B'].map(m)

result
   A         B
0  a  df1, df2
1  b       df1
2  c       df2
like image 152
cs95 Avatar answered Sep 27 '22 23:09

cs95


We use outer join to solve this -

df1 = pd.DataFrame({'A':['a','b']})
df2 = pd.DataFrame({'A':['a','c']})
df1['col1']='df1'
df2['col2']='df2'
df=pd.merge(df1, df2, on=['A'], how="outer").fillna('')
df['B']=df['col1']+','+df['col2']
df['B'] = df['B'].str.strip(',')
df=df[['A','B']]
df

   A        B
0  a  df1,df2
1  b      df1
2  c      df2
like image 36
cph_sto Avatar answered Sep 27 '22 23:09

cph_sto