Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to find the complement of two dataframes

given two large dataframes, is there any concise and efficient code (avoid using any for loop directly) that allow me to obtain the complement of these two dataframes?

the most straight forward way to me is to compute union-intersection as shown in the naive example below, but I do not know how to implement this in an elegant languages of pandas or np

df1= pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})     
df2= pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})        
intersection= pd.merge(df1, df2, how='inner',on=['key1', 'key2'])
union=pd.merge(df1, df2, how='outer',on=['key1', 'key2'])       


complement=union-intersection

thanks for any comments and answers

like image 783
user6651227 Avatar asked Aug 11 '16 21:08

user6651227


People also ask

How do you compare two DataFrames and find the difference?

By using equals() function we can directly check if df1 is equal to df2. This function is used to determine if two dataframe objects in consideration are equal or not. Unlike dataframe. eq() method, the result of the operation is a scalar boolean value indicating if the dataframe objects are equal or not.

How do you compare two DataFrames equal?

DataFrame - equals() function The equals() function is used to test whether two objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.


1 Answers

Starting with this:

df1= pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                   'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})     
df2= pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})        
intersection  = pd.merge(df1, df2, how='inner',on=['key1', 'key2'])
union         = pd.merge(df1, df2, how='outer',on=['key1', 'key2'])       

print union

     A    B key1 key2    C    D
0   A0   B0   K0   K0   C0   D0
1   A1   B1   K0   K1  NaN  NaN
2   A2   B2   K1   K0   C1   D1
3   A2   B2   K1   K0   C2   D2
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3

print intersection

    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
1  A2  B2   K1   K0  C1  D1
2  A2  B2   K1   K0  C2  D2

union-intersection try this:

union[union.isnull().any(axis=1)]

     A    B key1 key2    C    D
1   A1   B1   K0   K1  NaN  NaN
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3
like image 182
Merlin Avatar answered Oct 04 '22 19:10

Merlin