Considering I have 2 dataframes as shown below (DF1 and DF2), I need to compare DF2 with DF1 such that I can identify all the Matching, Different, Missing values for all the columns in DF2 that match columns in DF1 (Col1, Col2 & Col3 in this case) for rows with same EID value (A, B, C & D). I do not wish to iterate on each row of a dataframe as it can be time-consuming.
Note: There can around 70 - 100 columns. This is just a sample dataframe I am using.
DF1
EID Col1 Col2 Col3 Col4
0 A a1 b1 c1 d1
1 B a2 b2 c2 d2
2 C None b3 c3 d3
3 D a4 b4 c4 d4
4 G a5 b5 c5 d5
DF2
EID Col1 Col2 Col3
0 A a1 b1 c1
1 B a2 b2 c9
2 C a3 b3 c3
3 D a4 b4 None
Expected output dataframe
EID Col1 Col2 Col3 New_Col
0 A a1 b1 c1 Match
1 B a2 b2 c2 Different
2 C None b3 c3 Missing in DF1
3 D a4 b4 c4 Missing in DF2
Firstly, you will need to filter df1 based on df2.
new_df = df1.loc[df1['EID'].isin(df2['EID']), df2.columns]
EID Col1 Col2 Col3
0 A a1 b1 c1
1 B a2 b2 c2
2 C None b3 c3
3 D a4 b4 c4
Next, since you have a big dataframe to compare, you can change both the new_df and df2 to numpy arrays.
array1 = new_df.to_numpy()
array2 = df2.to_numpy()
Now you can compare it row-wise using np.where
new_df['New Col'] = np.where((array1 == array2).all(axis=1),'Match', 'Different')
EID Col1 Col2 Col3 New Col
0 A a1 b1 c1 Match
1 B a2 b2 c2 Different
2 C None b3 c3 Different
3 D a4 b4 c4 Different
Finally, to convert the row with None value, you can use df.loc and df.isnull
new_df.loc[new_df.isnull().any(axis=1), ['New Col']] = 'Missing in DF1'
new_df.loc[df2.isnull().any(axis=1), ['New Col']] = 'Missing in DF2'
EID Col1 Col2 Col3 New Col
0 A a1 b1 c1 Match
1 B a2 b2 c2 Different
2 C None b3 c3 Missing in DF1
3 D a4 b4 c4 Missing in DF2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With