I have multiple big dataframes and I need to get the difference between them, according to specific columns, while keeping the data of the other columns. for example, if I have 2 dataframes as below.
df1:
num Ref Alt DP NL
0 300 C A 50.0 30.0
1 500 C A 45.0 90.0
2 255 A C 69.0 26.0
3 450 B B 55.0 25.0
df2:
num Ref Alt DP NL
0 300 C A 71.0 88.0
1 500 A A 45.0 54.0
2 255 A A 55.0 55.0
3 450 B B 70.0 15.0
I'd like to get the difference between those dataframes according to specific columns ['num','Ref','Alt'] while keeping the data form the remaining columns ['DP', 'NL']. In the example, rows 0 and 3 are the same across the 2 dataframes (since ['num','Ref','Alt'] are the same), so the output dataframes should look something like that:
df1:
num Ref Alt DP NL
0 500 C A 45.0 90.0
2 255 A C 69.0 26.0
df2:
num Ref Alt DP NL
1 500 A A 45.0 54.0
2 255 A A 55.0 55.0
or get it in the same dataframe, if possible...
I've tried to think of few solutions,one is concatenating them and then remove the duplicates, but beacuse they are bif dataframes, it might cause problems,or looping through the dataframes and use the below script:
temp_df = df_list[0]
for df in df_list[1:]:
df_filtered_for_diff = pd.merge(temp_df, df, on=['num', 'Ref', 'Alt'], how='outer')
temp_df = df_filtered_for_diff
return (temp_df)
but I was wondering if there are better solutions?
You can concatenate, then drop duplicates with keep=False:
df = pd.concat([df1, df2])\
.drop_duplicates(subset=['num', 'Ref', 'Alt'], keep=False)
print(df)
num Ref Alt DP NL
1 500 C A 45.0 90.0
2 255 A C 69.0 26.0
1 500 A A 45.0 54.0
2 255 A A 55.0 55.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With