Get the difference between dataframes according to specific columns pandas

Question

I have multiple big dataframes and I need to get the difference between them, according to specific columns, while keeping the data of the other columns. for example, if I have 2 dataframes as below.

df1:

   num    Ref   Alt    DP     NL
0  300     C     A    50.0   30.0  
1  500     C     A    45.0   90.0 
2  255     A     C    69.0   26.0
3  450     B     B    55.0   25.0

df2:

   num    Ref   Alt    DP     NL
0  300     C     A    71.0   88.0  
1  500     A     A    45.0   54.0   
2  255     A     A    55.0   55.0
3  450     B     B    70.0   15.0

I'd like to get the difference between those dataframes according to specific columns ['num','Ref','Alt'] while keeping the data form the remaining columns ['DP', 'NL']. In the example, rows 0 and 3 are the same across the 2 dataframes (since ['num','Ref','Alt'] are the same), so the output dataframes should look something like that:

df1:

   num    Ref   Alt     DP     NL
0  500     C     A     45.0   90.0   
2  255     A     C     69.0   26.0

df2:

   num    Ref   Alt    DP     NL
1  500     A     A    45.0   54.0   
2  255     A     A    55.0   55.0

or get it in the same dataframe, if possible...

I've tried to think of few solutions,one is concatenating them and then remove the duplicates, but beacuse they are bif dataframes, it might cause problems,or looping through the dataframes and use the below script:

temp_df = df_list[0]

for df in df_list[1:]:

    df_filtered_for_diff = pd.merge(temp_df, df, on=['num', 'Ref', 'Alt'], how='outer')                                                

    temp_df = df_filtered_for_diff

return (temp_df)

but I was wondering if there are better solutions?

jpp · Accepted Answer

You can concatenate, then drop duplicates with keep=False:

df = pd.concat([df1, df2])\
       .drop_duplicates(subset=['num', 'Ref', 'Alt'], keep=False)

print(df)

   num Ref Alt    DP    NL
1  500   C   A  45.0  90.0
2  255   A   C  69.0  26.0
1  500   A   A  45.0  54.0
2  255   A   A  55.0  55.0

Get the difference between dataframes according to specific columns pandas

Tags:

python

python-3.x

pandas

dataframe

Bella

1 Answers

jpp

Recent Activity

Donate For Us

Get the difference between dataframes according to specific columns pandas

Tags:

python

python-3.x

pandas

dataframe

Bella

1 Answers

jpp

Related questions

Recent Activity

Donate For Us