
Finding duplicates in two dataframes and removing the duplicates from one dataframe

Working in Python / pandas / dataframes

I have these two dataframes:

Dataframe one:

          1          2    3 
 1   Stockholm     100    250
 2   Stockholm     150    376
 3   Stockholm     105    235
 4   Stockholm     109    104
 5   Burnley       145    234
 6   Burnley       100    250

Dataframe two:

          1          2    3 
 1   Stockholm     100    250
 2   Stockholm     117    128
 3   Stockholm     105    235
 4   Stockholm     100    250
 5   Burnley       145    234
 6   Burnley       100    953

I would like to find the rows that appear in both Dataframe one and Dataframe two and remove them from Dataframe one. For example, rows 1, 3, and 5 of Dataframe one also appear in Dataframe two, so they would be removed from Dataframe one, leaving the result below:

     1           2       3 
1    Stockholm   150     376
2    Stockholm   109     104
3    Burnley     100     250
Tom Benson asked Mar 29 '26 15:03



1 Answer

Use:

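import pandas as pd

df_merge = pd.merge(df1, df2, on=[1, 2, 3], how='inner')  # rows present in both dataframes
df1 = pd.concat([df1, df_merge])  # pd.concat replaces DataFrame.append, which was removed in pandas 2.0

df1['Duplicated'] = df1.duplicated(keep=False)  # keep=False marks every copy of a duplicated row as True
df_final = df1[~df1['Duplicated']]  # select only the rows that are not duplicated
df_final = df_final.drop(columns='Duplicated')  # drop the indicator column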

The idea is as follows:

  1. Do an inner join on all the columns.
  2. Append the output of the inner join to df1.
  3. Identify the duplicated rows in df1.
  4. Select the rows in df1 that are not duplicated.

Each numbered step corresponds, in order, to one of the statements in the code; removing the helper Duplicated column is the final cleanup.
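
To make the steps concrete, here is a minimal, self-contained sketch that runs them on the sample data from the question. It assumes the column labels are the integers 1, 2, 3. The function names remove_by_append and remove_by_anti_join are made up for this illustration, and the second function is an alternative that is not part of the answer above: a left merge with indicator=True that keeps only the rows marked left_only.

```python
import pandas as pd

# Sample data copied from the question; column labels 1, 2, 3 are assumed to be integers.
df1 = pd.DataFrame({
    1: ["Stockholm", "Stockholm", "Stockholm", "Stockholm", "Burnley", "Burnley"],
    2: [100, 150, 105, 109, 145, 100],
    3: [250, 376, 235, 104, 234, 250],
})
df2 = pd.DataFrame({
    1: ["Stockholm", "Stockholm", "Stockholm", "Stockholm", "Burnley", "Burnley"],
    2: [100, 117, 105, 100, 145, 100],
    3: [250, 128, 235, 250, 234, 953],
})


def remove_by_append(left, right):
    """The four steps from the answer, with pd.concat doing the append."""
    cols = list(left.columns)
    common = pd.merge(left, right, on=cols, how="inner")      # step 1: inner join
    stacked = pd.concat([left, common], ignore_index=True)    # step 2: append
    is_dup = stacked.duplicated(keep=False)                   # step 3: flag duplicates
    return stacked[~is_dup].reset_index(drop=True)            # step 4: keep the rest


def remove_by_anti_join(left, right):
    """Alternative sketch (not from the answer): left merge plus indicator."""
    cols = list(left.columns)
    # drop_duplicates on the right side stops a repeated row from multiplying matches
    merged = left.merge(right.drop_duplicates(), on=cols, how="left", indicator=True)
    keep = merged["_merge"] == "left_only"                    # rows found only in left
    return merged.loc[keep, cols].reset_index(drop=True)


result_append = remove_by_append(df1, df2)
result_anti = remove_by_anti_join(df1, df2)
print(result_anti)
```

Both functions leave the three rows the question asks for. One difference to be aware of: the append approach also discards rows that are duplicated within df1 itself, even when they do not appear in df2, whereas the indicator approach keeps them.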

Ji Wei answered Mar 31 '26 03:03



