Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python - Delete duplicates in a dataframe based on two columns combinations?

Tags:

I have a dataframe with 3 columns in Python:

Name1 Name2 Value
Juan  Ale   1
Ale   Juan  1

and would like to eliminate the duplicates based on columns Name1 and Name2 combinations.

In my example both rows are equal (but they are in different order), and I would like to delete the second row and just keep the first one, so the end result should be:

Name1 Name2 Value
Juan  Ale   1

Any idea will be really appreciated!

like image 692
Juan Avatar asked Jul 05 '18 01:07

Juan


People also ask

How do I delete duplicate rows based on two columns in pandas?

By using pandas. DataFrame. drop_duplicates() method you can remove duplicate rows from DataFrame. Using this method you can drop duplicate rows on selected multiple columns or all columns.

How do you get rid of duplicates in combinations?

Remove duplicate combinations – TutorialSelect your data. Click on Data > Remove Duplicates button.


2 Answers

By using np.sort with duplicated

df[pd.DataFrame(np.sort(df[['Name1','Name2']].values,1)).duplicated()]
Out[614]: 
  Name1 Name2  Value
1   Ale  Juan      1

Performance

df=pd.concat([df]*100000)

%timeit df[pd.DataFrame(np.sort(df[['Name1','Name2']].values,1)).duplicated()]
10 loops, best of 3: 69.3 ms per loop
%timeit df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]
1 loop, best of 3: 3.72 s per loop
like image 104
BENY Avatar answered Sep 17 '22 12:09

BENY


You can convert to frozenset and use pd.DataFrame.duplicated.

res = df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]

print(res)

  Name1 Name2  Value
0  Juan   Ale      1

frozenset is necessary instead of set since duplicated uses hashing to check for duplicates.

Scales better with columns than rows. For a large number of rows, use @Wen's sort-based algorithm.

like image 31
jpp Avatar answered Sep 19 '22 12:09

jpp