
Pandas: Drop all records of duplicate indices

I have a dataset with potentially duplicate records of the identifier appkey. The duplicated records should ideally not exist and therefore I take them to be data collection mistakes. I need to drop all instances of an appkey which occurs more than once.

The drop_duplicates method is not useful in this case (or is it?) as it either selects the first or the last of the duplicates. Is there any obvious idiom to achieve this with pandas?

asked Sep 17 '13 13:09 by asb

1 Answer

As of pandas version 0.12, we have filter for this. It does exactly what @Andy's solution does using transform, but a little more succinctly and somewhat faster. Note that the function passed to filter must return a scalar boolean, so test the group's length:

df.groupby('AppKey').filter(lambda x: len(x) == 1)

To steal @Andy's example,

In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])

In [2]: df.groupby('AppKey').filter(lambda x: len(x) == 1)
Out[2]: 
   AppKey  B
2       5  6
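As a side note on the question's "or is it?" aside: in newer pandas releases, drop_duplicates and duplicated accept keep=False, which marks every member of a duplicated group rather than sparing the first or last occurrence. A minimal sketch, assuming a reasonably recent pandas and the same toy data as above:

```python
import pandas as pd

# Sample data mirroring the example above: AppKey 1 appears twice.
df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])

# groupby + filter, as in the answer (the lambda must return a scalar bool)
kept = df.groupby('AppKey').filter(lambda x: len(x) == 1)

# Equivalent vectorized form: keep=False flags ALL occurrences of a
# duplicated AppKey, so negating the mask drops every one of them.
kept_alt = df[~df['AppKey'].duplicated(keep=False)]
```

Both approaches leave only the row with AppKey 5; the duplicated-mask form avoids a Python-level callback per group, which matters on large frames.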
answered Oct 05 '22 19:10 by Dan Allan