I have a dataset with potentially duplicate records of the identifier appkey. The duplicated records should ideally not exist, so I take them to be data-collection mistakes. I need to drop all instances of an appkey which occurs more than once.
The drop_duplicates method does not seem useful here (or is it?), as it only keeps either the first or the last of the duplicates. Is there an obvious idiom to achieve this with pandas?
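To make the goal concrete, here is a minimal illustration with made-up data (assuming import pandas as pd). appkey 1 is duplicated, so both of its rows should be dropped, not just one:

In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['appkey', 'B'])

In [2]: df.drop_duplicates(subset='appkey')  # default keep='first' retains one copy -- not what I want
Out[2]:
   appkey  B
0       1  2
2       5  6

The desired result keeps only the row with appkey 5.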
As of pandas version 0.12, we have filter for this. It does exactly what @Andy's solution does using transform, but a little more succinctly and somewhat faster. Note that the function passed to filter must return a single boolean per group, so use len(x) (the group's row count) rather than x.count(), which returns one count per column:

df.groupby('AppKey').filter(lambda x: len(x) == 1)
To steal @Andy's example,
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['AppKey', 'B'])
In [2]: df.groupby('AppKey').filter(lambda x: len(x) == 1)
Out[2]:
   AppKey  B
2       5  6
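For reference, @Andy's transform approach builds a boolean mask instead; a sketch of what I believe the equivalent looks like:

In [3]: df[df.groupby('AppKey')['AppKey'].transform('count') == 1]
Out[3]:
   AppKey  B
2       5  6

And if you are on a newer pandas (0.17 or later, if I recall the changelog correctly), drop_duplicates itself grew a keep=False option that drops every member of a duplicate group, answering the "(or is it?)" in the question:

In [4]: df.drop_duplicates(subset='AppKey', keep=False)
Out[4]:
   AppKey  B
2       5  6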