What's the most efficient way to drop only consecutive duplicates in pandas?
drop_duplicates gives this:
    In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

    In [4]: a.drop_duplicates()
    Out[4]:
    1    1
    2    2
    4    3
    dtype: int64
But I want this:
    In [4]: a.something()
    Out[4]:
    1    1
    2    2
    4    3
    5    2
    dtype: int64
To drop consecutive duplicates with pandas, we can use shift to check whether each value differs from the adjacent one.
To remove duplicates based on specific column(s), pass subset to drop_duplicates. To keep the last occurrence of each duplicate instead of the first, pass keep='last'.
By default, when you concatenate two DataFrames that contain duplicate records, pandas combines them without removing the duplicate rows.
Use shift:

    In [3]: a.loc[a.shift(-1) != a]
    Out[3]:
    1    1
    3    2
    4    3
    5    2
    dtype: int64
The above uses a boolean criterion: we compare the Series against itself shifted by -1 rows to create the mask.
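To make the mask easier to follow, here is a minimal sketch that builds it step by step on the Series from the question. The intermediate names `shifted`, `mask` and `result` are only illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# shift(-1) moves every value up one row, so each position holds the
# value of the NEXT row; the last position becomes NaN.
shifted = a.shift(-1)

# True wherever the next value differs from the current one.
# NaN never compares equal, so the last row is always kept.
mask = shifted != a

result = a.loc[mask]

# Side-by-side view of the original values, the shifted values and the mask.
print(pd.DataFrame({"a": a, "shifted": shifted, "mask": mask}))
print(result)
```

Because the comparison looks at the next row, this mask keeps the last element of each run of equal values.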
Another method is to use diff:

    In [82]: a.loc[a.diff() != 0]
    Out[82]:
    1    1
    2    2
    4    3
    5    2
    dtype: int64
But this is slower than the original method if you have a large number of rows.
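If you want to check the speed difference on your own data, a rough timing sketch could look like the following. The helper names `with_shift` and `with_diff`, the random test data and the repeat count are assumptions made for illustration, and the shift variant uses the default period of 1 so that both masks select the same rows. Actual timings depend on your pandas version and hardware, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 200,000 small integers, so runs of repeats are common.
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 3, size=200_000))


def with_shift(series):
    # Keep rows whose value differs from the previous row.
    return series.loc[series.shift() != series]


def with_diff(series):
    # Keep rows whose difference from the previous row is non-zero.
    # Only works for data that supports subtraction (e.g. numbers).
    return series.loc[series.diff() != 0]


# Both approaches should select exactly the same rows on numeric data.
same = with_shift(s).equals(with_diff(s))
print("same result:", same)

print("shift:", timeit.timeit(lambda: with_shift(s), number=5))
print("diff: ", timeit.timeit(lambda: with_diff(s), number=5))
```

Note that diff relies on subtraction, so it is only an option for numeric-like data, whereas the shift comparison also works for strings and other objects.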
Update
Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() since the default period is 1. This returns the first value of each consecutive run:
    In [87]: a.loc[a.shift() != a]
    Out[87]:
    1    1
    2    2
    4    3
    5    2
    dtype: int64
Note the difference in index values, thanks @BjarkeEbert!
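As a small sketch of how this could be wrapped for reuse, the function below applies the shift() mask and is compared against the shift(-1) variant. The name `drop_consecutive_duplicates` is hypothetical, not a pandas API.

```python
import pandas as pd


def drop_consecutive_duplicates(s: pd.Series) -> pd.Series:
    """Keep the first element of every run of equal consecutive values."""
    # shift() moves values down one row, so each row is compared
    # with the row before it; the first row is always kept.
    return s.loc[s.shift() != s]


a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

first_of_run = drop_consecutive_duplicates(a)  # keeps index 1, 2, 4, 5
last_of_run = a.loc[a.shift(-1) != a]          # keeps index 1, 3, 4, 5

print(first_of_run)
print(last_of_run)

# The same mask also works for non-numeric data.
words = pd.Series(["x", "x", "y", "x"])
deduped_words = drop_consecutive_duplicates(words)
print(deduped_words.tolist())
```

Both variants return the same values here and differ only in which index label survives from each run. One caveat: NaN never compares equal to NaN, so a run of consecutive missing values is not collapsed by this mask.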