
Pandas: Drop consecutive duplicates

Tags: python, pandas

What's the most efficient way to drop only consecutive duplicates in pandas?

drop_duplicates gives this:

    In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

    In [4]: a.drop_duplicates()
    Out[4]:
    1    1
    2    2
    4    3
    dtype: int64

But I want this:

    In [4]: a.something()
    Out[4]:
    1    1
    2    2
    4    3
    5    2
    dtype: int64

Thomas Johnson asked Oct 19 '13 08:10




1 Answer

Use shift:

    a.loc[a.shift(-1) != a]

    Out[3]:
    1    1
    3    2
    4    3
    5    2
    dtype: int64

The above uses a boolean mask: we compare the Series against itself shifted by -1 rows, and keep only the rows where the two differ.
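To make the mask visible, here is a small sketch that splits the one-liner into its intermediate steps. The names `next_vals`, `mask` and `result` are only illustrative and do not appear in the original answer.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# shift(-1) moves every value up one row, so each row now sits next to
# the value that follows it; the final row has no successor and gets NaN.
next_vals = a.shift(-1)

# Element-wise comparison gives a boolean Series on the same index.
# NaN compares unequal to everything, so the final row is always True.
mask = next_vals != a

# Keep only the rows where the value differs from the one after it.
result = a.loc[mask]

print(mask.tolist())
print(result)
```

Because each row is compared with the row after it, this variant keeps the last element of every run of equal values.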

Another method is to use diff:

    In [82]: a.loc[a.diff() != 0]
    Out[82]:
    1    1
    2    2
    4    3
    5    2
    dtype: int64

But this is slower than the shift method if you have a large number of rows.
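If you want to check the speed difference on your own machine, a rough timing harness could look like the sketch below. The test data (random integers from 0 to 2), the Series size and the repeat count are arbitrary assumptions, and no timings are quoted here since they depend on hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: many short runs of repeated small integers.
rng = np.random.default_rng(0)
big = pd.Series(rng.integers(0, 3, size=200_000))

# Both approaches should select exactly the same rows on numeric data.
via_shift = big.loc[big.shift() != big]
via_diff = big.loc[big.diff() != 0]
same_result = via_shift.equals(via_diff)

# Time each expression over a few repetitions.
t_shift = timeit.timeit(lambda: big.loc[big.shift() != big], number=10)
t_diff = timeit.timeit(lambda: big.loc[big.diff() != 0], number=10)

print(f"same result: {same_result}")
print(f"shift: {t_shift:.4f}s  diff: {t_diff:.4f}s")
```

Note that `diff` relies on subtraction, so it is only an option for numeric data, whereas the `shift` comparison only needs `!=`.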

Update

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() since the default period is 1. This returns the first value of each consecutive run:

    In [87]: a.loc[a.shift() != a]
    Out[87]:
    1    1
    2    2
    4    3
    5    2
    dtype: int64

Note the difference in index values, thanks @BjarkeEbert!
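The two shift directions can be wrapped in one small helper to show the index difference side by side. This is only a sketch: `drop_consecutive_duplicates` and its `keep` parameter are hypothetical names, not part of pandas.

```python
import pandas as pd


def drop_consecutive_duplicates(s: pd.Series, keep: str = "first") -> pd.Series:
    """Drop repeated values within each consecutive run (hypothetical helper).

    keep="first" compares each row with the previous one (shift(1)),
    keep="last" compares each row with the next one (shift(-1)).
    Caveat: NaN != NaN is True, so consecutive NaN values are all kept.
    """
    if keep == "first":
        neighbour = s.shift(1)
    elif keep == "last":
        neighbour = s.shift(-1)
    else:
        raise ValueError("keep must be 'first' or 'last'")
    return s.loc[s != neighbour]


a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

first = drop_consecutive_duplicates(a, keep="first")
last = drop_consecutive_duplicates(a, keep="last")

print(first.index.tolist())
print(last.index.tolist())
```

The values are the same in both cases; only the surviving index label for the run of 2s differs (2 with keep="first", 3 with keep="last").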

EdChum answered Sep 18 '22 23:09