Pandas: Drop consecutive duplicates

Tags: python, pandas

What's the most efficient way to drop only consecutive duplicates in pandas?

drop_duplicates gives this:

In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

In [4]: a.drop_duplicates()
Out[4]:
1    1
2    2
4    3
dtype: int64

But I want this:

In [4]: a.something()
Out[4]:
1    1
2    2
4    3
5    2
dtype: int64


asked Oct 19 '13 08:10

Thomas Johnson

1 Answer

Use shift:

a.loc[a.shift(-1) != a]

Out[3]:
1    1
3    2
4    3
5    2
dtype: int64

The above uses a boolean mask: the Series is compared against itself shifted by -1 rows, and only the rows where the two differ are kept.
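To make the mask explicit, here is a minimal sketch that splits the one-liner into its steps. The variable names `next_values`, `mask` and `result` are only illustrative and are not part of the original answer.

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# shift(-1) moves every value up one row, so each row now holds the value
# of the row that follows it; the last row has no successor and becomes NaN.
next_values = a.shift(-1)

# True wherever a row differs from the row after it. NaN never compares
# equal to anything, so the last row is always True.
mask = next_values != a

# Keep only the rows where the mask is True.
result = a.loc[mask]
print(mask.tolist())
print(result)
```

Printing `mask` next to `a` shows which row of each run of equal values survives the filter.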

Another method is to use diff:

In [82]: a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

However, this is slower than the shift method when the Series has a large number of rows.
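One way to check the speed difference on your own machine is a rough timing sketch like the one below. The random test Series, its size and the repetition count are assumptions chosen for illustration, and the comparison uses `shift()` with its default period of 1 so that both expressions select the same rows. No timings are quoted here because they depend on the pandas version and hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: many short runs of repeated small integers.
rng = np.random.default_rng(0)
big = pd.Series(rng.integers(0, 3, size=200_000))

# Both expressions keep the first row of every run of equal values.
via_shift = big.loc[big.shift() != big]
via_diff = big.loc[big.diff() != 0]
same_result = via_shift.equals(via_diff)

# Time each approach over a few repetitions.
t_shift = timeit.timeit(lambda: big.loc[big.shift() != big], number=5)
t_diff = timeit.timeit(lambda: big.loc[big.diff() != 0], number=5)

print(f"same result: {same_result}")
print(f"shift: {t_shift:.4f}s  diff: {t_diff:.4f}s")
```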

Update

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift(), since the default period is 1. This keeps the first value of each consecutive run:

In [87]: a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

Note the difference in index values compared with shift(-1): the repeated 2 is now kept at index 2 rather than index 3. Thanks @BjarkeEbert!
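The two shift directions can be put side by side in a short sketch. The helper name `drop_consecutive_duplicates` is hypothetical (it is not a pandas method), and the string Series is an extra illustration, not taken from the question. Because the comparison only relies on `!=`, it should also work for non-numeric values.

```python
import pandas as pd


def drop_consecutive_duplicates(s: pd.Series) -> pd.Series:
    """Keep the first element of every run of equal consecutive values."""
    # shift() moves values down one row, so each row is compared with its
    # predecessor; the first row is compared with NaN and is always kept.
    return s.loc[s.shift() != s]


a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

first_of_run = drop_consecutive_duplicates(a)  # keeps index 1, 2, 4, 5
last_of_run = a.loc[a.shift(-1) != a]          # keeps index 1, 3, 4, 5

# Illustrative non-numeric data, stored with object dtype.
words = pd.Series(["x", "x", "y", "x"], dtype=object)
deduped_words = drop_consecutive_duplicates(words)  # keeps index 0, 2, 3

print(first_of_run.index.tolist())
print(last_of_run.index.tolist())
print(deduped_words.tolist())
```

The values are the same in both cases; only the index labels of the surviving rows differ, which matters if the index carries meaning such as a timestamp.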


answered Sep 18 '22 23:09

EdChum
