Remove pandas rows with duplicate indices

How to remove rows with duplicate index values?

In the weather DataFrame below, sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending a duplicate row to the end of a file.

I'm reading some automated weather data from the web (observations occur every 5 minutes and are compiled into monthly files for each weather station). After parsing a file, the DataFrame looks like:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

Example of a duplicate case:

import pandas
import datetime

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
# DatetimeIndex(start=..., end=...) was removed in pandas 1.0; date_range builds the same hourly index
index = pandas.date_range(start=startdate, end=enddate, freq='h')

data1 = {'A': range(6), 'B': range(6)}
data2 = {'A': [20, -30, 40], 'B': [-50, 60, -70]}
df1 = pandas.DataFrame(data=data1, index=index)
df2 = pandas.DataFrame(data=data2, index=index[:3])
# DataFrame.append was removed in pandas 2.0; concat stacks df2 on top of df1 the same way
df3 = pandas.concat([df2, df1])

df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

And so I need df3 to eventually become:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the group_by or pivot (or ???) statements to make that work.

526

asked Oct 23 '12 17:10

Paul H

2 Answers

I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep='first')]

While all the other methods work, .drop_duplicates is by far the slowest on the provided example. The groupby method is closer (roughly twice the time of duplicated in the timings below), and I find the duplicated method more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop

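Note that you can keep the last element instead by changing the keep argument to 'last'. That is the variant the question needs, since it wants the bottom-most row for each timestamp.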
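As a minimal sketch of the keep='last' variant, the snippet below rebuilds the question's sample data and keeps the last row for each timestamp. It assumes a recent pandas, so it uses date_range and concat in place of the DatetimeIndex(start=...) constructor and DataFrame.append from the question, which newer releases removed. The names mask, deduped, and grouped are only illustrative.

```python
import pandas as pd

# Rebuild the question's sample frames: six hourly timestamps, with df2
# holding the "erroneous" rows for the first three hours.
index = pd.date_range(start="2001-01-01 00:00", end="2001-01-01 05:00", periods=6)
df1 = pd.DataFrame({"A": range(6), "B": range(6)}, index=index)
df2 = pd.DataFrame({"A": [20, -30, 40], "B": [-50, 60, -70]}, index=index[:3])
df3 = pd.concat([df2, df1])

# True for every row whose index value occurs again further down the frame.
mask = df3.index.duplicated(keep="last")
deduped = df3[~mask]

# The groupby alternative mentioned above, for comparison.
grouped = df3.groupby(level=0).last()

print(deduped)
```

Here deduped has one row per timestamp, with A and B running from 0 to 5, which matches the frame the question asks for. The groupby version selects the same values but always returns them sorted by index.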

This method works with a MultiIndex as well (here df1 is the MultiIndex DataFrame from Paul's example, not the single-level df1 defined in the question):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
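Since the MultiIndex frame used for those timings is not shown here, the following is a small sketch with a hypothetical two-level index (the station and hour levels and the temp column are made up for illustration). It shows that a row is treated as a duplicate only when every level of its index key matches.

```python
import pandas as pd

# Hypothetical two-level index; the last row repeats the key ("KPDX", 0).
idx = pd.MultiIndex.from_tuples(
    [("KPDX", 0), ("KPDX", 1), ("KSEA", 0), ("KPDX", 0)],
    names=["station", "hour"],
)
df = pd.DataFrame({"temp": [4, 3, 5, 9]}, index=idx)

# Keep the last occurrence of each full (station, hour) key.
latest = df[~df.index.duplicated(keep="last")]

# Equivalent groupby form; note that it returns the keys in sorted order.
grouped = df.groupby(level=df.index.names).last()

print(latest)
```

The row for ("KSEA", 0) survives even though hour 0 also appears under "KPDX", because the comparison is on the whole tuple.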

188

answered Oct 02 '22 19:10

n8yoder

This adds the index as a DataFrame column, drops duplicates on that, then removes the new column:

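# reset_index() calls the new column 'index' only when the index is unnamed;
# for a named index (e.g. 'Date'), use that name in subset= and set_index()
df = (df.reset_index()
        .drop_duplicates(subset='index', keep='last')
        .set_index('index').sort_index())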

The trailing .sort_index() is optional; use it only if you need the result ordered by index.
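The weather DataFrame in the question has an index named Date, in which case reset_index() produces a Date column rather than an index column. A hedged sketch of the same chain that looks the name up instead of hard-coding it follows; the sample values and the key variable are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame with a named index, where 00:05 is reported twice.
dates = pd.DatetimeIndex(
    ["2001-01-01 00:05", "2001-01-01 00:00", "2001-01-01 00:05"], name="Date"
)
df = pd.DataFrame({"Temp": [4, 4, 3]}, index=dates)

# reset_index() names the new column after the index, or "index" if unnamed.
key = df.index.name or "index"

result = (
    df.reset_index()
      .drop_duplicates(subset=key, keep="last")
      .set_index(key)
      .sort_index()
)

print(result)
```

The result keeps the later 00:05 reading (Temp 3) and is ordered by Date because of the final sort_index() call.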

answered Oct 02 '22 18:10

D. A.

Tags: python, pandas, dataframe, duplicates
