How to remove rows with duplicate index values?
In the weather DataFrame below, sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending a duplicate row to the end of a file.
I'm reading some automated weather data from the web (observations occur every 5 minutes and are compiled into monthly files for each weather station). After parsing a file, the DataFrame looks like:
                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28
Example of a duplicate case:
import datetime
import pandas

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
# pandas.date_range replaces the removed DatetimeIndex(start=..., end=...) constructor
index = pandas.date_range(start=startdate, end=enddate, freq=datetime.timedelta(hours=1))
data1 = {'A' : range(6), 'B' : range(6)}
data2 = {'A' : [20, -30, 40], 'B' : [-50, 60, -70]}
df1 = pandas.DataFrame(data=data1, index=index)
df2 = pandas.DataFrame(data=data2, index=index[:3])
# pandas.concat replaces DataFrame.append, which was removed in pandas 2.0
df3 = pandas.concat([df2, df1])

df3
                        A    B
2001-01-01 00:00:00    20  -50
2001-01-01 01:00:00   -30   60
2001-01-01 02:00:00    40  -70
2001-01-01 03:00:00     3    3
2001-01-01 04:00:00     4    4
2001-01-01 05:00:00     5    5
2001-01-01 00:00:00     0    0
2001-01-01 01:00:00     1    1
2001-01-01 02:00:00     2    2
And so I need df3 to eventually become:
                     A  B
2001-01-01 00:00:00  0  0
2001-01-01 01:00:00  1  1
2001-01-01 02:00:00  2  2
2001-01-01 03:00:00  3  3
2001-01-01 04:00:00  4  4
2001-01-01 05:00:00  5  5
I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the groupby or pivot (or ???) statements to make that work.
pandas.Index.drop_duplicates() returns a new Index with duplicate values removed. Removing duplicate data is a common step in data analysis.
You can set keep=False in drop_duplicates() to remove every row that has a duplicate, for example df.drop_duplicates(keep=False). Note that DataFrame.drop_duplicates() compares column values, not the index.
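To make the distinction concrete, here is a minimal sketch on a made-up three-row frame (the data and the names by_values and by_index are only for illustration). It contrasts drop_duplicates(keep=False), which looks at column values, with a mask built from the index:

```python
import pandas as pd

# Hypothetical data: rows 'x' and 'y' share the same value, and label 'x' appears twice.
df = pd.DataFrame({"A": [1, 1, 2]}, index=["x", "y", "x"])

# Compares column values: both rows with A == 1 are dropped.
by_values = df.drop_duplicates(keep=False)

# Compares index labels: both rows labelled 'x' are dropped.
by_index = df[~df.index.duplicated(keep=False)]

print(by_values)
print(by_index)
```

So if the goal is to deduplicate on the index, the mask on df.index is the one that does it.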
Index.duplicated() indicates duplicate index values: duplicated values are marked True in the resulting boolean array. The keep argument selects whether all duplicates (keep=False), all except the first occurrence (keep='first'), or all except the last occurrence (keep='last') are marked.
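As a small illustration of the three keep options, assuming a toy string index invented for this example:

```python
import pandas as pd

idx = pd.Index(["a", "b", "a", "c", "a"])

# Marks every repeat except the first occurrence.
first_mask = idx.duplicated(keep="first")
# Marks every repeat except the last occurrence.
last_mask = idx.duplicated(keep="last")
# Marks every value that occurs more than once.
all_mask = idx.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, False, True]
print(last_mask.tolist())   # [True, False, True, False, False]
print(all_mask.tolist())    # [True, False, True, False, True]
```

Negating any of these masks with ~ gives the rows to keep.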
By default (keep='first'), drop_duplicates() drops every occurrence of a duplicated value except the first.
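A short sketch of Index.drop_duplicates() on an invented index may help here. Note that it returns only the deduplicated index labels, not the DataFrame rows, so on its own it does not tell you which rows to keep:

```python
import pandas as pd

idx = pd.Index(["a", "b", "a", "c", "a"])

keep_first = idx.drop_duplicates()            # default keep='first'
keep_last = idx.drop_duplicates(keep="last")  # keep the last occurrence instead
keep_none = idx.drop_duplicates(keep=False)   # drop every value that repeats

print(keep_first.tolist())
print(keep_last.tolist())
print(keep_none.tolist())
```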
I would suggest using the duplicated method on the Pandas Index itself:
df3 = df3[~df3.index.duplicated(keep='first')]
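For anyone who wants to try this end to end, here is a self-contained sketch that rebuilds sample data like the question's with current pandas calls (pd.date_range and pd.concat are my substitutions for the older constructor and append) and then applies the mask:

```python
import pandas as pd

index = pd.date_range("2001-01-01 00:00", periods=6, freq=pd.Timedelta(hours=1))
df1 = pd.DataFrame({"A": list(range(6)), "B": list(range(6))}, index=index)
df2 = pd.DataFrame({"A": [20, -30, 40], "B": [-50, 60, -70]}, index=index[:3])
df3 = pd.concat([df2, df1])  # nine rows, three timestamps appear twice

# True for every row whose index label was already seen earlier in the frame.
is_repeat = df3.index.duplicated(keep="first")

# Keep only the first row for each timestamp.
deduped = df3[~is_repeat]
print(deduped)
```

The mask is an ordinary boolean NumPy array, so the row selection is plain boolean indexing.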
While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.
Using the sample data provided:
>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop
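One thing worth checking before swapping these approaches for one another: as far as I know, groupby(...).first() takes the first non-null value per column within each group, while the duplicated mask selects whole rows. On data without missing values they agree, but a small made-up frame with a NaN shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first 'x' row has a missing value in column A.
df = pd.DataFrame({"A": [np.nan, 1.0], "B": [5.0, 6.0]}, index=["x", "x"])

# Row-wise: keeps the first row exactly as it is, NaN included.
by_mask = df[~df.index.duplicated(keep="first")]

# Column-wise: skips the NaN in A and takes 1.0 from the second row.
by_group = df.groupby(level=0).first()

print(by_mask)
print(by_group)
```

groupby also sorts the result by the group keys by default, whereas the mask preserves the original row order.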
Note that you can keep the last element by changing the keep argument to 'last'.
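Since the question wants the most recently appended row for each timestamp, keep='last' is probably the variant to use there. A sketch on rebuilt sample data (pd.date_range and pd.concat are my substitutions for the older calls), with a sort_index() added in case the surviving rows are out of chronological order:

```python
import pandas as pd

index = pd.date_range("2001-01-01 00:00", periods=6, freq=pd.Timedelta(hours=1))
df1 = pd.DataFrame({"A": list(range(6)), "B": list(range(6))}, index=index)
df2 = pd.DataFrame({"A": [20, -30, 40], "B": [-50, 60, -70]}, index=index[:3])
df3 = pd.concat([df2, df1])

# Keep the bottom-most row for each timestamp, then restore chronological order.
latest = df3[~df3.index.duplicated(keep="last")].sort_index()
print(latest)
```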
It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):
>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
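The MultiIndex frame from the referenced example is not reproduced here, so the following is only a sketch on invented data (the level names station and obs and the temp column are hypothetical). It shows that duplicated() treats each full tuple of level values as the key:

```python
import pandas as pd

# Hypothetical two-level index; the tuple ("KPDX", 0) appears twice.
mi = pd.MultiIndex.from_tuples(
    [("KPDX", 0), ("KPDX", 1), ("KSEA", 0), ("KPDX", 0)],
    names=["station", "obs"],
)
df = pd.DataFrame({"temp": [4, 3, 5, 9]}, index=mi)

# Keep the last row for each (station, obs) combination.
deduped = df[~df.index.duplicated(keep="last")]
print(deduped)
```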
This turns the index into a DataFrame column, drops duplicates on that column, then sets it back as the index:
df = (df.reset_index()
        .drop_duplicates(subset='index', keep='last')
        .set_index('index').sort_index())
Note that the .sort_index() call at the end is optional; use it only if needed.
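One caveat with the reset_index approach: the new column is only called 'index' when the index has no name. With a named index, such as Date in the weather data, the column takes that name instead. A sketch that handles both cases, on made-up data where key is a variable introduced for illustration:

```python
import pandas as pd

index = pd.date_range("2001-01-01", periods=3, freq=pd.Timedelta(hours=1), name="Date")
df = pd.DataFrame({"A": [0, 1, 2]}, index=index)
# Append a corrected row for the first timestamp.
df = pd.concat([df, pd.DataFrame({"A": [10]}, index=index[:1])])

# reset_index() names the new column after the index, or 'index' if it is unnamed.
key = df.index.name or "index"

deduped = (df.reset_index()
             .drop_duplicates(subset=key, keep="last")
             .set_index(key)
             .sort_index())
print(deduped)
```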