Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to keep first two duplicates in a pandas dataframe?

I have a question in regards to finding duplicates in a dataframe, and removing duplicates in a dataframe using a specific column. Here is what I am trying to accomplish:

Is it possible to remove duplicates but keep the first 2?

Here is an example of my current dataframe called df and take a look at the bracket notes I have placed below to give you an idea.

Note: If 'Roll' = 1 then I want to look at the Date column, see if there is a second duplicate Date in that column... keep those two and delete any others.

    Date    Open    High     Low      Close  Roll  Dupes
1  19780106  236.00  237.50  234.50  235.50     0    NaN
2  19780113  235.50  239.00  235.00  238.25     0    NaN
3  19780120  238.00  239.00  234.50  237.00     0    NaN
4  19780127  237.00  238.50  235.50  236.00     1    NaN (KEEP)  
5  19780203  236.00  236.00  232.25  233.50     0    NaN (KEEP)
6  19780127  237.00  238.50  235.50  236.00     0    NaN (KEEP)
7  19780203  236.00  236.00  232.25  233.50     0    NaN (DELETE)
8  19780127  237.00  238.50  235.50  236.00     0    NaN (DELETE)
9  19780203  236.00  236.00  232.25  233.50     0    NaN (DELETE)

This is what is currently removing the dupes BUT it's removing all dupes (obviously)

df = df.drop_duplicates('Date')

EDIT: I forgot to mention something, the only duplicate I want to keep is if column 'Roll' = 1 if it does, then keep that row and the next one that matches based on column 'Date'

like image 261
antonio_zeus Avatar asked Sep 11 '15 19:09

antonio_zeus


People also ask

Does Pandas drop duplicates keep first?

Only consider certain columns for identifying duplicates, by default use all of the columns. Determines which duplicates (if any) to keep. - first : Drop duplicates except for the first occurrence.

How does Pandas handle duplicates in DataFrame?

We can use Pandas built-in method drop_duplicates() to drop duplicate rows. Note that we started out as 80 rows, now it's 77. By default, this method returns a new DataFrame with duplicate rows removed. We can set the argument inplace=True to remove duplicates from the original DataFrame.

How do I reindex a data frame?

Reindexing the columns using axis keyword One can reindex a single column or multiple columns by using reindex() method and by specifying the axis we want to reindex. Default values in the new index that are not present in the dataframe are assigned NaN.

How do you mark duplicates in Pandas?

The pandas. DataFrame. duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean series which identifies whether a row is duplicate or unique.


2 Answers

Using head with a groupby keeps the first x entries in each group, which I think accomplishes what you want.

In [52]: df.groupby('Date').head(2)
Out[52]: 
       Date   Open   High     Low   Close  Roll
1  19780106  236.0  237.5  234.50  235.50     0
2  19780113  235.5  239.0  235.00  238.25     0
3  19780120  238.0  239.0  234.50  237.00     0
4  19780127  237.0  238.5  235.50  236.00     0
5  19780203  236.0  236.0  232.25  233.50     0
6  19780127  237.0  238.5  235.50  236.00     0
7  19780203  236.0  236.0  232.25  233.50     0

Edit:

In [16]: df['dupe_count'] = df.groupby('Date')['Roll'].transform('max') + 1

In [17]: df.groupby('Date', as_index=False).apply(lambda x: x.head(x['dupe_count'].iloc[0]))
Out[17]: 
         Date   Open   High     Low   Close  Roll  Dupes  dupe_count
0 1  19780106  236.0  237.5  234.50  235.50     0    NaN           1
1 2  19780113  235.5  239.0  235.00  238.25     0    NaN           1
2 3  19780120  238.0  239.0  234.50  237.00     0    NaN           1
3 4  19780127  237.0  238.5  235.50  236.00     1    NaN           2
  6  19780127  237.0  238.5  235.50  236.00     0    NaN           2
4 5  19780203  236.0  236.0  232.25  233.50     0    NaN           1
like image 137
chrisb Avatar answered Sep 17 '22 23:09

chrisb


Assuming Roll can only take the values 0 and 1, if you do

df.groupby(['Date', 'Roll'], as_index=False).first() 

you will get two rows for dates for which one of the rows had Roll = 1 and only one row for dates which have only Roll = 0, which I think is what you want.
If passed as_index=False so that the group keys don't end up in the index as discussed in your comment.

like image 25
JoeCondron Avatar answered Sep 20 '22 23:09

JoeCondron