I have a question regarding finding duplicates in a dataframe, and removing duplicates from a dataframe based on a specific column. Here is what I am trying to accomplish:
Is it possible to remove duplicates but keep the first 2?
Here is an example of my current dataframe, called df; take a look at the bracketed notes I have placed below to give you an idea.
Note: if 'Roll' = 1, then I want to look at the Date column, see whether there is a second duplicate Date in that column, keep those two rows, and delete any others.
Date Open High Low Close Roll Dupes
1 19780106 236.00 237.50 234.50 235.50 0 NaN
2 19780113 235.50 239.00 235.00 238.25 0 NaN
3 19780120 238.00 239.00 234.50 237.00 0 NaN
4 19780127 237.00 238.50 235.50 236.00 1 NaN (KEEP)
5 19780203 236.00 236.00 232.25 233.50 0 NaN (KEEP)
6 19780127 237.00 238.50 235.50 236.00 0 NaN (KEEP)
7 19780203 236.00 236.00 232.25 233.50 0 NaN (DELETE)
8 19780127 237.00 238.50 235.50 236.00 0 NaN (DELETE)
9 19780203 236.00 236.00 232.25 233.50 0 NaN (DELETE)
This is what currently removes the dupes, BUT it removes all of them (obviously):
df = df.drop_duplicates('Date')
EDIT: I forgot to mention something: the only duplicate I want to keep is when column 'Roll' = 1. If it is, then keep that row and the next row that matches it on column 'Date'.
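For context, the question's setup can be reproduced as a minimal sketch (the frame below rebuilds the example table from the question; the index values 1-9 match the row labels shown there):

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'Date':  [19780106, 19780113, 19780120, 19780127, 19780203,
              19780127, 19780203, 19780127, 19780203],
    'Open':  [236.00, 235.50, 238.00, 237.00, 236.00, 237.00, 236.00, 237.00, 236.00],
    'High':  [237.50, 239.00, 239.00, 238.50, 236.00, 238.50, 236.00, 238.50, 236.00],
    'Low':   [234.50, 235.00, 234.50, 235.50, 232.25, 235.50, 232.25, 235.50, 232.25],
    'Close': [235.50, 238.25, 237.00, 236.00, 233.50, 236.00, 233.50, 236.00, 233.50],
    'Roll':  [0, 0, 0, 1, 0, 0, 0, 0, 0],
}, index=range(1, 10))

# drop_duplicates keeps only the FIRST row per Date -- too aggressive here:
deduped = df.drop_duplicates('Date')
print(len(deduped))  # 5 unique dates, so only 5 rows survive
```

This demonstrates the problem: `drop_duplicates('Date')` cannot keep "the first two" of a duplicated date.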
From the drop_duplicates() documentation: only the columns you pass are considered for identifying duplicates (by default, all columns are used), and keep='first' drops every duplicate except the first occurrence. The method returns a new DataFrame by default; pass inplace=True to drop duplicates from the original DataFrame. The related pandas.DataFrame.duplicated() method finds duplicate rows in a DataFrame and returns a boolean Series identifying whether each row is a duplicate or unique.
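A quick sketch of `duplicated()` on a toy frame (values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Date':  [19780127, 19780203, 19780127],
                   'Close': [236.00, 233.50, 236.00]})

# duplicated() flags each row that repeats an earlier one
# (keep='first' is the default, so the first occurrence is not flagged).
mask = df.duplicated('Date')
print(mask.tolist())  # [False, False, True]
```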
Using head with a groupby keeps the first x entries in each group, which I think accomplishes what you want.
In [52]: df.groupby('Date').head(2)
Out[52]:
Date Open High Low Close Roll
1 19780106 236.0 237.5 234.50 235.50 0
2 19780113 235.5 239.0 235.00 238.25 0
3 19780120 238.0 239.0 234.50 237.00 0
4 19780127 237.0 238.5 235.50 236.00 1
5 19780203 236.0 236.0 232.25 233.50 0
6 19780127 237.0 238.5 235.50 236.00 0
7 19780203 236.0 236.0 232.25 233.50 0
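The `head(2)` idiom can be checked on a small toy frame (the data below is a reduced, hypothetical version of the question's table):

```python
import pandas as pd

# Toy frame: 19780127 appears three times, 19780203 twice.
df = pd.DataFrame({
    'Date': [19780106, 19780127, 19780203, 19780127, 19780203, 19780127],
    'Roll': [0, 1, 0, 0, 0, 0],
})

# head(2) inside a groupby keeps at most the first two rows per Date.
first_two = df.groupby('Date').head(2)
print(len(first_two))  # 5 rows: the third 19780127 row is dropped
```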
Edit:
In [16]: df['dupe_count'] = df.groupby('Date')['Roll'].transform('max') + 1
In [17]: df.groupby('Date', as_index=False).apply(lambda x: x.head(x['dupe_count'].iloc[0]))
Out[17]:
Date Open High Low Close Roll Dupes dupe_count
0 1 19780106 236.0 237.5 234.50 235.50 0 NaN 1
1 2 19780113 235.5 239.0 235.00 238.25 0 NaN 1
2 3 19780120 238.0 239.0 234.50 237.00 0 NaN 1
3 4 19780127 237.0 238.5 235.50 236.00 1 NaN 2
3 6 19780127 237.0 238.5 235.50 236.00 0 NaN 2
4 5 19780203 236.0 236.0 232.25 233.50 0 NaN 1
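The two-step approach above can be run end to end as a sketch; note the `group_keys=False` argument is an assumption I've added so the result keeps the original row labels (modern pandas otherwise prepends a group level, as seen in the output above):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': [19780106, 19780113, 19780120, 19780127, 19780203,
             19780127, 19780203, 19780127, 19780203],
    'Roll': [0, 0, 0, 1, 0, 0, 0, 0, 0],
}, index=range(1, 10))

# Rows to keep per date: 2 when any row for that date has Roll == 1, else 1.
df['dupe_count'] = df.groupby('Date')['Roll'].transform('max') + 1

# Keep the first dupe_count rows of each Date group.
kept = (df.groupby('Date', group_keys=False)
          .apply(lambda g: g.head(g['dupe_count'].iloc[0])))
print(sorted(kept.index))  # rows 1-6 survive; 7, 8, 9 are dropped
```

A warning-free alternative to the `apply` is a `cumcount` filter: `df[df.groupby('Date').cumcount() < df['dupe_count']]` keeps the same rows without calling a Python function per group.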
Assuming Roll can only take the values 0 and 1, if you do
df.groupby(['Date', 'Roll'], as_index=False).first()
you will get two rows for dates for which one of the rows had Roll = 1, and only one row for dates which have only Roll = 0, which I think is what you want.
as_index=False is passed so that the group keys don't end up in the index, as discussed in your comment.
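A quick sketch of this grouping on toy data (hypothetical values) shows the row counts per date:

```python
import pandas as pd

df = pd.DataFrame({
    'Date':  [19780127, 19780203, 19780127, 19780203, 19780127],
    'Roll':  [1, 0, 0, 0, 0],
    'Close': [236.00, 233.50, 236.00, 233.50, 236.00],
})

# One row per (Date, Roll) pair: a date with a Roll == 1 row keeps two rows,
# a date with only Roll == 0 keeps one.
out = df.groupby(['Date', 'Roll'], as_index=False).first()
print(out['Date'].value_counts().to_dict())  # {19780127: 2, 19780203: 1}
```

Note this approach only keeps two rows when the duplicate values in Roll actually differ; it relies on Roll being 0/1 as assumed above.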