I have a question regarding finding duplicates in a dataframe, and removing duplicates from a dataframe based on a specific column. Here is what I am trying to accomplish:
Is it possible to remove duplicates but keep the first 2?
Here is an example of my current dataframe, called df; take a look at the bracketed notes I have placed below to give you an idea.
Note: if 'Roll' = 1, then I want to look at the Date column, see whether there is a second duplicate Date in that column, keep those two rows, and delete any others.
Date Open High Low Close Roll Dupes
1 19780106 236.00 237.50 234.50 235.50 0 NaN
2 19780113 235.50 239.00 235.00 238.25 0 NaN
3 19780120 238.00 239.00 234.50 237.00 0 NaN
4 19780127 237.00 238.50 235.50 236.00 1 NaN (KEEP)
5 19780203 236.00 236.00 232.25 233.50 0 NaN (KEEP)
6 19780127 237.00 238.50 235.50 236.00 0 NaN (KEEP)
7 19780203 236.00 236.00 232.25 233.50 0 NaN (DELETE)
8 19780127 237.00 238.50 235.50 236.00 0 NaN (DELETE)
9 19780203 236.00 236.00 232.25 233.50 0 NaN (DELETE)
This is what currently removes the dupes, BUT it removes all of them (obviously):
df = df.drop_duplicates('Date')
EDIT: I forgot to mention something: the only duplicate I want to keep is when column 'Roll' = 1. If it is, then keep that row and the next row that matches it on column 'Date'.
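For context, the question's setup can be reproduced as a minimal sketch (the frame below rebuilds the example table from the question; the index values 1-9 match the row labels shown there):

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'Date':  [19780106, 19780113, 19780120, 19780127, 19780203,
              19780127, 19780203, 19780127, 19780203],
    'Open':  [236.00, 235.50, 238.00, 237.00, 236.00, 237.00, 236.00, 237.00, 236.00],
    'High':  [237.50, 239.00, 239.00, 238.50, 236.00, 238.50, 236.00, 238.50, 236.00],
    'Low':   [234.50, 235.00, 234.50, 235.50, 232.25, 235.50, 232.25, 235.50, 232.25],
    'Close': [235.50, 238.25, 237.00, 236.00, 233.50, 236.00, 233.50, 236.00, 233.50],
    'Roll':  [0, 0, 0, 1, 0, 0, 0, 0, 0],
}, index=range(1, 10))

# drop_duplicates keeps only the FIRST row per Date -- too aggressive here:
deduped = df.drop_duplicates('Date')
print(len(deduped))  # 5 unique dates, so only 5 rows survive
```

This demonstrates the problem: `drop_duplicates('Date')` cannot keep "the first two" of a duplicated date.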
From the drop_duplicates() documentation: only the columns you pass are considered for identifying duplicates (by default, all columns are used), and keep='first' drops every duplicate except the first occurrence. The method returns a new DataFrame by default; pass inplace=True to drop duplicates from the original DataFrame. The related pandas.DataFrame.duplicated() method finds duplicate rows in a DataFrame and returns a boolean Series identifying whether each row is a duplicate or unique.
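A quick sketch of `duplicated()` on a toy frame (values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Date':  [19780127, 19780203, 19780127],
                   'Close': [236.00, 233.50, 236.00]})

# duplicated() flags each row that repeats an earlier one
# (keep='first' is the default, so the first occurrence is not flagged).
mask = df.duplicated('Date')
print(mask.tolist())  # [False, False, True]
```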
Using head with a groupby keeps the first x entries in each group, which I think accomplishes what you want.
In [52]: df.groupby('Date').head(2)
Out[52]:
Date Open High Low Close Roll
1 19780106 236.0 237.5 234.50 235.50 0
2 19780113 235.5 239.0 235.00 238.25 0
3 19780120 238.0 239.0 234.50 237.00 0
4 19780127 237.0 238.5 235.50 236.00 1
5 19780203 236.0 236.0 232.25 233.50 0
6 19780127 237.0 238.5 235.50 236.00 0
7 19780203 236.0 236.0 232.25 233.50 0
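The `head(2)` idiom can be checked on a small toy frame (the data below is a reduced, hypothetical version of the question's table):

```python
import pandas as pd

# Toy frame: 19780127 appears three times, 19780203 twice.
df = pd.DataFrame({
    'Date': [19780106, 19780127, 19780203, 19780127, 19780203, 19780127],
    'Roll': [0, 1, 0, 0, 0, 0],
})

# head(2) inside a groupby keeps at most the first two rows per Date.
first_two = df.groupby('Date').head(2)
print(len(first_two))  # 5 rows: the third 19780127 row is dropped
```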
Edit:
In [16]: df['dupe_count'] = df.groupby('Date')['Roll'].transform('max') + 1
In [17]: df.groupby('Date', as_index=False).apply(lambda x: x.head(x['dupe_count'].iloc[0]))
Out[17]:
Date Open High Low Close Roll Dupes dupe_count
0 1 19780106 236.0 237.5 234.50 235.50 0 NaN 1
1 2 19780113 235.5 239.0 235.00 238.25 0 NaN 1
2 3 19780120 238.0 239.0 234.50 237.00 0 NaN 1
3 4 19780127 237.0 238.5 235.50 236.00 1 NaN 2
3 6 19780127 237.0 238.5 235.50 236.00 0 NaN 2
4 5 19780203 236.0 236.0 232.25 233.50 0 NaN 1
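The two-step approach above can be run end to end as a sketch; note the `group_keys=False` argument is an assumption I've added so the result keeps the original row labels (modern pandas otherwise prepends a group level, as seen in the output above):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': [19780106, 19780113, 19780120, 19780127, 19780203,
             19780127, 19780203, 19780127, 19780203],
    'Roll': [0, 0, 0, 1, 0, 0, 0, 0, 0],
}, index=range(1, 10))

# Rows to keep per date: 2 when any row for that date has Roll == 1, else 1.
df['dupe_count'] = df.groupby('Date')['Roll'].transform('max') + 1

# Keep the first dupe_count rows of each Date group.
kept = (df.groupby('Date', group_keys=False)
          .apply(lambda g: g.head(g['dupe_count'].iloc[0])))
print(sorted(kept.index))  # rows 1-6 survive; 7, 8, 9 are dropped
```

A warning-free alternative to the `apply` is a `cumcount` filter: `df[df.groupby('Date').cumcount() < df['dupe_count']]` keeps the same rows without calling a Python function per group.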
Assuming Roll can only take the values 0 and 1, if you do
df.groupby(['Date', 'Roll'], as_index=False).first()
you will get two rows for dates for which one of the rows had Roll = 1, and only one row for dates which have only Roll = 0, which I think is what you want.
as_index=False is passed so that the group keys don't end up in the index, as discussed in your comment.
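A quick sketch of this grouping on toy data (hypothetical values) shows the row counts per date:

```python
import pandas as pd

df = pd.DataFrame({
    'Date':  [19780127, 19780203, 19780127, 19780203, 19780127],
    'Roll':  [1, 0, 0, 0, 0],
    'Close': [236.00, 233.50, 236.00, 233.50, 236.00],
})

# One row per (Date, Roll) pair: a date with a Roll == 1 row keeps two rows,
# a date with only Roll == 0 keeps one.
out = df.groupby(['Date', 'Roll'], as_index=False).first()
print(out['Date'].value_counts().to_dict())  # {19780127: 2, 19780203: 1}
```

Note this approach only keeps two rows when the duplicate values in Roll actually differ; it relies on Roll being 0/1 as assumed above.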