I am looking to remove duplicates "within" a group. How can I do this in the most efficient way?
I have tried just grouping the data by ID, but since a company can raise the same type of investment round in different years, that approach gives me the wrong result (a sketch of this attempt follows the tables below).
I have data like this:
+----+-----------+-----------+---------------+
| ID | Type      | seed_year | series_a_year |
+----+-----------+-----------+---------------+
|  1 | seed      |      2014 |             0 |
|  2 | seed      |      2014 |             0 |
|  2 | seed      |      2015 |             0 |
|  3 | seed      |      2012 |             0 |
|  3 | series_a  |         0 |          2014 |
|  3 | series_a  |         0 |          2015 |
+----+-----------+-----------+---------------+
Where my desired output would be:
+----+----------+-----------+---------------+
| ID | Type     | seed_year | series_a_year |
+----+----------+-----------+---------------+
|  1 | seed     |      2014 |             0 |
|  2 | seed     |      2014 |             0 |
|  3 | seed     |      2012 |             0 |
|  3 | series_a |         0 |          2014 |
+----+----------+-----------+---------------+
I would like to keep the first (oldest) funding round.
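In code, the data and the failing attempt look roughly like this (a sketch, reading "grouping by ID" as dropping duplicates on ID alone):

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 2, 3, 3, 3],
    "Type": ["seed", "seed", "seed", "seed", "series_a", "series_a"],
    "seed_year": [2014, 2014, 2015, 2012, 0, 0],
    "series_a_year": [0, 0, 0, 0, 2014, 2015],
})

# Keeps only one row per company, so company 3's series_a rounds
# disappear entirely, which is not what I want.
df.drop_duplicates(subset=["ID"], keep="first")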
You can use the 'subset' argument of .drop_duplicates():
df.drop_duplicates(subset=['ID', 'Type'], keep='first')
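Note that keep='first' keeps whichever duplicate happens to appear first in the frame, so this only returns the oldest round if the rows are already in chronological order. If they might not be, sort first. A sketch with your df (it uses the row-wise max of the two year columns as the round's year, which works because exactly one of them is non-zero per row):

# The row-wise max picks out the single non-zero year per row.
year = df[["seed_year", "series_a_year"]].max(axis=1)

result = (
    df.loc[year.sort_values().index]        # oldest rounds first
      .drop_duplicates(subset=["ID", "Type"], keep="first")
      .sort_index()                         # restore original row order
)
print(result)

On your sample data this yields exactly the desired output: one row per (ID, Type) pair, keeping the earliest year.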