 

How to remove duplicates within a group in Pandas

I am looking to remove duplicates "within" a group. How can I do this in the most efficient way?

I have tried just grouping the data by ID, but since the companies can raise the same type of investment rounds in different years, this approach leads me to a wrong result.

I have data like this:

+----+-----------+-----------+---------------+
| ID |   Type    | seed_year | series_a_year |
+----+-----------+-----------+---------------+
|  1 | seed      |      2014 |             0 |
|  2 | seed      |      2014 |             0 |
|  2 | seed      |      2015 |             0 |
|  3 | seed      |      2012 |             0 |
|  3 | series_a  |         0 |          2014 |
|  3 | series_a  |         0 |          2015 |
+----+-----------+-----------+---------------+

Where my desired output would be:

+----+----------+-----------+---------------+
| ID |   Type   | seed_year | series_a_year |
+----+----------+-----------+---------------+
|  1 | seed     |      2014 |             0 |
|  2 | seed     |      2014 |             0 |
|  3 | seed     |      2012 |             0 |
|  3 | series_a |         0 |          2014 |
+----+----------+-----------+---------------+

I would like to keep the first (oldest) funding round.

Asked Oct 31 '25 by dabe16ab


1 Answer

You can use the subset argument of .drop_duplicates() to define duplicates on the ID and Type columns only:

df.drop_duplicates(subset=['ID', 'Type'], keep='first')
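One caveat: keep='first' retains whichever duplicate happens to appear first in row order, so if the frame is not already sorted by year, sort it first to guarantee the oldest round survives. A minimal sketch using the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 2, 2, 3, 3, 3],
    "Type": ["seed", "seed", "seed", "seed", "series_a", "series_a"],
    "seed_year": [2014, 2014, 2015, 2012, 0, 0],
    "series_a_year": [0, 0, 0, 0, 2014, 2015],
})

# Sort by year first so that, within each (ID, Type) group, the
# oldest round comes first and keep='first' retains it.
result = (
    df.sort_values(["seed_year", "series_a_year"])
      .drop_duplicates(subset=["ID", "Type"], keep="first")
      .sort_values("ID")
      .reset_index(drop=True)
)
print(result)
```

This reproduces the desired output: one row per (ID, Type) pair, keeping the earliest year in each group.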
Answered Nov 03 '25 by KenHBS


