I am looking to remove duplicates "within" a group. How can I do this in the most efficient way?
I have tried just grouping the data by ID, but since a company can raise the same type of investment round in different years, that approach gives me the wrong result (a sketch of this attempt follows the tables below).
I have data like this:
+----+-----------+-----------+---------------+
| ID | Type      | seed_year | series_a_year |
+----+-----------+-----------+---------------+
|  1 | seed      |      2014 |             0 |
|  2 | seed      |      2014 |             0 |
|  2 | seed      |      2015 |             0 |
|  3 | seed      |      2012 |             0 |
|  3 | series_a  |         0 |          2014 |
|  3 | series_a  |         0 |          2015 |
+----+-----------+-----------+---------------+
Where my desired output would be:
+----+----------+-----------+---------------+
| ID | Type     | seed_year | series_a_year |
+----+----------+-----------+---------------+
|  1 | seed     |      2014 |             0 |
|  2 | seed     |      2014 |             0 |
|  3 | seed     |      2012 |             0 |
|  3 | series_a |         0 |          2014 |
+----+----------+-----------+---------------+
I would like to keep the first (oldest) funding round.
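In code, the data and the failing attempt look roughly like this (a sketch, reading "grouping by ID" as dropping duplicates on ID alone):

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 2, 3, 3, 3],
    "Type": ["seed", "seed", "seed", "seed", "series_a", "series_a"],
    "seed_year": [2014, 2014, 2015, 2012, 0, 0],
    "series_a_year": [0, 0, 0, 0, 2014, 2015],
})

# Keeps only one row per company, so company 3's series_a rounds
# disappear entirely, which is not what I want.
df.drop_duplicates(subset=["ID"], keep="first")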
You can use the 'subset' argument of .drop_duplicates():
df.drop_duplicates(subset=['ID', 'Type'], keep='first')
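Note that keep='first' keeps whichever duplicate happens to appear first in the frame, so this only returns the oldest round if the rows are already in chronological order. If they might not be, sort first. A sketch with your df (it uses the row-wise max of the two year columns as the round's year, which works because exactly one of them is non-zero per row):

# The row-wise max picks out the single non-zero year per row.
year = df[["seed_year", "series_a_year"]].max(axis=1)

result = (
    df.loc[year.sort_values().index]        # oldest rounds first
      .drop_duplicates(subset=["ID", "Type"], keep="first")
      .sort_index()                         # restore original row order
)
print(result)

On your sample data this yields exactly the desired output: one row per (ID, Type) pair, keeping the earliest year.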