Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to keep earliest record for each category but without considering the extra columns?

Let's say I have a data table with 3 columns:

Category             Color              Date
triangle             red                2017-10-10
square               yellow             2017-11-10
triangle             blue               2017-02-10
circle               yellow             2017-07-10
circle               red                2017-09-10

I want to find out the earliest date by each category. So my desired output is:

Category             Color              Date
square               yellow             2017-11-10
triangle             blue               2017-02-10
circle               yellow             2017-07-10

I've looked through a couple posts about how to do this:

Finding the min date in a Pandas DF row and create new Column

Pandas groupby category, rating, get top value from each category?

With Pandas in Python, select the highest value row for each group

and more.

A popular method is the groupby method:

df.groupby('Category').first().reset_index() 

But if I use this method, then it'll group by Category, but it'll keep both records for triangle since it has two different colors.

Is there a better and more efficient way to do this?

like image 530
alwaysaskingquestions Avatar asked Dec 12 '25 14:12

alwaysaskingquestions


2 Answers

You could use sort_values + drop_duplicates:

df.sort_values(['Date']).drop_duplicates('Category', keep='first')

   Category   Color        Date
2  triangle    blue  2017-02-10
3    circle  yellow  2017-07-10
1    square  yellow  2017-11-10

If you want to preserve the original order of Category, you'll need to sort on a groupby call:

df.groupby('Category', group_keys=False, sort=False)\
  .apply(lambda x: x.sort_values('Date'))\
  .drop_duplicates('Category', keep='first')

   Category   Color        Date
2  triangle    blue  2017-02-10
1    square  yellow  2017-11-10
3    circle  yellow  2017-07-10
like image 101
cs95 Avatar answered Dec 14 '25 04:12

cs95


The following should give you the desired output; compare to what you posted I first sort the values according to the date as you want to keep the earliest date per category:

df.sort_values('Date').groupby('Category').first().reset_index()

That gives the desired output:

   Category   Color        Date
0    circle  yellow  2017-07-10
1    square  yellow  2017-11-10
2  triangle    blue  2017-02-10

EDIT

Thanks to @Wen in the comments, one can make this call also more efficient by doing:

df.sort_values('Date').groupby('Category', as_index=False).first()

which also gives

   Category   Color        Date
0    circle  yellow  2017-07-10
1    square  yellow  2017-11-10
2  triangle    blue  2017-02-10
like image 40
Cleb Avatar answered Dec 14 '25 04:12

Cleb