
percentile rank in pandas in groups

I can't quite figure out how to write a function to compute a grouped percentile. I have all teams from the years 1985-2012 in a data frame; the first 10 rows are shown below, and it's currently sorted by year. I want a percentile for LgRnk grouped by Year. For instance, a LgRnk of 23 (worst team) in 1985 would be the 100th percentile and a LgRnk of 1 (best team) in 1985 would be the 1st percentile; a LgRnk of 30 (worst team) in 2010 would be the 100th percentile, etc. It needs to be grouped by year because the number of LgRnk values differs from year to year.

    Team                WLPer   Year LgRnk   W  L
19  Sacramento Kings    0.378   1985    18  31  51
0   Atlanta Hawks       0.415   1985    17  34  48
17  Phoenix Suns        0.439   1985    16  36  46
4   Cleveland Cavaliers 0.439   1985    15  36  46
13  Milwaukee Bucks     0.720   1985    3   59  23
3   Chicago Bulls       0.463   1985    14  38  44
16  Philadelphia 76ers  0.707   1985    4   58  24
22  Washington Wizards  0.488   1985    13  40  42
20  San Antonio Spurs   0.500   1985    12  41  41
21  Utah Jazz           0.500   1985    11  41  41

I've tried creating a function using scipy.stats.percentileofscore, but I can't quite get it to work.

itjcms18 asked Mar 12 '14 00:03



2 Answers

You can do an apply on the LgRnk column:

# just for me to normalize this, so my numbers will go from 0 to 1 in this example
In [11]: df['LgRnk'] = df.groupby('Year').LgRnk.rank()

In [12]: g = df.groupby('Year')

In [13]: g.LgRnk.apply(lambda x: x / len(x))
Out[13]:
19    1.0
0     0.9
17    0.8
4     0.7
13    0.1
3     0.6
16    0.2
22    0.5
20    0.4
21    0.3
Name: 1985, dtype: float64

The Series groupby rank (which just applies Series.rank) takes a pct argument to do just this:

In [21]: g.LgRnk.rank(pct=True)
Out[21]:
19    1.0
0     0.9
17    0.8
4     0.7
13    0.1
3     0.6
16    0.2
22    0.5
20    0.4
21    0.3
Name: 1985, dtype: float64
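
For anyone who wants to reproduce this outside the IPython session, here is a minimal, self-contained sketch. It assumes only the ten 1985 rows shown in the question; the column name LgRnkPct is made up for illustration and is not part of the original frame.

```python
import pandas as pd

# The ten 1985 rows from the question, keeping the original index labels.
df = pd.DataFrame(
    {
        "Team": [
            "Sacramento Kings", "Atlanta Hawks", "Phoenix Suns",
            "Cleveland Cavaliers", "Milwaukee Bucks", "Chicago Bulls",
            "Philadelphia 76ers", "Washington Wizards",
            "San Antonio Spurs", "Utah Jazz",
        ],
        "Year": [1985] * 10,
        "LgRnk": [18, 17, 16, 15, 3, 14, 4, 13, 12, 11],
    },
    index=[19, 0, 17, 4, 13, 3, 16, 22, 20, 21],
)

# Rank LgRnk within each Year, then scale by the group size (pct=True),
# so the largest LgRnk in a year maps to 1.0.
# "LgRnkPct" is a hypothetical column name chosen for this sketch.
df["LgRnkPct"] = df.groupby("Year")["LgRnk"].rank(pct=True)

print(df[["Team", "Year", "LgRnk", "LgRnkPct"]])
```

On the full 1985-2012 frame the same line should work unchanged, since each year is ranked separately.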

and directly on the WLPer column (although this is slightly different due to ties, since teams with the same WLPer share the average of their ranks):

In [22]: g.WLPer.rank(pct=True, ascending=False)
Out[22]:
19    1.00
0     0.90
17    0.75
4     0.75
13    0.10
3     0.60
16    0.20
22    0.50
20    0.35
21    0.35
Name: 1985, dtype: float64
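
The 0.75 and 0.35 values come from the default tie handling (method='average'). As a rough sketch of how the method argument of rank changes that, assuming the same ten WLPer values from the question (the variable names here are only for illustration):

```python
import pandas as pd

# WLPer values for the ten 1985 rows in the question.
df = pd.DataFrame(
    {
        "Year": [1985] * 10,
        "WLPer": [0.378, 0.415, 0.439, 0.439, 0.720,
                  0.463, 0.707, 0.488, 0.500, 0.500],
    },
    index=[19, 0, 17, 4, 13, 3, 16, 22, 20, 21],
)

g = df.groupby("Year")["WLPer"]

# Default: tied teams share the average of the ranks they span.
pct_average = g.rank(pct=True, ascending=False)

# method='min': tied teams all get the lowest rank in the tie.
pct_min = g.rank(pct=True, ascending=False, method="min")

print(pd.DataFrame({"average": pct_average, "min": pct_min}))
```

With method='min' the two teams tied at 0.439 should both land on 0.7 rather than 0.75, which may be closer to how a league table treats ties.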

Note: I've changed the LgRnk numbers on the first line (In [11]), so you'll get different scores on your complete frame.

Andy Hayden answered Oct 19 '22 17:10


You need to calculate rank within the group before normalizing within the group. The other answers will result in percentiles over 100%. I suggest:

df['percentile'] = df.groupby('Year')['LgRnk'].rank(pct=True)
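
To illustrate the "over 100%" point, here is a small sketch that assumes only the ten rows from the question are in the frame, so the raw LgRnk values run up to 18 while the group has just 10 rows. The names naive and ranked are made up for this example.

```python
import pandas as pd

# LgRnk for the ten 1985 rows shown in the question.
df = pd.DataFrame(
    {
        "Year": [1985] * 10,
        "LgRnk": [18, 17, 16, 15, 3, 14, 4, 13, 12, 11],
    }
)

g = df.groupby("Year")["LgRnk"]

# Dividing the raw league rank by the number of rows in the group:
# this overshoots whenever the ranks are not exactly 1..n in the group.
naive = df["LgRnk"] / g.transform("count")

# Ranking within the group first keeps the result in (0, 1].
ranked = g.rank(pct=True)

print(pd.DataFrame({"LgRnk": df["LgRnk"], "naive": naive, "ranked": ranked}))
```

Here the naive column exceeds 1.0 for the higher league ranks, while the ranked column tops out at 1.0.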
user636224 answered Oct 19 '22 19:10