Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ranks within groupby in pandas

I have a typical "panel data" (in econometric terms, not pandas panel object). The dataframe has a Date column and a ID column, and other columns that contain certain values. For each Date, I need to cross-sectionally rank across IDs based on V1 into 10 groups (deciles) and create a new column called rank_col (take values 1 to 10) to identify rank. Then pool all the rank1, rank2,...rank10 across time to get some stats like mean,std.

This can be accomplished easily in SAS by following code and it also illustrate what my purpose:

proc sort data=df;
    by Date;
proc rank data=df out=df_ranked groups=10;
    var V1;
    ranks rank_col;
    by Date;
run;

df_ranked is identical to df except that it has more column called rank_col which containes the rank group that each row belongs to.

Sorry I don't have a sample data to show the structure, a real long data is needed to illustrate. But the SAS code shows exactly what I am after.

Thanks for your help!

like image 359
user3576212 Avatar asked Jun 01 '14 02:06

user3576212


2 Answers

It could be simpler without needing separate foo

In [782]: df.groupby('Date')['V1'].transform(lambda x: pd.qcut(x, 10, labels=False))
Out[782]:
0     6
1     4
2     3
3     8
4     9
5     1
6     0
7     0
8     1
9     7
10    8
11    5
12    2
13    4
14    9
Name: V1, dtype: int64

Assign to column

In [783]: df['ranks'] = df.groupby('Date')['V1'].transform(pd.qcut, 10, labels=False)

In [784]: df
Out[784]:
          Date  id  V1  ranks
0   2013-01-01   1  10      6
1   2013-01-01   2   8      4
2   2013-01-01   3   6      3
3   2013-01-01   4  11      8
4   2013-01-01   5  13      9
5   2013-01-01   6   4      1
6   2013-01-01   7   2      0
7   2013-02-01   1   1      0
8   2013-02-01   2   3      1
9   2013-02-01   3   9      7
10  2013-02-01   4  11      8
11  2013-02-01   5   7      5
12  2013-02-01   6   4      2
13  2013-02-01   7   6      4
14  2013-02-01   8  14      9

Details

In [786]: df
Out[786]:
          Date  id  V1
0   2013-01-01   1  10
1   2013-01-01   2   8
2   2013-01-01   3   6
3   2013-01-01   4  11
4   2013-01-01   5  13
5   2013-01-01   6   4
6   2013-01-01   7   2
7   2013-02-01   1   1
8   2013-02-01   2   3
9   2013-02-01   3   9
10  2013-02-01   4  11
11  2013-02-01   5   7
12  2013-02-01   6   4
13  2013-02-01   7   6
14  2013-02-01   8  14
like image 133
Zero Avatar answered Sep 28 '22 13:09

Zero


A way I just find figured out:

def grouping(data):
    dec=pd.qcut(data['V1'],10,labels=False)
    data['ranks']=dec
    return data
df_ranked=df.groupby('Date').apply(grouping)

This assumes dec preserve the right position for each row.

Please post if you have a better way, or point out any mistakes in this method.

Thanks!

Edit: You can just return a single new ranks column if you do something like the following:

>>> df


         Date  id  V1
0  2013-01-01   1  10
1  2013-01-01   2   8
2  2013-01-01   3   6
3  2013-01-01   4  11
4  2013-01-01   5  13
5  2013-01-01   6   4
6  2013-01-01   7   2
7  2013-02-01   1   1
8  2013-02-01   2   3
9  2013-02-01   3   9
10 2013-02-01   4  11
11 2013-02-01   5   7
12 2013-02-01   6   4
13 2013-02-01   7   6
14 2013-02-01   8  14

>>> foo = lambda x: pd.Series(pd.qcut(x,10,labels=False),index=x.index)
>>> df['ranks'] = df.groupby('Date')['V1'].apply(foo)
>>> df

         Date  id  V1  ranks
0  2013-01-01   1  10      6
1  2013-01-01   2   8      4
2  2013-01-01   3   6      3
3  2013-01-01   4  11      8
4  2013-01-01   5  13      9
5  2013-01-01   6   4      1
6  2013-01-01   7   2      0
7  2013-02-01   1   1      0
8  2013-02-01   2   3      1
9  2013-02-01   3   9      7
10 2013-02-01   4  11      8
11 2013-02-01   5   7      5
12 2013-02-01   6   4      2
13 2013-02-01   7   6      4
14 2013-02-01   8  14      9
like image 33
user3576212 Avatar answered Sep 28 '22 12:09

user3576212