I have a typical "panel data" (in econometric terms, not pandas panel object). The dataframe has a Date
column and a ID
column, and other columns that contain certain values. For each Date, I need to cross-sectionally rank across IDs based on V1 into 10 groups (deciles) and create a new column called rank_col
(take values 1 to 10) to identify rank. Then pool all the rank1, rank2,...rank10 across time to get some stats like mean,std.
This can be accomplished easily in SAS by following code and it also illustrate what my purpose:
proc sort data=df;
by Date;
proc rank data=df out=df_ranked groups=10;
var V1;
ranks rank_col;
by Date;
run;
df_ranked
is identical to df
except that it has more column called rank_col
which containes the rank group that each row belongs to.
Sorry I don't have a sample data to show the structure, a real long data is needed to illustrate. But the SAS code shows exactly what I am after.
Thanks for your help!
It could be simpler without needing separate foo
In [782]: df.groupby('Date')['V1'].transform(lambda x: pd.qcut(x, 10, labels=False))
Out[782]:
0 6
1 4
2 3
3 8
4 9
5 1
6 0
7 0
8 1
9 7
10 8
11 5
12 2
13 4
14 9
Name: V1, dtype: int64
Assign to column
In [783]: df['ranks'] = df.groupby('Date')['V1'].transform(pd.qcut, 10, labels=False)
In [784]: df
Out[784]:
Date id V1 ranks
0 2013-01-01 1 10 6
1 2013-01-01 2 8 4
2 2013-01-01 3 6 3
3 2013-01-01 4 11 8
4 2013-01-01 5 13 9
5 2013-01-01 6 4 1
6 2013-01-01 7 2 0
7 2013-02-01 1 1 0
8 2013-02-01 2 3 1
9 2013-02-01 3 9 7
10 2013-02-01 4 11 8
11 2013-02-01 5 7 5
12 2013-02-01 6 4 2
13 2013-02-01 7 6 4
14 2013-02-01 8 14 9
Details
In [786]: df
Out[786]:
Date id V1
0 2013-01-01 1 10
1 2013-01-01 2 8
2 2013-01-01 3 6
3 2013-01-01 4 11
4 2013-01-01 5 13
5 2013-01-01 6 4
6 2013-01-01 7 2
7 2013-02-01 1 1
8 2013-02-01 2 3
9 2013-02-01 3 9
10 2013-02-01 4 11
11 2013-02-01 5 7
12 2013-02-01 6 4
13 2013-02-01 7 6
14 2013-02-01 8 14
A way I just find figured out:
def grouping(data):
dec=pd.qcut(data['V1'],10,labels=False)
data['ranks']=dec
return data
df_ranked=df.groupby('Date').apply(grouping)
This assumes dec
preserve the right position for each row.
Please post if you have a better way, or point out any mistakes in this method.
Thanks!
Edit: You can just return a single new ranks
column if you do something like the following:
>>> df
Date id V1
0 2013-01-01 1 10
1 2013-01-01 2 8
2 2013-01-01 3 6
3 2013-01-01 4 11
4 2013-01-01 5 13
5 2013-01-01 6 4
6 2013-01-01 7 2
7 2013-02-01 1 1
8 2013-02-01 2 3
9 2013-02-01 3 9
10 2013-02-01 4 11
11 2013-02-01 5 7
12 2013-02-01 6 4
13 2013-02-01 7 6
14 2013-02-01 8 14
>>> foo = lambda x: pd.Series(pd.qcut(x,10,labels=False),index=x.index)
>>> df['ranks'] = df.groupby('Date')['V1'].apply(foo)
>>> df
Date id V1 ranks
0 2013-01-01 1 10 6
1 2013-01-01 2 8 4
2 2013-01-01 3 6 3
3 2013-01-01 4 11 8
4 2013-01-01 5 13 9
5 2013-01-01 6 4 1
6 2013-01-01 7 2 0
7 2013-02-01 1 1 0
8 2013-02-01 2 3 1
9 2013-02-01 3 9 7
10 2013-02-01 4 11 8
11 2013-02-01 5 7 5
12 2013-02-01 6 4 2
13 2013-02-01 7 6 4
14 2013-02-01 8 14 9
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With