I am working with two dataframes:
df
contains a column be/me
for stocks for a 20-year period (on a monthly basis).df2
, a subset of df
(with only certain stocks, only for June) contains the column decile
, created via the pd.qcut()
method for every year in the 20-year period based on an altered version of df
's be/me
.Considering the deciles that I created in df2
, I wonder if it's possible to rank df
's be/me
based on df2
's decile
column. In other words, I wonder if it's possible to assign df
's be/me
values to the deciles created in df2
.
Please see dataframes below for a better understanding of the issue:
df
date stock_id be/me
2000-01-31 1004.0 0.3
2000-02-29 1004.0 0.7
2000-03-31 1004.0 1.2
2000-04-30 1004.0 2.3
2000-05-31 1004.0 0.9
... ... ...
2020-12-31 3900.0 1.7
2020-12-31 3900.0 2.8
2020-12-31 3900.0 3.0
2020-12-31 3900.0 0.2
2020-12-31 3900.0 2.1
1218855 rows × 3 columns
df2['deciles'] = df2.groupby('date')['be/me'].transform(lambda x: pd.qcut(x, 10, labels=False, duplicates = 'drop'))
df2
date stock_id be/me deciles
2000-06-30 2061.0 0.653684 5
2000-06-30 4383.0 0.053660 2
2000-06-30 13561.0 0.092509 2
2000-06-30 4065.0 1.342187 6
2000-06-30 2731.0 0.235582 3
... ... ... ...
2020-06-30 7022.0 0.072534 2
2020-06-30 30990.0 1.071096 6
2020-06-30 22867.0 1.627155 6
2020-06-30 15247.0 0.051387 2
2020-06-30 61574.0 1.684690 6
24095 rows × 4 columns
Note: date
is of type datetime
and, for each date, there are multiple stocks (stock_id
).
Thank you so much for your time.
What I want to do is to check in which df2
-created decile the original be/me
values (from the original dataframe df
) fit. The expected output should be a new column in df
with df2
-created deciles attributed to each and every be/me
value in df
.
Please let me know if there is any additional clarification necessary.
I created a function that loops through the deciles to fetch the maximum decile value for every date
in df2
. Not sure if I am heading in the right direction since the output is an array with no date
... take a look below:
In: def attribution(deciles,dates):
deciles = df2['deciles'].unique()
dates = df2.index.unique()
body_max = []
body_min = []
for x in deciles:
for y in dates:
body_max.append(df2[df2['deciles'] == x].loc[y]['be/me'].max())
body_min.append(df2[df2['deciles'] == x].loc[y]['be/me'].min())
return body_max, body_min
In: attribution(deciles, dates)
Out: [0.9343106070197438,
1.2747264875802489,
1.9700461181925901,
0.7888946814157697,
0.9304702071896337,
0.9651423313922733,
0.7238677612487585,
1.0358317574924074,
...]
To be clear: you want to know for each be/me
value in df
which decile it would have fallen into if that value had been in df2
? I see two cases:
If df2
covers the whole month of June (as you wrote), I am afraid there is no answer to that question: each day in the month will have decile bins with different edges (since you are doing a groupby('date')
on df2
). The same be/me
value in df
could belong to different deciles in df2
depending on the day in June you consider.
If df2
actually covers only one day in June (as your example above seems to indicate: 2020-06-30
), then you have one well defined set of decile bins.
In case 2), you could do that:
df
date stock_od be/me
0 2000-01-31 1004.0 0.3
1 2000-02-29 1004.0 0.7
2 2000-03-31 1004.0 1.2
3 2000-04-30 1004.0 2.3
4 2000-05-31 1004.0 0.9
5 2020-12-31 3900.0 1.7
6 2020-12-31 3900.0 2.8
7 2020-12-31 3900.0 3.0
8 2020-12-31 3900.0 0.2
9 2020-12-31 3900.0 2.1
df2
date stock_id be/me
0 2000-06-30 2061.0 0.653684
1 2000-06-30 4383.0 0.053660
2 2000-06-30 13561.0 0.092509
3 2000-06-30 4065.0 1.342187
4 2000-06-30 2731.0 0.235582
5 2000-06-30 7022.0 0.072534
6 2000-06-30 30990.0 1.071096
7 2000-06-30 22867.0 1.627155
8 2000-06-30 15247.0 0.051387
9 2000-06-30 61574.0 1.684690
deciles = pd.qcut(df2['be/me'], 10, labels=False, duplicates = 'drop', retbins=True)
deciles
(0 5
1 1
2 3
3 7
4 4
5 2
6 6
7 8
8 0
9 9
Name: be/me, dtype: int64,
array([0.051387 , 0.0534327, 0.0687592, 0.0865165, 0.1783528, 0.444633 ,
0.8206488, 1.1524233, 1.3991806, 1.6329085, 1.68469 ]))
df.loc[:,'deciles'] = np.digitize(df['be/me'],deciles[1])-1
df
date stock_od be/me deciles
0 2000-01-31 1004.0 0.3 4
1 2000-02-29 1004.0 0.7 5
2 2000-03-31 1004.0 1.2 7
3 2000-04-30 1004.0 2.3 10
4 2000-05-31 1004.0 0.9 6
5 2020-12-31 3900.0 1.7 10
6 2020-12-31 3900.0 2.8 10
7 2020-12-31 3900.0 3.0 10
8 2020-12-31 3900.0 0.2 4
9 2020-12-31 3900.0 2.1 10
With the argument retbins=True
to pd.qcut()
you get a tuple where the second item is an array containing the bin (here decile) edges.
You then apply the very handy numpy function np.digitize()
(https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) to the df
column be/me
, which gives you for each value which bin (decile) it belongs to.
Note: I added a -1
because the numpy function np.digitize()
returns the next decile compared to what pd.qcut()
delivered. Probably because np.digitize()
reserves deciles 0 and 10 for values that fall outside the lower and higher bin edges, respectively.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With