Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I assign values from a dataframe to deciles created in another dataframe?

I am working with two dataframes:

  • df contains a column be/me for stocks for a 20-year period (on a monthly basis).
  • df2, a subset of df (with only certain stocks, only for June) contains the column decile, created via the pd.qcut() method for every year in the 20-year period based on an altered version of df's be/me.

Considering the deciles that I created in df2, I wonder if it's possible to rank df's be/me based on df2's decile column. In other words, I wonder if it's possible to assign df's be/me values to the deciles created in df2.

Please see dataframes below for a better understanding of the issue:

             
df

date         stock_id      be/me    
2000-01-31    1004.0        0.3      
2000-02-29    1004.0        0.7 
2000-03-31    1004.0        1.2 
2000-04-30    1004.0        2.3 
2000-05-31    1004.0        0.9 
...            ...          ...
2020-12-31    3900.0        1.7
2020-12-31    3900.0        2.8
2020-12-31    3900.0        3.0
2020-12-31    3900.0        0.2
2020-12-31    3900.0        2.1

1218855 rows × 3 columns


df2['deciles'] = df2.groupby('date')['be/me'].transform(lambda x: pd.qcut(x, 10, labels=False, duplicates = 'drop'))
df2

date        stock_id      be/me  deciles                    
2000-06-30  2061.0      0.653684    5
2000-06-30  4383.0      0.053660    2
2000-06-30  13561.0     0.092509    2
2000-06-30  4065.0      1.342187    6
2000-06-30  2731.0      0.235582    3
  ...         ...          ...     ...
2020-06-30  7022.0      0.072534    2
2020-06-30  30990.0     1.071096    6
2020-06-30  22867.0     1.627155    6
2020-06-30  15247.0     0.051387    2
2020-06-30  61574.0     1.684690    6

24095 rows × 4 columns

Note: date is of type datetime and, for each date, there are multiple stocks (stock_id). Thank you so much for your time.

EDIT

What I want to do is to check in which df2-created decile the original be/me values (from the original dataframe df) fit. The expected output should be a new column in df with df2-created deciles attributed to each and every be/me value in df. Please let me know if there is any additional clarification necessary.

I created a function that loops through the deciles to fetch the maximum decile value for every date in df2. Not sure if I am heading in the right direction since the output is an array with no date... take a look below:

In: def attribution(deciles,dates):
    deciles = df2['deciles'].unique()
    dates = df2.index.unique()
    body_max = []
    body_min = []
    for x in deciles:
        for y in dates:
            body_max.append(df2[df2['deciles'] == x].loc[y]['be/me'].max())
            body_min.append(df2[df2['deciles'] == x].loc[y]['be/me'].min())
            
    return body_max, body_min

In: attribution(deciles, dates)
Out: [0.9343106070197438,
 1.2747264875802489,
 1.9700461181925901,
 0.7888946814157697,
 0.9304702071896337,
 0.9651423313922733,
 0.7238677612487585,
 1.0358317574924074,
 ...]
like image 805
Gonçalo Franchini Avatar asked Nov 05 '22 23:11

Gonçalo Franchini


1 Answers

To be clear: you want to know for each be/me value in df which decile it would have fallen into if that value had been in df2? I see two cases:

  1. If df2 covers the whole month of June (as you wrote), I am afraid there is no answer to that question: each day in the month will have decile bins with different edges (since you are doing a groupby('date') on df2). The same be/me value in df could belong to different deciles in df2 depending on the day in June you consider.

  2. If df2 actually covers only one day in June (as your example above seems to indicate: 2020-06-30), then you have one well defined set of decile bins.

In case 2), you could do that:

df
        date  stock_od  be/me
0 2000-01-31    1004.0    0.3
1 2000-02-29    1004.0    0.7
2 2000-03-31    1004.0    1.2
3 2000-04-30    1004.0    2.3
4 2000-05-31    1004.0    0.9
5 2020-12-31    3900.0    1.7
6 2020-12-31    3900.0    2.8
7 2020-12-31    3900.0    3.0
8 2020-12-31    3900.0    0.2
9 2020-12-31    3900.0    2.1

df2
        date  stock_id     be/me
0 2000-06-30    2061.0  0.653684
1 2000-06-30    4383.0  0.053660
2 2000-06-30   13561.0  0.092509
3 2000-06-30    4065.0  1.342187
4 2000-06-30    2731.0  0.235582
5 2000-06-30    7022.0  0.072534
6 2000-06-30   30990.0  1.071096
7 2000-06-30   22867.0  1.627155
8 2000-06-30   15247.0  0.051387
9 2000-06-30   61574.0  1.684690

deciles = pd.qcut(df2['be/me'], 10, labels=False, duplicates = 'drop', retbins=True)

deciles
(0    5
 1    1
 2    3
 3    7
 4    4
 5    2
 6    6
 7    8
 8    0
 9    9
 Name: be/me, dtype: int64,
 array([0.051387 , 0.0534327, 0.0687592, 0.0865165, 0.1783528, 0.444633 ,
        0.8206488, 1.1524233, 1.3991806, 1.6329085, 1.68469  ]))

df.loc[:,'deciles'] = np.digitize(df['be/me'],deciles[1])-1

df
        date  stock_od  be/me  deciles
0 2000-01-31    1004.0    0.3        4
1 2000-02-29    1004.0    0.7        5
2 2000-03-31    1004.0    1.2        7
3 2000-04-30    1004.0    2.3       10
4 2000-05-31    1004.0    0.9        6
5 2020-12-31    3900.0    1.7       10
6 2020-12-31    3900.0    2.8       10
7 2020-12-31    3900.0    3.0       10
8 2020-12-31    3900.0    0.2        4
9 2020-12-31    3900.0    2.1       10

With the argument retbins=True to pd.qcut() you get a tuple where the second item is an array containing the bin (here decile) edges. You then apply the very handy numpy function np.digitize() (https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) to the df column be/me, which gives you for each value which bin (decile) it belongs to.

Note: I added a -1 because the numpy function np.digitize() returns the next decile compared to what pd.qcut() delivered. Probably because np.digitize() reserves deciles 0 and 10 for values that fall outside the lower and higher bin edges, respectively.

like image 182
k07b470n Avatar answered Nov 12 '22 12:11

k07b470n