Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Frontfill and backfill missing group values in pandas dataframe

Tags:

python

pandas

I have a dataframe organized by group and datetimes with a corresponding value column. You can think of the value col as the value of the asset A, B, C, etc. (the group col). I am trying to calculate the total gain/loss for each group on each date. Prior to the first entry, the corresponding value is 0. As an example, below between 12/31/2019 and 1/1/2020 0:00 the value of A goes from -43 to 19. Therefore, A's profit between the two dates should be 19-(-43) = 62.

I can achieve this result by:

df.groupby([time_index, group]).sum().groupby(group).diff().groupby(group).cumsum()

This code, however, does not handle when a group is no longer marked or dates prior to a group first being recorded. For example, group D's first entry is on 1/3/2020 0:00 at -13. That means between 12/31/2020 and 1/3/2020 0:00, group D's loss was -13-0=-13 (0 because 1/3/2020 0:00 was the first D entry).

Additionally, suppose a group is no longer recorded (i.e. it has been sold)--for instance, group A after 1/1/2020 9:00, the value should be "front-filled" and take the last recorded group A value and fill it in to later dates. That way, the gain for A between 1/3/2020 18:00 and 12/31/2019 is 123 - (-43) = 166.

To max the code above work, I am looking to back-fill (blue in picture below) and front-fill (green in picture below) a given dataframe. Groups should be backfilled as 0's before their first listing. Groups should be "front-filled" as last recorded value after their last listing.

Here pictorially is what I am trying to accomplish:

df a1

Here are sample dataframes and a simple test case I am trying to pass:

import pandas as pd
from pandas import Timestamp
from pandas.util.testing import assert_frame_equal

df = pd.DataFrame({
    'datetime': {
        0: Timestamp('2019-12-31 00:00:00'),
        1: Timestamp('2019-12-31 00:00:00'),
        2: Timestamp('2020-01-01 00:00:00'),
        3: Timestamp('2020-01-01 00:00:00'),
        4: Timestamp('2020-01-01 09:00:00'),
        5: Timestamp('2020-01-01 09:00:00'),
        6: Timestamp('2020-01-02 00:00:00'),
        7: Timestamp('2020-01-02 00:00:00'),
        8: Timestamp('2020-01-02 00:00:00'),
        9: Timestamp('2020-01-03 00:00:00'),
        10: Timestamp('2020-01-03 00:00:00'),
        11: Timestamp('2020-01-03 00:00:00'),
        12: Timestamp('2020-01-03 18:00:00'),
        13: Timestamp('2020-01-03 18:00:00'),
        14: Timestamp('2020-01-03 18:00:00')
    },
    'group': {
        0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A', 5: 'B', 6: 'A', 7: 'B',
        8: 'C', 9: 'B', 10: 'C', 11: 'D', 12: 'B', 13: 'C', 14: 'D'
    },
    'value': {
        0: -43, 1: -34, 2: 19, 3: -118, 4: -144, 5: -93, 6: 123, 7: -159,
        8: -48, 9: 63, 10: -9, 11: -13, 12: -131, 13: 3, 14: -61
    }
})

a1 = pd.DataFrame({
    'datetime': {
        0: Timestamp('2019-12-31 00:00:00'),
        1: Timestamp('2019-12-31 00:00:00'),
        2: Timestamp('2019-12-31 00:00:00'),
        3: Timestamp('2019-12-31 00:00:00'),
        4: Timestamp('2020-01-01 00:00:00'),
        5: Timestamp('2020-01-01 00:00:00'),
        6: Timestamp('2020-01-01 00:00:00'),
        7: Timestamp('2020-01-01 00:00:00'),
        8: Timestamp('2020-01-01 09:00:00'),
        9: Timestamp('2020-01-01 09:00:00'),
        10: Timestamp('2020-01-01 09:00:00'),
        11: Timestamp('2020-01-01 09:00:00'),
        12: Timestamp('2020-01-02 00:00:00'),
        13: Timestamp('2020-01-02 00:00:00'),
        14: Timestamp('2020-01-02 00:00:00'),
        15: Timestamp('2020-01-02 00:00:00'),
        16: Timestamp('2020-01-03 00:00:00'),
        17: Timestamp('2020-01-03 00:00:00'),
        18: Timestamp('2020-01-03 00:00:00'),
        19: Timestamp('2020-01-03 00:00:00'),
        20: Timestamp('2020-01-03 18:00:00'),
        21: Timestamp('2020-01-03 18:00:00'),
        22: Timestamp('2020-01-03 18:00:00'),
        23: Timestamp('2020-01-03 18:00:00')
    },
    'group': {
        0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'C', 7: 'D',
        8: 'A', 9: 'B', 10: 'C', 11: 'D', 12: 'A', 13: 'B', 14: 'C', 15: 'D',
        16: 'A', 17: 'B', 18: 'C', 19: 'D', 20: 'A', 21: 'B', 22: 'C', 23: 'D'
    },
    'value': {
        0: -43, 1: -34, 2: 0, 3: 0, 4: 19, 5: -118, 6: 0, 7: 0, 8: -144,
        9: -93, 10: 0, 11: 0, 12: 123, 13: -159, 14: -48, 15: 0, 16: 123,
        17: 63, 18: -9, 19: -13, 20: 123, 21: -131, 22: 3, 23: -61
    }
})

time_index = 'datetime'
group = 'group'
value_col = 'value'

def backfill_frontfill(df):
    raise NotImplementedError

my_result = backfill_frontfill(df)

assert_frame_equal(myresult, a1)
like image 379
cpage Avatar asked Mar 03 '23 21:03

cpage


2 Answers

  1. Use pivot to separate values into columns
  2. use interpolate for the front fill
  3. use fillna for zeros in the back fill
  4. return to original format using stack and fix columns
def backfill_frontfill(df):
    return df\
        .pivot(index="datetime", columns="group", values="value")\
        .interpolate()\
        .fillna(0)\
        .stack().reset_index().rename({0: "value"}, axis='columns')
like image 129
H4kor Avatar answered Mar 05 '23 09:03

H4kor


You may achieve your desired output by creating a multiindex to use for reindex. Next, groupby and ffill and fillna(0)

ix = pd.MultiIndex.from_product([df.datetime.unique(), df.group.unique()], 
                                names=['datetime', 'group'])
my_result = (df.set_index(['datetime', 'group']).reindex(ix).groupby(level=1).ffill().fillna(0)
               .astype(np.int64).reset_index())

Out[56]:
              datetime group  value
0  2019-12-31 00:00:00     A    -43
1  2019-12-31 00:00:00     B    -34
2  2019-12-31 00:00:00     C      0
3  2019-12-31 00:00:00     D      0
4  2020-01-01 00:00:00     A     19
5  2020-01-01 00:00:00     B   -118
6  2020-01-01 00:00:00     C      0
7  2020-01-01 00:00:00     D      0
8  2020-01-01 09:00:00     A   -144
9  2020-01-01 09:00:00     B    -93
10 2020-01-01 09:00:00     C      0
11 2020-01-01 09:00:00     D      0
12 2020-01-02 00:00:00     A    123
13 2020-01-02 00:00:00     B   -159
14 2020-01-02 00:00:00     C    -48
15 2020-01-02 00:00:00     D      0
16 2020-01-03 00:00:00     A    123
17 2020-01-03 00:00:00     B     63
18 2020-01-03 00:00:00     C     -9
19 2020-01-03 00:00:00     D    -13
20 2020-01-03 18:00:00     A    123
21 2020-01-03 18:00:00     B   -131
22 2020-01-03 18:00:00     C      3
23 2020-01-03 18:00:00     D    -61

Method 2: simply use unstack, stack together with ffill and fillna

my_result = (df.set_index(['datetime', 'group']).unstack().ffill()
               .stack(dropna=False).fillna(0).astype(np.int64).reset_index())

Out[109]:
              datetime group  value
0  2019-12-31 00:00:00     A    -43
1  2019-12-31 00:00:00     B    -34
2  2019-12-31 00:00:00     C      0
3  2019-12-31 00:00:00     D      0
4  2020-01-01 00:00:00     A     19
5  2020-01-01 00:00:00     B   -118
6  2020-01-01 00:00:00     C      0
7  2020-01-01 00:00:00     D      0
8  2020-01-01 09:00:00     A   -144
9  2020-01-01 09:00:00     B    -93
10 2020-01-01 09:00:00     C      0
11 2020-01-01 09:00:00     D      0
12 2020-01-02 00:00:00     A    123
13 2020-01-02 00:00:00     B   -159
14 2020-01-02 00:00:00     C    -48
15 2020-01-02 00:00:00     D      0
16 2020-01-03 00:00:00     A    123
17 2020-01-03 00:00:00     B     63
18 2020-01-03 00:00:00     C     -9
19 2020-01-03 00:00:00     D    -13
20 2020-01-03 18:00:00     A    123
21 2020-01-03 18:00:00     B   -131
22 2020-01-03 18:00:00     C      3
23 2020-01-03 18:00:00     D    -61

Test:

def backfill_frontfill(df):
    ix = pd.MultiIndex.from_product([df.datetime.unique(), df.group.unique()], 
                                    names=['datetime', 'group'])
    return (df.set_index(['datetime', 'group']).reindex(ix).groupby(level=1).ffill().fillna(0)
               .astype(np.int64).reset_index())


my_result = backfill_frontfill(df)
print(assert_frame_equal(my_result, a1))

Output:
None
like image 40
Andy L. Avatar answered Mar 05 '23 11:03

Andy L.