I have a dataframe organized by group and datetimes with a corresponding value column. You can think of the value column as the value of asset A, B, C, etc. (the group column). I am trying to calculate the total gain/loss for each group on each date. Prior to a group's first entry, its value is 0. As an example, below, between 12/31/2019 and 1/1/2020 0:00 the value of A goes from -43 to 19, so A's profit between the two dates should be 19 - (-43) = 62.
I can achieve this result by:
df.groupby([time_index, group]).sum().groupby(group).diff().groupby(group).cumsum()
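For example, for group A the chain above works out like this (values taken from the sample frame defined below):
# A's values in datetime order: -43, 19, -144, 123
# after .diff():                NaN, 62, -163, 267
# after .cumsum():              NaN, 62, -101, 166   <- running gain/loss for A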
This code, however, does not handle groups that stop being recorded, or dates prior to a group first being recorded. For example, group D's first entry is on 1/3/2020 0:00 at -13. That means between 12/31/2019 and 1/3/2020 0:00, group D's loss was -13 - 0 = -13 (0 because 1/3/2020 0:00 was the first D entry).
Additionally, suppose a group is no longer recorded (i.e. it has been sold) -- for instance, group A after 1/2/2020 0:00. In that case the value should be "front-filled": the last recorded group A value is carried forward into the later dates. That way, the gain for A between 12/31/2019 and 1/3/2020 18:00 is 123 - (-43) = 166.
To make the code above work, I am looking to back-fill (blue in the picture below) and front-fill (green in the picture below) a given dataframe. Groups should be back-filled with 0 before their first listing and "front-filled" with their last recorded value after their last listing.
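Concretely, the desired filled series per group would look like this (rows taken from the expected frame a1 defined below):
# A: -43, 19, -144, 123, 123, 123   (front-filled with the last recorded value after 1/2/2020 0:00)
# D:   0,  0,    0,   0, -13, -61   (back-filled with 0 before 1/3/2020 0:00)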
Here, pictorially, is what I am trying to accomplish:
[image: the filled frame, with the back-filled cells in blue and the front-filled cells in green]
Here are sample dataframes and a simple test case I am trying to pass:
import pandas as pd
from pandas import Timestamp
from pandas.testing import assert_frame_equal
df = pd.DataFrame({
    'datetime': {
        0: Timestamp('2019-12-31 00:00:00'),
        1: Timestamp('2019-12-31 00:00:00'),
        2: Timestamp('2020-01-01 00:00:00'),
        3: Timestamp('2020-01-01 00:00:00'),
        4: Timestamp('2020-01-01 09:00:00'),
        5: Timestamp('2020-01-01 09:00:00'),
        6: Timestamp('2020-01-02 00:00:00'),
        7: Timestamp('2020-01-02 00:00:00'),
        8: Timestamp('2020-01-02 00:00:00'),
        9: Timestamp('2020-01-03 00:00:00'),
        10: Timestamp('2020-01-03 00:00:00'),
        11: Timestamp('2020-01-03 00:00:00'),
        12: Timestamp('2020-01-03 18:00:00'),
        13: Timestamp('2020-01-03 18:00:00'),
        14: Timestamp('2020-01-03 18:00:00')
    },
    'group': {
        0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A', 5: 'B', 6: 'A', 7: 'B',
        8: 'C', 9: 'B', 10: 'C', 11: 'D', 12: 'B', 13: 'C', 14: 'D'
    },
    'value': {
        0: -43, 1: -34, 2: 19, 3: -118, 4: -144, 5: -93, 6: 123, 7: -159,
        8: -48, 9: 63, 10: -9, 11: -13, 12: -131, 13: 3, 14: -61
    }
})
a1 = pd.DataFrame({
    'datetime': {
        0: Timestamp('2019-12-31 00:00:00'),
        1: Timestamp('2019-12-31 00:00:00'),
        2: Timestamp('2019-12-31 00:00:00'),
        3: Timestamp('2019-12-31 00:00:00'),
        4: Timestamp('2020-01-01 00:00:00'),
        5: Timestamp('2020-01-01 00:00:00'),
        6: Timestamp('2020-01-01 00:00:00'),
        7: Timestamp('2020-01-01 00:00:00'),
        8: Timestamp('2020-01-01 09:00:00'),
        9: Timestamp('2020-01-01 09:00:00'),
        10: Timestamp('2020-01-01 09:00:00'),
        11: Timestamp('2020-01-01 09:00:00'),
        12: Timestamp('2020-01-02 00:00:00'),
        13: Timestamp('2020-01-02 00:00:00'),
        14: Timestamp('2020-01-02 00:00:00'),
        15: Timestamp('2020-01-02 00:00:00'),
        16: Timestamp('2020-01-03 00:00:00'),
        17: Timestamp('2020-01-03 00:00:00'),
        18: Timestamp('2020-01-03 00:00:00'),
        19: Timestamp('2020-01-03 00:00:00'),
        20: Timestamp('2020-01-03 18:00:00'),
        21: Timestamp('2020-01-03 18:00:00'),
        22: Timestamp('2020-01-03 18:00:00'),
        23: Timestamp('2020-01-03 18:00:00')
    },
    'group': {
        0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'C', 7: 'D',
        8: 'A', 9: 'B', 10: 'C', 11: 'D', 12: 'A', 13: 'B', 14: 'C', 15: 'D',
        16: 'A', 17: 'B', 18: 'C', 19: 'D', 20: 'A', 21: 'B', 22: 'C', 23: 'D'
    },
    'value': {
        0: -43, 1: -34, 2: 0, 3: 0, 4: 19, 5: -118, 6: 0, 7: 0, 8: -144,
        9: -93, 10: 0, 11: 0, 12: 123, 13: -159, 14: -48, 15: 0, 16: 123,
        17: 63, 18: -9, 19: -13, 20: 123, 21: -131, 22: 3, 23: -61
    }
})
time_index = 'datetime'
group = 'group'
value_col = 'value'
def backfill_frontfill(df):
    raise NotImplementedError
my_result = backfill_frontfill(df)
assert_frame_equal(my_result, a1)
Pivot to separate the values into columns, interpolate for the front fill, fillna for the zeros in the back fill, then stack and fix the columns:
def backfill_frontfill(df):
    return (
        df.pivot(index="datetime", columns="group", values="value")
          .interpolate()                        # front fill: NaNs after the last value are padded with that value
          .fillna(0)                            # back fill: the remaining (leading) NaNs become 0
          .stack()
          .reset_index()
          .rename({0: "value"}, axis="columns")
          .astype({"value": "int64"})           # pivot produced floats; cast back so the dtype matches a1
    )
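The reason .interpolate() gives the front fill is that, with the default linear method, NaNs after the last valid value are padded with that value, while NaNs before the first valid value are left alone for the later .fillna(0). A minimal sketch of that behaviour, assuming default arguments:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, -13.0, -61.0])    # shaped like group D: not yet listed, then recorded
print(s.interpolate().fillna(0).tolist())        # [0.0, 0.0, -13.0, -61.0] -- leading NaNs become the 0 back fill

t = pd.Series([-43.0, 19.0, np.nan, np.nan])     # shaped like a group that stops being recorded
print(t.interpolate().tolist())                  # [-43.0, 19.0, 19.0, 19.0] -- trailing NaNs padded (front fill)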
You may achieve your desired output by creating a MultiIndex to use with reindex, then groupby, ffill and fillna(0):
import numpy as np

ix = pd.MultiIndex.from_product([df.datetime.unique(), df.group.unique()],
                                names=['datetime', 'group'])
my_result = (df.set_index(['datetime', 'group']).reindex(ix)
               .groupby(level=1).ffill().fillna(0)
               .astype(np.int64).reset_index())
Out[56]:
              datetime group  value
0  2019-12-31 00:00:00     A    -43
1  2019-12-31 00:00:00     B    -34
2  2019-12-31 00:00:00     C      0
3  2019-12-31 00:00:00     D      0
4  2020-01-01 00:00:00     A     19
5  2020-01-01 00:00:00     B   -118
6  2020-01-01 00:00:00     C      0
7  2020-01-01 00:00:00     D      0
8  2020-01-01 09:00:00     A   -144
9  2020-01-01 09:00:00     B    -93
10 2020-01-01 09:00:00     C      0
11 2020-01-01 09:00:00     D      0
12 2020-01-02 00:00:00     A    123
13 2020-01-02 00:00:00     B   -159
14 2020-01-02 00:00:00     C    -48
15 2020-01-02 00:00:00     D      0
16 2020-01-03 00:00:00     A    123
17 2020-01-03 00:00:00     B     63
18 2020-01-03 00:00:00     C     -9
19 2020-01-03 00:00:00     D    -13
20 2020-01-03 18:00:00     A    123
21 2020-01-03 18:00:00     B   -131
22 2020-01-03 18:00:00     C      3
23 2020-01-03 18:00:00     D    -61
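For reference, MultiIndex.from_product builds the full Cartesian product of the datetimes and groups, so reindex materialises the missing (datetime, group) pairs as NaN rows before the ffill/fillna steps. A small illustration with a made-up two-date, two-group index:
ix = pd.MultiIndex.from_product([['2019-12-31', '2020-01-01'], ['A', 'B']],
                                names=['datetime', 'group'])
# MultiIndex([('2019-12-31', 'A'),
#             ('2019-12-31', 'B'),
#             ('2020-01-01', 'A'),
#             ('2020-01-01', 'B')],
#            names=['datetime', 'group'])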
Method 2: simply use unstack and stack, together with ffill and fillna:
my_result = (df.set_index(['datetime', 'group']).unstack().ffill()
               .stack(dropna=False).fillna(0).astype(np.int64).reset_index())
Out[109]:
              datetime group  value
0  2019-12-31 00:00:00     A    -43
1  2019-12-31 00:00:00     B    -34
2  2019-12-31 00:00:00     C      0
3  2019-12-31 00:00:00     D      0
4  2020-01-01 00:00:00     A     19
5  2020-01-01 00:00:00     B   -118
6  2020-01-01 00:00:00     C      0
7  2020-01-01 00:00:00     D      0
8  2020-01-01 09:00:00     A   -144
9  2020-01-01 09:00:00     B    -93
10 2020-01-01 09:00:00     C      0
11 2020-01-01 09:00:00     D      0
12 2020-01-02 00:00:00     A    123
13 2020-01-02 00:00:00     B   -159
14 2020-01-02 00:00:00     C    -48
15 2020-01-02 00:00:00     D      0
16 2020-01-03 00:00:00     A    123
17 2020-01-03 00:00:00     B     63
18 2020-01-03 00:00:00     C     -9
19 2020-01-03 00:00:00     D    -13
20 2020-01-03 18:00:00     A    123
21 2020-01-03 18:00:00     B   -131
22 2020-01-03 18:00:00     C      3
23 2020-01-03 18:00:00     D    -61
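One detail here: stack(dropna=False) keeps the rows that are still NaN after the ffill (groups before their first listing), so fillna(0) can back-fill them; with the default dropna=True those rows would simply be dropped. A quick check against the sample df:
wide = df.set_index(['datetime', 'group']).unstack().ffill()
print(wide.stack().shape)              # leading-NaN rows for C and D are dropped
print(wide.stack(dropna=False).shape)  # (24, 1): every (datetime, group) pair kept for fillna(0)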
Test:
def backfill_frontfill(df):
    ix = pd.MultiIndex.from_product([df.datetime.unique(), df.group.unique()], 
                                    names=['datetime', 'group'])
    return (df.set_index(['datetime', 'group']).reindex(ix).groupby(level=1).ffill().fillna(0)
               .astype(np.int64).reset_index())
my_result = backfill_frontfill(df)
print(assert_frame_equal(my_result, a1))
Output (assert_frame_equal returns None when the two frames are equal):
None
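With backfill_frontfill in place, the filled frame can be fed back into the gain/loss pipeline from the question (a sketch reusing the column-name variables defined above):
filled = backfill_frontfill(df)
gains = (filled.groupby([time_index, group]).sum()
               .groupby(group).diff()
               .groupby(group).cumsum())
# e.g. the cumulative gain for A at 2020-01-03 18:00 comes out to 123 - (-43) = 166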