I have a dataframe organized by group and datetimes with a corresponding value column. You can think of the value col as the value of the asset A, B, C, etc. (the group col). I am trying to calculate the total gain/loss for each group on each date. Prior to the first entry, the corresponding value is 0. As an example, below between 12/31/2019 and 1/1/2020 0:00 the value of A goes from -43 to 19. Therefore, A's profit between the two dates should be 19-(-43) = 62.
I can achieve this result by:
df.groupby([time_index, group]).sum().groupby(group).diff().groupby(group).cumsum()
This code, however, does not handle when a group is no longer marked or dates prior to a group first being recorded. For example, group D's first entry is on 1/3/2020 0:00 at -13. That means between 12/31/2020 and 1/3/2020 0:00, group D's loss was -13-0=-13 (0 because 1/3/2020 0:00 was the first D entry).
Additionally, suppose a group is no longer recorded (i.e. it has been sold)--for instance, group A after 1/1/2020 9:00, the value should be "front-filled" and take the last recorded group A value and fill it in to later dates. That way, the gain for A between 1/3/2020 18:00 and 12/31/2019 is 123 - (-43) = 166.
To max the code above work, I am looking to back-fill (blue in picture below) and front-fill (green in picture below) a given dataframe. Groups should be backfilled as 0's before their first listing. Groups should be "front-filled" as last recorded value after their last listing.
Here pictorially is what I am trying to accomplish:
Here are sample dataframes and a simple test case I am trying to pass:
import pandas as pd
from pandas import Timestamp
from pandas.util.testing import assert_frame_equal
df = pd.DataFrame({
'datetime': {
0: Timestamp('2019-12-31 00:00:00'),
1: Timestamp('2019-12-31 00:00:00'),
2: Timestamp('2020-01-01 00:00:00'),
3: Timestamp('2020-01-01 00:00:00'),
4: Timestamp('2020-01-01 09:00:00'),
5: Timestamp('2020-01-01 09:00:00'),
6: Timestamp('2020-01-02 00:00:00'),
7: Timestamp('2020-01-02 00:00:00'),
8: Timestamp('2020-01-02 00:00:00'),
9: Timestamp('2020-01-03 00:00:00'),
10: Timestamp('2020-01-03 00:00:00'),
11: Timestamp('2020-01-03 00:00:00'),
12: Timestamp('2020-01-03 18:00:00'),
13: Timestamp('2020-01-03 18:00:00'),
14: Timestamp('2020-01-03 18:00:00')
},
'group': {
0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A', 5: 'B', 6: 'A', 7: 'B',
8: 'C', 9: 'B', 10: 'C', 11: 'D', 12: 'B', 13: 'C', 14: 'D'
},
'value': {
0: -43, 1: -34, 2: 19, 3: -118, 4: -144, 5: -93, 6: 123, 7: -159,
8: -48, 9: 63, 10: -9, 11: -13, 12: -131, 13: 3, 14: -61
}
})
a1 = pd.DataFrame({
'datetime': {
0: Timestamp('2019-12-31 00:00:00'),
1: Timestamp('2019-12-31 00:00:00'),
2: Timestamp('2019-12-31 00:00:00'),
3: Timestamp('2019-12-31 00:00:00'),
4: Timestamp('2020-01-01 00:00:00'),
5: Timestamp('2020-01-01 00:00:00'),
6: Timestamp('2020-01-01 00:00:00'),
7: Timestamp('2020-01-01 00:00:00'),
8: Timestamp('2020-01-01 09:00:00'),
9: Timestamp('2020-01-01 09:00:00'),
10: Timestamp('2020-01-01 09:00:00'),
11: Timestamp('2020-01-01 09:00:00'),
12: Timestamp('2020-01-02 00:00:00'),
13: Timestamp('2020-01-02 00:00:00'),
14: Timestamp('2020-01-02 00:00:00'),
15: Timestamp('2020-01-02 00:00:00'),
16: Timestamp('2020-01-03 00:00:00'),
17: Timestamp('2020-01-03 00:00:00'),
18: Timestamp('2020-01-03 00:00:00'),
19: Timestamp('2020-01-03 00:00:00'),
20: Timestamp('2020-01-03 18:00:00'),
21: Timestamp('2020-01-03 18:00:00'),
22: Timestamp('2020-01-03 18:00:00'),
23: Timestamp('2020-01-03 18:00:00')
},
'group': {
0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'A', 5: 'B', 6: 'C', 7: 'D',
8: 'A', 9: 'B', 10: 'C', 11: 'D', 12: 'A', 13: 'B', 14: 'C', 15: 'D',
16: 'A', 17: 'B', 18: 'C', 19: 'D', 20: 'A', 21: 'B', 22: 'C', 23: 'D'
},
'value': {
0: -43, 1: -34, 2: 0, 3: 0, 4: 19, 5: -118, 6: 0, 7: 0, 8: -144,
9: -93, 10: 0, 11: 0, 12: 123, 13: -159, 14: -48, 15: 0, 16: 123,
17: 63, 18: -9, 19: -13, 20: 123, 21: -131, 22: 3, 23: -61
}
})
time_index = 'datetime'
group = 'group'
value_col = 'value'
def backfill_frontfill(df):
raise NotImplementedError
my_result = backfill_frontfill(df)
assert_frame_equal(myresult, a1)
pivot
to separate values into columnsinterpolate
for the front fillfillna
for zeros in the back fillstack
and fix columnsdef backfill_frontfill(df):
return df\
.pivot(index="datetime", columns="group", values="value")\
.interpolate()\
.fillna(0)\
.stack().reset_index().rename({0: "value"}, axis='columns')
You may achieve your desired output by creating a multiindex to use for reindex
. Next, groupby and ffill
and fillna(0)
ix = pd.MultiIndex.from_product([df.datetime.unique(), df.group.unique()],
names=['datetime', 'group'])
my_result = (df.set_index(['datetime', 'group']).reindex(ix).groupby(level=1).ffill().fillna(0)
.astype(np.int64).reset_index())
Out[56]:
datetime group value
0 2019-12-31 00:00:00 A -43
1 2019-12-31 00:00:00 B -34
2 2019-12-31 00:00:00 C 0
3 2019-12-31 00:00:00 D 0
4 2020-01-01 00:00:00 A 19
5 2020-01-01 00:00:00 B -118
6 2020-01-01 00:00:00 C 0
7 2020-01-01 00:00:00 D 0
8 2020-01-01 09:00:00 A -144
9 2020-01-01 09:00:00 B -93
10 2020-01-01 09:00:00 C 0
11 2020-01-01 09:00:00 D 0
12 2020-01-02 00:00:00 A 123
13 2020-01-02 00:00:00 B -159
14 2020-01-02 00:00:00 C -48
15 2020-01-02 00:00:00 D 0
16 2020-01-03 00:00:00 A 123
17 2020-01-03 00:00:00 B 63
18 2020-01-03 00:00:00 C -9
19 2020-01-03 00:00:00 D -13
20 2020-01-03 18:00:00 A 123
21 2020-01-03 18:00:00 B -131
22 2020-01-03 18:00:00 C 3
23 2020-01-03 18:00:00 D -61
Method 2: simply use unstack
, stack
together with ffill
and fillna
my_result = (df.set_index(['datetime', 'group']).unstack().ffill()
.stack(dropna=False).fillna(0).astype(np.int64).reset_index())
Out[109]:
datetime group value
0 2019-12-31 00:00:00 A -43
1 2019-12-31 00:00:00 B -34
2 2019-12-31 00:00:00 C 0
3 2019-12-31 00:00:00 D 0
4 2020-01-01 00:00:00 A 19
5 2020-01-01 00:00:00 B -118
6 2020-01-01 00:00:00 C 0
7 2020-01-01 00:00:00 D 0
8 2020-01-01 09:00:00 A -144
9 2020-01-01 09:00:00 B -93
10 2020-01-01 09:00:00 C 0
11 2020-01-01 09:00:00 D 0
12 2020-01-02 00:00:00 A 123
13 2020-01-02 00:00:00 B -159
14 2020-01-02 00:00:00 C -48
15 2020-01-02 00:00:00 D 0
16 2020-01-03 00:00:00 A 123
17 2020-01-03 00:00:00 B 63
18 2020-01-03 00:00:00 C -9
19 2020-01-03 00:00:00 D -13
20 2020-01-03 18:00:00 A 123
21 2020-01-03 18:00:00 B -131
22 2020-01-03 18:00:00 C 3
23 2020-01-03 18:00:00 D -61
Test:
def backfill_frontfill(df):
ix = pd.MultiIndex.from_product([df.datetime.unique(), df.group.unique()],
names=['datetime', 'group'])
return (df.set_index(['datetime', 'group']).reindex(ix).groupby(level=1).ffill().fillna(0)
.astype(np.int64).reset_index())
my_result = backfill_frontfill(df)
print(assert_frame_equal(my_result, a1))
Output:
None
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With