I have a dataframe sorted by date:

import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-30',
                            '2016-04-30', '2016-05-31', '2016-06-30'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values('date')
print(df)
date idx pct_val val
3 2016-04-30 2 NaN 10
0 2016-04-30 1 NaN 10
4 2016-05-31 2 -10 0
1 2016-05-31 1 -10 0
5 2016-06-30 2 -10 0
2 2016-06-30 1 NaN 5
And I want to group by idx, then apply a cumulative function with some simple logic: if pct_val is null, add val to the running total; otherwise, multiply the running total by 1 + pct_val/100. The 'cumsum' column shows the result of df.groupby('idx').val.cumsum(), and 'cumulative_func' is the result I want.
date idx pct_val val cumsum cumulative_func
3 2016-04-30 2 NaN 10 10 10
0 2016-04-30 1 NaN 10 10 10
4 2016-05-31 2 -10 0 10 9
1 2016-05-31 1 -10 0 10 9
5 2016-06-30 2 -10 0 10 8
2 2016-06-30 1 NaN 5 15 14
Any idea if there is a way to apply a custom cumulative function to a dataframe, or a better way to achieve this?
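For reference, here is the rule run by hand on the idx == 1 rows (a plain-Python sketch, starting the running total at 0):

total = 0.0
for p, v in [(None, 10), (-10, 0), (None, 5)]:  # (pct_val, val) in date order
    total = total + v if p is None else total * (1 + p / 100.)
    print(total)  # 10.0, then 9.0, then 14.0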
I don't believe there is an easy way to accomplish your objective using vectorization. I would first try to get something working, and then optimize for speed if required.
import numpy as np

def cumulative_func(df):
    results = []
    # Walk each group's row labels in order, applying the add-or-compound rule.
    for group in df.groupby('idx').groups.values():
        total = 0
        result = []
        for p, v in df.loc[group, ['pct_val', 'val']].values:
            if np.isnan(p):
                total += v
            else:
                total *= (1 + .01 * p)
            result.append(total)
        results.append(pd.Series(result, index=group))
    # Stitch the per-group series back together in the original row order.
    return pd.concat(results).reindex(df.index)

df['cumulative_func'] = cumulative_func(df)
>>> df
date idx pct_val val cumulative_func
3 2016-04-30 2 NaN 10 10.0
0 2016-04-30 1 NaN 10 10.0
4 2016-05-31 2 -10 0 9.0
1 2016-05-31 1 -10 0 9.0
5 2016-06-30 2 -10 0 8.1
2 2016-06-30 1 NaN 5 14.0
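If speed ever becomes an issue, the same loop can be compiled with numba over each group's underlying NumPy arrays (a sketch, assuming numba is installed and a pandas version with to_numpy(); cum_custom_nb is just an illustrative name):

import numba
import numpy as np

@numba.njit
def cum_custom_nb(pct, val):
    # Same add-or-compound recurrence, over plain float arrays.
    out = np.empty(len(val))
    total = 0.0
    for i in range(len(val)):
        if np.isnan(pct[i]):
            total += val[i]
        else:
            total *= 1.0 + pct[i] / 100.0
        out[i] = total
    return out

for _, g in df.groupby('idx'):
    df.loc[g.index, 'cumulative_func'] = cum_custom_nb(
        g['pct_val'].to_numpy(np.float64), g['val'].to_numpy(np.float64))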
First I cleaned up your setup:

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-30',
                            '2016-04-30', '2016-05-31', '2016-06-30'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values(['date', 'idx'])
print(df)
Looks like:
date idx pct_val val
0 2016-04-30 1 NaN 10
3 2016-04-30 2 NaN 10
1 2016-05-31 1 -10.0 0
4 2016-05-31 2 -10.0 0
2 2016-06-30 1 NaN 5
5 2016-06-30 2 -10.0 0
def cumcustom(df):
    df = df.copy()
    running_total = 0
    # Iterate rows in order, updating the running total as we go.
    for idx, row in df.iterrows():
        if pd.isnull(row['pct_val']):
            running_total += row['val']
        else:
            running_total *= row['pct_val'] / 100. + 1
        df.loc[idx, 'cumcustom'] = running_total
    return df
Then apply
df.groupby('idx').apply(cumcustom).reset_index(drop=True).sort_values(['date', 'idx'])
Looks like:
date idx pct_val val cumcustom
0 2016-04-30 1 NaN 10 10.0
3 2016-04-30 2 NaN 10 10.0
1 2016-05-31 1 -10.0 0 9.0
4 2016-05-31 2 -10.0 0 9.0
2 2016-06-30 1 NaN 5 14.0
5 2016-06-30 2 -10.0 0 8.1
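As a side note, the same per-group recurrence can also be written with itertools.accumulate (a sketch assuming Python 3.8+ for accumulate's initial argument, and that pct_val is a float column so None shows up as NaN):

from itertools import accumulate
import numpy as np
import pandas as pd

def step(total, pv):
    # pv is a (pct_val, val) pair: add when pct_val is NaN, compound otherwise.
    p, v = pv
    return total + v if np.isnan(p) else total * (1 + p / 100.)

def cumcustom_series(g):
    totals = accumulate(zip(g['pct_val'], g['val']), step, initial=0)
    return pd.Series(list(totals)[1:], index=g.index)  # drop the seed value 0

df['cumcustom'] = df.groupby('idx', group_keys=False).apply(cumcustom_series)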