Apply custom cumulative function to pandas dataframe

Tags: python, pandas

I have a dataframe sorted by date:

import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values('date')  # DataFrame.sort was removed from pandas
print(df)

         date  idx  pct_val  val
3  2016-04-30    2      NaN   10
0  2016-04-30    1      NaN   10
4  2016-05-31    2      -10    0
1  2016-05-31    1      -10    0
5  2016-06-31    2      -10    0
2  2016-06-31    1      NaN    5

I want to group by idx and then apply a cumulative function with some simple logic: if pct_val is null, add val to the running total; otherwise, multiply the running total by 1 + pct_val/100. In the table below, 'cumsum' shows the result of df.groupby('idx').val.cumsum() and 'cumulative_func' is the result I want.

         date  idx  pct_val  val  cumsum  cumulative_func
3  2016-04-30    2      NaN   10      10               10
0  2016-04-30    1      NaN   10      10               10
4  2016-05-31    2      -10    0      10                9
1  2016-05-31    1      -10    0      10                9
5  2016-06-31    2      -10    0      10              8.1
2  2016-06-31    1      NaN    5      15               14

Is there a way to apply a custom cumulative function to a dataframe, or a better way to achieve this?

user2899059 asked May 17 '16 18:05


2 Answers

I don't believe there is an easy way to accomplish your objective using vectorization. I would first try to get something working, and then optimize for speed if required.

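import numpy as np
import pandas as pd

def cumulative_func(df):
    results = []
    # .groups maps each idx value to the index labels of its rows
    for group in df.groupby('idx').groups.values():
        total = 0
        result = []
        for p, v in df.loc[group, ['pct_val', 'val']].values:
            if np.isnan(p):
                total += v  # no percentage given: add val
            else:
                total *= (1 + .01 * p)  # apply the percentage change
            result.append(total)
        results.append(pd.Series(result, index=group))
    # restore the original row order
    return pd.concat(results).reindex(df.index)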

df['cumulative_func'] = cumulative_func(df)

>>> df
         date  idx  pct_val  val  cumulative_func
3  2016-04-30    2      NaN   10             10.0
0  2016-04-30    1      NaN   10             10.0
4  2016-05-31    2      -10    0              9.0
1  2016-05-31    1      -10    0              9.0
5  2016-06-31    2      -10    0              8.1
2  2016-06-31    1      NaN    5             14.0
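
If speed does become a concern, one possible vectorized route is sketched below. It is not part of the loop above and rests on an assumption: each row applies total = factor * total + addend, so the running total equals the cumulative product of the factors times the cumulative sum of each addend divided by the cumulative factor at its row. The helper name cumulative_vectorized is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values(['date', 'idx'])

def cumulative_vectorized(df):
    # Hypothetical helper: a sketch, assuming pct_val is never -100.
    is_add = df['pct_val'].isna()
    # Multiplier per row: 1 on "add" rows, 1 + pct_val/100 otherwise.
    factor = (1 + df['pct_val'] / 100).where(~is_add, 1.0)
    # Amount added per row: val on "add" rows, 0 otherwise.
    addend = df['val'].where(is_add, 0)
    # Cumulative multiplier within each idx group.
    cum_factor = factor.groupby(df['idx']).cumprod()
    # total_n = cum_factor_n * sum over k <= n of addend_k / cum_factor_k
    scaled_sum = (addend / cum_factor).groupby(df['idx']).cumsum()
    return cum_factor * scaled_sum

df['cumulative_vectorized'] = cumulative_vectorized(df)
print(df)
```

The division by the cumulative factor is the weak point: a pct_val of -100 makes the factor zero, and long runs of very small factors can cost precision, so the explicit loop remains the safer reference.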
Alexander answered Oct 26 '22 19:10


First, I cleaned up your setup.

Setup

import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values(['date', 'idx'])
print(df)

Looks like:

         date  idx  pct_val  val
0  2016-04-30    1      NaN   10
3  2016-04-30    2      NaN   10
1  2016-05-31    1    -10.0    0
4  2016-05-31    2    -10.0    0
2  2016-06-31    1      NaN    5
5  2016-06-31    2    -10.0    0

Solution

def cumcustom(df):
    df = df.copy()  # leave the caller's frame untouched
    running_total = 0
    for idx, row in df.iterrows():
        if pd.isnull(row['pct_val']):
            running_total += row['val']  # no percentage given: add val
        else:
            running_total *= row['pct_val'] / 100. + 1  # apply the percentage change
        df.loc[idx, 'cumcustom'] = running_total
    return df

Then apply

df.groupby('idx').apply(cumcustom).reset_index(drop=True).sort_values(['date', 'idx'])

Looks like:

         date  idx  pct_val  val  cumcustom
0  2016-04-30    1      NaN   10       10.0
3  2016-04-30    2      NaN   10       10.0
1  2016-05-31    1    -10.0    0        9.0
4  2016-05-31    2    -10.0    0        9.0
2  2016-06-31    1      NaN    5       14.0
5  2016-06-31    2    -10.0    0        8.1
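
As a variation, the same running total can be built without groupby().apply(), whose handling of group keys and grouping columns has varied between pandas versions. The sketch below is an alternative rather than the method above: the helper name running_total is made up for illustration. It returns a Series on the group's own index, so the original row labels survive and no reset_index is needed.

```python
import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values(['date', 'idx'])

def running_total(group):
    # Hypothetical helper: walks one idx group in its current row order.
    total = 0.0
    out = []
    for pct, val in zip(group['pct_val'], group['val']):
        if pd.isna(pct):
            total += val              # no percentage given: add val
        else:
            total *= 1 + pct / 100.0  # apply the percentage change
        out.append(total)
    return pd.Series(out, index=group.index)

# Iterate over the groups explicitly; assignment aligns on the index labels.
pieces = [running_total(group) for _, group in df.groupby('idx')]
df['cumcustom'] = pd.concat(pieces)
print(df)
```

Iterating with zip over the two columns also avoids the per-row Series that iterrows() builds, which tends to matter on larger frames.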
piRSquared answered Oct 26 '22 18:10