
pandas GroupBy and cumulative mean of previous rows in group

I have a dataframe which looks like this:

import numpy as np
import pandas as pd
df = pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
                   'order_start': [1,2,3,1,2,3,1,2,3,1],
                   'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})
df
Out[40]: 
   category  order_start  time
0         1            1     1
1         1            2     4
2         1            3     3
3         2            1     6
4         2            2     8
5         2            3    17
6         3            1    14
7         3            2    12
8         3            3    13
9         4            1    16

I would like to create a new column that contains, for each row, the mean of the previous times within the same category. How can I create it?

The new column should look like this:

pd.DataFrame({'category': [1,1,1,2,2,2,3,3,3,4],
              'order_start': [1,2,3,1,2,3,1,2,3,1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16],
              'mean': [np.nan, 1, 2.5, np.nan, 6, 7, np.nan, 14, 13, np.nan]})
Out[41]: 
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0    = 1 / 1
2         1            3     3   2.5    = (4+1)/2
3         2            1     6   NaN
4         2            2     8   6.0    = 6 / 1
5         2            3    17   7.0    = (8+6) / 2
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN

Note: If it is the first time, the mean should be NaN.

EDIT: as stated by cs95, my question was not really the same as this one since here, expanding is required.

qwertzuiop asked Jun 27 '19 22:06



2 Answers

"create a new column which contains the mean of the previous times of the same category" sounds like a good use case for GroupBy.expanding (and a shift):

df['mean'] = (
    df.groupby('category')['time']
      # transform keeps df's index; apply prepends the group key on pandas >= 2.0
      .transform(lambda x: x.shift().expanding().mean()))
df
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0
2         1            3     3   2.5
3         2            1     6   NaN
4         2            2     8   6.0
5         2            3    17   7.0
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
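
To see what each step contributes, here is a minimal sketch on a single group (the `time` values of category 1), followed by the same idea applied per group. The variable names `times`, `shifted` and `prev_mean` are only illustrative.

```python
import numpy as np
import pandas as pd

# The 'time' values of category 1 from the question
times = pd.Series([1, 4, 3])

# shift() moves every value down one row, so each row only "sees" earlier rows
shifted = times.shift()                 # NaN, 1.0, 4.0

# expanding().mean() averages everything seen so far, skipping NaN
prev_mean = shifted.expanding().mean()  # NaN, 1.0, 2.5

# Applied per category on the full frame from the question
df = pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
                   'order_start': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
                   'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})
df['mean'] = df.groupby('category')['time'].transform(
    lambda s: s.shift().expanding().mean())
print(df)
```

The first row of every group stays NaN because, after the shift, its window contains no valid observation.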

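Another way to calculate this avoids the per-group lambda by chaining two groupby calls. Because the result is assigned positionally through to_numpy(), this assumes the rows are already sorted by category, as they are here: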

df['mean'] = (
    df.groupby('category')['time']
      .shift()
      .groupby(df['category'])
      .expanding()
      .mean()
      .to_numpy())  # replace to_numpy() with `.values` for pd.__version__ < 0.24
df
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0
2         1            3     3   2.5
3         2            1     6   NaN
4         2            2     8   6.0
5         2            3    17   7.0
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
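
If the rows are not grouped contiguously by category, a positional assignment would put values on the wrong rows. A possible variant, sketched here on a small made-up unsorted frame, drops the group level from the result's index and lets pandas align on the original index instead:

```python
import numpy as np
import pandas as pd

# Made-up example where the categories are interleaved
df = pd.DataFrame({'category': [2, 1, 2, 1, 1],
                   'time':     [6, 1, 8, 4, 3]})

# Previous value within each category (index is unchanged)
prev = df.groupby('category')['time'].shift()

# Expanding mean per category; the result is indexed by (category, original index)
result = prev.groupby(df['category']).expanding().mean()

# Drop the category level so the assignment aligns on the original index
df['mean'] = result.reset_index(level=0, drop=True)
print(df)
```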

In terms of performance, which of the two is faster really depends on the number and size of your groups, so it is worth timing both on your own data.
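
A rough way to compare the two on data shaped like yours is sketched below. The helper names (`with_transform`, `with_chained_groupby`, `make_frame`) and the row and group counts are arbitrary choices for illustration, and the sketch makes no claim about which variant wins.

```python
import timeit

import numpy as np
import pandas as pd


def with_transform(df):
    # One Python-level lambda call per group
    return df.groupby('category')['time'].transform(
        lambda s: s.shift().expanding().mean())


def with_chained_groupby(df):
    # Two grouped operations, no per-group lambda
    shifted = df.groupby('category')['time'].shift()
    out = shifted.groupby(df['category']).expanding().mean()
    # Drop the group level and restore the original row order
    return out.reset_index(level=0, drop=True).sort_index()


def make_frame(n_rows, n_groups, seed=0):
    # Synthetic data; sizes are assumptions, adjust to match your case
    rng = np.random.default_rng(seed)
    return pd.DataFrame({'category': rng.integers(0, n_groups, size=n_rows),
                         'time': rng.integers(1, 100, size=n_rows)})


for n_groups in (10, 500):
    frame = make_frame(10_000, n_groups)
    t_transform = timeit.timeit(lambda: with_transform(frame), number=3)
    t_chained = timeit.timeit(lambda: with_chained_groupby(frame), number=3)
    print(f"{n_groups:>4} groups: transform {t_transform:.3f}s, "
          f"chained {t_chained:.3f}s")
```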

cs95 answered Oct 26 '22 23:10


Inspired by my answer here, one can define a function first:

def mean_previous(df, Category, Order, Var):
    # Work on a sorted copy so the caller's dataframe is left untouched
    df = df.sort_values([Category, Order])

    # Grouped cumulative sum minus the cumulative sum within the current
    # order value, i.e. the sum over strictly earlier orders
    csp = df.groupby(Category)[Var].cumsum() - df.groupby([Category, Order])[Var].cumsum()

    # Same idea for the counts: number of rows with a strictly earlier order
    ccp = df.groupby(Category).cumcount() - df.groupby([Category, Order]).cumcount()

    # 0 / 0 for the first order of each category yields NaN
    return csp / ccp

And the desired column is

df['mean'] = mean_previous(df, 'category', 'order_start', 'time')

Performance-wise, I believe it's very fast, since it relies only on vectorized grouped cumulative operations.
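
One difference worth knowing about: when several rows of a category share the same order value, the cumulative-sum approach treats them as simultaneous, so none of them counts as "previous" for the others, whereas a shift-based approach goes strictly by row position. The sketch below illustrates this on a made-up frame with a tie; the function is restated (with lowercase parameter names and without in-place sorting) so the block runs on its own, and `tie_aware` and `row_based` are illustrative names.

```python
import numpy as np
import pandas as pd


def mean_previous(df, category, order, var):
    # Sorted copy; the result keeps the original index labels
    df = df.sort_values([category, order])
    # Sum and count over rows with a strictly earlier order value
    csp = df.groupby(category)[var].cumsum() - df.groupby([category, order])[var].cumsum()
    ccp = df.groupby(category).cumcount() - df.groupby([category, order]).cumcount()
    return csp / ccp


# Made-up data: the first two rows share order_start == 1
df = pd.DataFrame({'category': [1, 1, 1],
                   'order_start': [1, 1, 2],
                   'time': [2, 4, 6]})

tie_aware = mean_previous(df, 'category', 'order_start', 'time')
row_based = df.groupby('category')['time'].transform(
    lambda s: s.shift().expanding().mean())

print(pd.DataFrame({'tie_aware': tie_aware, 'row_based': row_based}))
```

With unique order values inside each category, as in the question, both columns come out the same.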

Anastasiya-Romanova 秀 answered Oct 27 '22 01:10