I have a DataFrame that looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
                   'order_start': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
                   'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16]})
df
Out[40]:
   category  order_start  time
0         1            1     1
1         1            2     4
2         1            3     3
3         2            1     6
4         2            2     8
5         2            3    17
6         3            1    14
7         3            2    12
8         3            3    13
9         4            1    16
I would like to create a new column containing the mean of the previous time values within the same category. How can I create it?
The new column should look like this:
pd.DataFrame({'category': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4],
              'order_start': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
              'time': [1, 4, 3, 6, 8, 17, 14, 12, 13, 16],
              'mean': [np.nan, 1, 2.5, np.nan, 6, 7, np.nan, 14, 13, np.nan]})
Out[41]:
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0   = 1 / 1
2         1            3     3   2.5   = (4 + 1) / 2
3         2            1     6   NaN
4         2            2     8   6.0   = 6 / 1
5         2            3    17   7.0   = (8 + 6) / 2
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
Note: for the first row of a category (no previous times), the mean should be NaN.
EDIT: as stated by cs95, my question is not really the same as this one, since an expanding window is required here.
"create a new column which contains the mean of the previous times of the same category" sounds like a good use case for GroupBy.expanding
(and a shift):
# shift() drops the current row; expanding().mean() averages all earlier rows
df['mean'] = (
    df.groupby('category')['time']
      .apply(lambda x: x.shift().expanding().mean()))
df
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0
2         1            3     3   2.5
3         2            1     6   NaN
4         2            2     8   6.0
5         2            3    17   7.0
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
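A small variation on the same idea (my sketch, not from the original answer): passing the shift-then-expanding logic to transform instead of apply. transform always returns a result aligned to the original index, so the assignment needs no extra handling:
# transform keeps the original index, so the result can be assigned directly
df['mean'] = df.groupby('category')['time'].transform(
    lambda x: x.shift().expanding().mean())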
Another way to calculate this is without the apply, by chaining two groupby calls:
df['mean'] = (
    df.groupby('category')['time']
      .shift()
      .groupby(df['category'])
      .expanding()
      .mean()
      .to_numpy())  # replace to_numpy() with .values for pd.__version__ < 0.24
df
   category  order_start  time  mean
0         1            1     1   NaN
1         1            2     4   1.0
2         1            3     3   2.5
3         2            1     6   NaN
4         2            2     8   6.0
5         2            3    17   7.0
6         3            1    14   NaN
7         3            2    12  14.0
8         3            3    13  13.0
9         4            1    16   NaN
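As an aside (a sketch, not part of the original answer): the to_numpy() call is there because the grouped expanding result is indexed by (category, original index). Assuming a pandas version with Series.droplevel (0.24+), dropping that outer level and letting the assignment align by index should work as well; the intermediate name expanded is mine:
expanded = (
    df.groupby('category')['time']
      .shift()
      .groupby(df['category'])
      .expanding()
      .mean())
# The result carries 'category' as an extra outer index level;
# drop it so the Series aligns with df's original index.
df['mean'] = expanded.droplevel(0)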
In terms of performance, it really depends on the number and size of your groups.
Inspired by my answer here, one can define a function first:
def mean_previous(df, Category, Order, Var):
    # Order the dataframe first
    # (note: this assumes each (Category, Order) pair is unique)
    df.sort_values([Category, Order], inplace=True)

    # Calculate the ordinary grouped cumulative sum
    # and then subtract the grouped cumulative sum of the last order
    csp = df.groupby(Category)[Var].cumsum() - df.groupby([Category, Order])[Var].cumsum()

    # Calculate the ordinary grouped cumulative count
    # and then subtract the grouped cumulative count of the last order
    ccp = df.groupby(Category)[Var].cumcount() - df.groupby([Category, Order]).cumcount()

    return csp / ccp
And the desired column is
df['mean'] = mean_previous(df, 'category', 'order_start', 'time')
Performance-wise, I believe it's very fast, since it relies only on vectorized cumsum/cumcount operations and avoids a Python-level apply.
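Since the relative speed of these approaches depends on the number and size of the groups, a quick timing harness along these lines can settle it for a given data shape (a sketch; the names expanding_version/cumsum_version and the 200 x 50 sizes are my own, not from the answers):
import timeit
import numpy as np
import pandas as pd

# Build a larger test frame (sizes are arbitrary; adjust to your data shape).
n_categories, rows_per_category = 200, 50
big = pd.DataFrame({
    'category': np.repeat(np.arange(n_categories), rows_per_category),
    'order_start': np.tile(np.arange(rows_per_category), n_categories),
    'time': np.random.rand(n_categories * rows_per_category),
})

def expanding_version(df):
    # groupby + shift + expanding mean (the first approach, via transform)
    return df.groupby('category')['time'].transform(
        lambda x: x.shift().expanding().mean())

def cumsum_version(df):
    # the cumsum/cumcount trick; copy so sort_values(inplace=True) leaves big intact
    return mean_previous(df.copy(), 'category', 'order_start', 'time')

for name, func in [('expanding', expanding_version),
                   ('cumsum/cumcount', cumsum_version)]:
    best = min(timeit.repeat(lambda: func(big), number=5, repeat=3)) / 5
    print(f'{name}: {best:.4f} s per call')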