I have a pandas DataFrame like this:
import pandas as pd

test = pd.DataFrame({'Date': ['2016-04-01', '2016-04-01', '2016-04-02',
                              '2016-04-02', '2016-04-03', '2016-04-04',
                              '2016-04-05', '2016-04-06', '2016-04-06'],
                     'User': ['Mike', 'John', 'Mike', 'John', 'Mike', 'Mike',
                              'Mike', 'Mike', 'John'],
                     'Value': [1, 2, 1, 3, 4.5, 1, 2, 3, 6]})
As you can see below, the data set does not necessarily have an observation for every user on every day:
         Date  User  Value
0  2016-04-01  Mike    1.0
1  2016-04-01  John    2.0
2  2016-04-02  Mike    1.0
3  2016-04-02  John    3.0
4  2016-04-03  Mike    4.5
5  2016-04-04  Mike    1.0
6  2016-04-05  Mike    2.0
7  2016-04-06  Mike    3.0
8  2016-04-06  John    6.0
I'd like to add a new column that shows the average value for each user over the past n days (here n = 2), provided at least one of those days has data; otherwise it should be NaN. For example, on 2016-04-06 John gets NaN because he has no data for 2016-04-05 or 2016-04-04. So the result would look like this:
         Date  User  Value  Value_Average_Past_2_days
0  2016-04-01  Mike    1.0                        NaN
1  2016-04-01  John    2.0                        NaN
2  2016-04-02  Mike    1.0                       1.00
3  2016-04-02  John    3.0                       2.00
4  2016-04-03  Mike    4.5                       1.00
5  2016-04-04  Mike    1.0                       2.75
6  2016-04-05  Mike    2.0                       2.75
7  2016-04-06  Mike    3.0                       1.50
8  2016-04-06  John    6.0                        NaN
After reading several posts on the forum, it seems I should use a combination of groupby and a customized rolling mean, but I couldn't quite figure out how to put it together.
I think you can first convert the column Date with to_datetime, then fill in the missing days per user by groupby with resample, and finally apply rolling:
test['Date'] = pd.to_datetime(test['Date'])
# Reindex each user onto a full daily calendar; missing days become NaN rows.
df = test.groupby('User').apply(lambda x: x.set_index('Date').resample('1D').first())
print(df)
                  User  Value
User Date                    
John 2016-04-01   John    2.0
     2016-04-02   John    3.0
     2016-04-03    NaN    NaN
     2016-04-04    NaN    NaN
     2016-04-05    NaN    NaN
     2016-04-06   John    6.0
Mike 2016-04-01   Mike    1.0
     2016-04-02   Mike    1.0
     2016-04-03   Mike    4.5
     2016-04-04   Mike    1.0
     2016-04-05   Mike    2.0
     2016-04-06   Mike    3.0
# Exclude the current day with shift(), then take the mean of the previous
# two days; min_periods=1 keeps a value when only one of them is present.
df1 = (df.groupby(level=0)['Value']
         .apply(lambda x: x.shift().rolling(min_periods=1, window=2).mean())
         .reset_index(name='Value_Average_Past_2_days'))
print(df1)
    User       Date  Value_Average_Past_2_days
0   John 2016-04-01                        NaN
1   John 2016-04-02                       2.00
2   John 2016-04-03                       2.50
3   John 2016-04-04                       3.00
4   John 2016-04-05                        NaN
5   John 2016-04-06                        NaN
6   Mike 2016-04-01                        NaN
7   Mike 2016-04-02                       1.00
8   Mike 2016-04-03                       1.00
9   Mike 2016-04-04                       2.75
10  Mike 2016-04-05                       2.75
11  Mike 2016-04-06                       1.50
print(pd.merge(test, df1, on=['Date', 'User'], how='left'))
        Date  User  Value  Value_Average_Past_2_days
0 2016-04-01  Mike    1.0                        NaN
1 2016-04-01  John    2.0                        NaN
2 2016-04-02  Mike    1.0                       1.00
3 2016-04-02  John    3.0                       2.00
4 2016-04-03  Mike    4.5                       1.00
5 2016-04-04  Mike    1.0                       2.75
6 2016-04-05  Mike    2.0                       2.75
7 2016-04-06  Mike    3.0                       1.50
8 2016-04-06  John    6.0                        NaN
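On a current pandas release (where print is a function and pd.rolling_mean is gone), the same resample-shift-rolling idea can be written as one pipeline. This is only a minimal sketch of that adaptation, assuming pandas 1.0+ and the test frame defined in the question; it is not the original answer's code:
test['Date'] = pd.to_datetime(test['Date'])
# Per user: reindex onto a daily calendar, drop the current day with shift(),
# then average the previous two days (NaN when both are missing).
past2 = (test.set_index('Date')
             .groupby('User')['Value']
             .apply(lambda s: s.resample('D').first()
                               .shift()
                               .rolling(window=2, min_periods=1)
                               .mean())
             .rename('Value_Average_Past_2_days')
             .reset_index())
result = test.merge(past2, on=['User', 'Date'], how='left')
print(result)
The left merge brings the per-day averages back onto only the original rows, so result should match the table above.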
n = 2
# Cast your dates as timestamps.
test['Date'] = pd.to_datetime(test.Date)
# Create a daily index spanning the range of the original index.
idx = pd.date_range(test.Date.min(), test.Date.max(), freq='D')
# Pivot by Dates and Users.
df = test.pivot(index='Date', values='Value', columns='User').reindex(idx)
>>> df.head(3)
User        John  Mike
2016-04-01   2.0   1.0
2016-04-02   3.0   1.0
2016-04-03   NaN   4.5
# Apply a rolling mean on the above dataframe and reset the index.
# For pandas versions before 0.18.0:
df2 = (pd.rolling_mean(df.shift(), n, min_periods=1)
       .reset_index()
       .drop_duplicates())

# For pandas 0.18.0+:
df2 = (df.shift().rolling(window=n, min_periods=1).mean()
       .reset_index()
       .drop_duplicates())
# Melt the result back into the original form.
df3 = (pd.melt(df2, id_vars='Date', value_name='Value')
       .sort_values(['Date', 'User'])
       .reset_index(drop=True))
>>> df3.head()
        Date  User  Value
0 2016-04-01  John    NaN
1 2016-04-01  Mike    NaN
2 2016-04-02  John    2.0
3 2016-04-02  Mike    1.0
4 2016-04-03  John    2.5
# Merge the results back into the original dataframe.
>>> test.merge(df3, on=['Date', 'User'], how='left',
...            suffixes=['', '_Average_past_{0}_days'.format(n)])
        Date  User  Value  Value_Average_past_2_days
0 2016-04-01  Mike    1.0                        NaN
1 2016-04-01  John    2.0                        NaN
2 2016-04-02  Mike    1.0                       1.00
3 2016-04-02  John    3.0                       2.00
4 2016-04-03  Mike    4.5                       1.00
5 2016-04-04  Mike    1.0                       2.75
6 2016-04-05  Mike    2.0                       2.75
7 2016-04-06  Mike    3.0                       1.50
8 2016-04-06  John    6.0                        NaN
Summary
n = 2
test['Date'] = pd.to_datetime(test.Date)
idx = pd.date_range(test.Date.min(), test.Date.max(), freq='D')
df = test.pivot(index='Date', values='Value', columns='User').reindex(idx)
df2 = (df.shift().rolling(window=n, min_periods=1).mean()
       .reset_index()
       .drop_duplicates())
df3 = (pd.melt(df2, id_vars='Date', value_name='Value')
       .sort_values(['Date', 'User'])
       .reset_index(drop=True))
test.merge(df3, on=['Date', 'User'], how='left',
           suffixes=['', '_Average_past_{0}_days'.format(n)])
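As a footnote beyond the answers above: newer pandas versions (0.20 or later, as far as I know) also accept a time-based rolling window with closed='left', which looks back over the previous n calendar days, excludes the current day, and tolerates missing dates without any reindexing or pivoting. A rough sketch under the assumption that each user has at most one row per date:
n = 2
test['Date'] = pd.to_datetime(test['Date'])
# A '2D' window ending just before each row's date (closed='left') averages
# whatever observations fall in the previous two calendar days; if there are
# none, the mean is NaN. Rows must be date-sorted within each user.
past = (test.sort_values('Date')
            .set_index('Date')
            .groupby('User')['Value']
            .rolling('{0}D'.format(n), closed='left')
            .mean()
            .rename('Value_Average_Past_{0}_days'.format(n))
            .reset_index())
result = test.merge(past, on=['User', 'Date'], how='left')
Merging back on ['User', 'Date'] should reproduce the Value_Average_Past_2_days column from the question for this sample data.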