
1 Year Rolling mean pandas on column date

I would like to compute the 1-year rolling average for each row in this Dataframe test:

index   id      date        variation
2313    7034    2018-03-14  4.139148e-06
2314    7034    2018-03-13  4.953194e-07
2315    7034    2018-03-12  2.854749e-06
2316    7034    2018-03-09  3.907458e-06
2317    7034    2018-03-08  1.662412e-06
2318    7034    2018-03-07  1.346433e-06
2319    7034    2018-03-06  8.731700e-06
2320    7034    2018-03-05  7.145597e-06
2321    7034    2018-03-02  4.893283e-06
...

For example, I would need to calculate:

  • mean of variation of id 7034 between 2018-03-14 and 2017-03-14
  • mean of variation of id 7034 between 2018-03-13 and 2017-03-13
  • etc.

I tried:

test.groupby(['id','date'])['variation'].rolling(window=1,freq='Y',on='date').mean()

but I got the error message:

ValueError: invalid on specified as date, must be a column (if DataFrame) or None

How can I use the pandas rolling() function in this case?


[EDIT 1] [thanks to Sacul]

I tested:

df['date'] = pd.to_datetime(df['date'])

df.set_index('date').groupby('id').rolling(window=1, freq='Y').mean()['variation']

But freq='Y' doesn't work (I got: ValueError: Invalid frequency: Y), so I then used window=365, freq='D'.

But there is another issue: because there are never 365 consecutive dates for a given id, the result is always empty. Even if there are missing dates, I would like to ignore them and use all dates between the current date and (current date - 365 days) to compute the rolling mean. For instance, imagine I have:

index   id      date        variation
2313    7034    2018-03-14  4.139148e-06
2314    7034    2018-03-13  4.953194e-07
2315    7034    2017-03-13  2.854749e-06

Then,

  • for 7034 2018-03-14: I would like to compute MEAN(4.139148e-06,4.953194e-07, 2.854749e-06)
  • for 7034 2018-03-13: I would like to compute also MEAN(4.139148e-06,4.953194e-07, 2.854749e-06)

How can I do that?


[EDIT 2]

Finally, I used the calls below to calculate the rolling median, mean, and standard deviation over 1 year, ignoring missing dates:

pd.rolling_median(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)

pd.rolling_mean(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)

pd.rolling_std(df.set_index('date').groupby('id')['variation'],window=365, freq='D',min_periods=1)
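
For readers on newer pandas releases, where the top-level pd.rolling_median, pd.rolling_mean and pd.rolling_std functions are no longer available, the same idea can probably be expressed with the .rolling() method and an offset window. The following is only a sketch: the sample values are made up for illustration, and the '365D' window is an assumption standing in for window=365, freq='D'.

```python
import pandas as pd

# Hypothetical sample data using the question's column names (values are made up).
df = pd.DataFrame({
    "id": [7034, 7034, 7034],
    "date": ["2018-03-14", "2018-03-13", "2017-03-14"],
    "variation": [4.0, 2.0, 6.0],
})
df["date"] = pd.to_datetime(df["date"])

# Offset windows need a sorted datetime index within each group.
grouped = (
    df.sort_values(["id", "date"])
      .set_index("date")
      .groupby("id")["variation"]
)

# "365D" is a trailing window ending at each row's date; by default the
# left edge is excluded, so a row exactly 365 days earlier is not counted.
# min_periods=1 returns a value even when only one observation is in range.
window = grouped.rolling("365D", min_periods=1)

stats = pd.DataFrame({
    "median": window.median(),
    "mean": window.mean(),
    "std": window.std(),
})
print(stats)
```

The result is indexed by (id, date). Because the window is defined by time span rather than row count, gaps in the dates do not empty the result.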

asked Mar 20 '18 15:03 by Thomas



1 Answer

I believe this should work for you:

# First make sure that `date` is a datetime object:

df['date'] = pd.to_datetime(df['date'])

df.set_index('date').groupby('id').rolling(window=1, freq='A').mean()['variation']

Using pd.DataFrame.rolling with datetimes works well when the date is the index, which is why I used df.set_index('date') (as can be seen in one of the documentation's examples).

I can't really test whether it computes the yearly average on your example dataframe, as it contains only one year and only one ID, but it should work.
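
Since the goal is a value for each row of the original DataFrame, one possible way to get there is to compute the trailing mean per id and merge it back on (id, date). This is a sketch under assumptions, not a tested answer for your data: the sample rows, the second id 7035, and the column name variation_1y_mean are made up, and the merge assumes each (id, date) pair appears only once.

```python
import pandas as pd

# Hypothetical data: two ids, unsorted dates, with a gap of almost a year.
df = pd.DataFrame({
    "id": [7034, 7034, 7034, 7035],
    "date": ["2018-03-14", "2018-03-13", "2017-03-14", "2018-03-14"],
    "variation": [4.0, 2.0, 6.0, 10.0],
})
df["date"] = pd.to_datetime(df["date"])

# Trailing 365-day mean per id; the result is a Series indexed by (id, date).
rolling_mean = (
    df.sort_values(["id", "date"])
      .set_index("date")
      .groupby("id")["variation"]
      .rolling("365D", min_periods=1)
      .mean()
      .rename("variation_1y_mean")  # hypothetical column name
      .reset_index()
)

# Attach the rolling mean to each original row, keeping the original row order.
# Assumes (id, date) pairs are unique; duplicates would multiply rows here.
result = df.merge(rolling_mean, on=["id", "date"], how="left")
print(result)
```

Each id is handled independently, so the 7035 row only ever averages over its own history.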

Arguably Better Solution:

[EDIT] As pointed out by Mihai-Andrei Dinculescu, freq is now a deprecated argument. Here is an alternative (and probably more future-proof) way to do what you're looking for:

df.set_index('date').groupby('id')['variation'].resample('A').mean()

You can take a look at the resample documentation for more details on how this works, and at the pandas documentation on frequency strings regarding the frequency arguments.
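
One thing to keep in mind: resampling by year aggregates per calendar year, so it yields one value per id and year rather than a trailing 12-month value per row. The sketch below illustrates that behaviour with a plain groupby on the year instead of resample, because the spelling of the year-end alias has changed between pandas versions. The sample values and the "year" label are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "id": [7034, 7034, 7034],
    "date": pd.to_datetime(["2018-03-14", "2018-03-13", "2017-03-14"]),
    "variation": [4.0, 2.0, 6.0],
})

# Calendar-year mean per id: one row per (id, year), not one per original row.
yearly = (
    df.groupby(["id", df["date"].dt.year.rename("year")])["variation"]
      .mean()
)
print(yearly)
```

Here the 2017 observation forms its own bucket and never contributes to the 2018 value, which is the main difference from a trailing 365-day window.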

answered Sep 26 '22 06:09 by sacuL