
pandas rolling window & datetime indexes: What does `offset` mean?

The rolling window function pandas.DataFrame.rolling of pandas 0.22 takes a window argument that is described as follows:

window : int, or offset

Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.

If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0

What actually is an offset in this context?

ascripter asked Feb 18 '18 18:02

1 Answer

In a nutshell, if you use an offset like "2D" (2 days), pandas uses the datetime values in the index to decide which rows fall into each window, so missing rows and irregular frequencies are accounted for. If you use a plain int like 2, pandas counts rows by position, as if the index were [0, 1, 2, ...], and ignores any datetime information in the index.

A simple example should make this clear:

import pandas as pd

df = pd.DataFrame({'x': range(4)},
    index=pd.to_datetime(['1-1-2018', '1-2-2018', '1-4-2018', '1-5-2018']))

            x
2018-01-01  0
2018-01-02  1
2018-01-04  2
2018-01-05  3

Note that (1) the index is a datetime index, and (2) it is missing '2018-01-03'. If you use a plain integer like 2, rolling just looks at the current row and the one before it, regardless of the datetime values (in a sense it behaves like iloc[i-1:i+1], where i is the position of the current row):

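# min_periods=1 keeps a count for the first row; recent pandas versions may return NaN there without it
df.rolling(2, min_periods=1).count()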

              x
2018-01-01  1.0
2018-01-02  2.0
2018-01-04  2.0
2018-01-05  2.0
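
As a rough check of the iloc analogy, the sketch below rebuilds the integer-window counts by slicing on row position only. The names `window`, `manual_counts` and `rolling_counts` are made up for illustration, and `min_periods=1` is passed explicitly on the assumption that the default for `count()` differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": range(4)},
    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-04", "2018-01-05"]),
)

window = 2  # integer window: a number of rows, not a time span

# For each row position i, take the `window` rows ending at i (inclusive).
# The start is clipped at 0, so the first window holds a single row.
manual_counts = [
    float(len(df.iloc[max(0, i - window + 1) : i + 1]))
    for i in range(len(df))
]

# min_periods=1 so the first, incomplete window still yields a count.
rolling_counts = df.rolling(window, min_periods=1).count()["x"].tolist()

print(manual_counts)
print(rolling_counts)
```

Both lists come out the same because nothing in the slicing ever looks at the dates.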

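Conversely, if you use an offset of 2 days ('2D'), rolling uses the actual datetime values and accounts for any irregularities in the datetime index. By default the window for a timestamp t covers the interval (t - 2 days, t], open on the left. For 2018-01-04 that interval excludes 2018-01-02, and 2018-01-03 is missing, so only one row is counted: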

df.rolling('2D').count()

              x
2018-01-01  1.0
2018-01-02  2.0
2018-01-04  1.0
2018-01-05  2.0
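
To make the time-based window concrete, here is a small sketch that recomputes the '2D' counts by hand from the index timestamps, assuming the default right-closed window. The names `span` and `manual_counts` are illustrative only. It also tries `closed="both"`, a parameter of `rolling` that includes the left edge of the interval as well.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": range(4)},
    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-04", "2018-01-05"]),
)

span = pd.Timedelta("2D")

# For each timestamp t, count the rows whose index lies in (t - span, t].
manual_counts = [
    float(((df.index > t - span) & (df.index <= t)).sum())
    for t in df.index
]

# Offset window with the default closing (right edge included, left excluded).
rolling_counts = df.rolling("2D").count()["x"].tolist()

# closed="both" also includes the left edge, i.e. the interval [t - span, t].
both_counts = df.rolling("2D", closed="both").count()["x"].tolist()

print(manual_counts)
print(rolling_counts)
print(both_counts)
```

With `closed="both"`, the row for 2018-01-02 falls inside the window of 2018-01-04, so that count goes from 1 to 2.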

Also note that the index must be sorted (monotonic) when you use a date offset, otherwise pandas raises an error. With a plain integer the order doesn't matter, since the index is ignored anyway.
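
A short sketch of the sorting requirement, using a shuffled copy of the same four dates (the frame `unsorted` and the other variable names are hypothetical). It assumes the error raised for a non-monotonic index is a `ValueError`, which is what recent pandas versions do as far as I know.

```python
import pandas as pd

# Same dates as above, but deliberately out of order.
unsorted = pd.DataFrame(
    {"x": [2, 0, 3, 1]},
    index=pd.to_datetime(["2018-01-04", "2018-01-01", "2018-01-05", "2018-01-02"]),
)

# An offset window on a non-monotonic index is rejected.
try:
    unsorted.rolling("2D").count()
    offset_error = None
except ValueError as err:
    offset_error = err

# An integer window works regardless of order, because the index is ignored.
int_counts = unsorted.rolling(2, min_periods=1).count()["x"].tolist()

# Sorting the index first makes the offset window usable.
sorted_counts = unsorted.sort_index().rolling("2D").count()["x"].tolist()

print(offset_error)
print(int_counts)
print(sorted_counts)
```

Calling `sort_index()` before `rolling` is usually the simplest fix when the data arrives out of order.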

JohnE answered Oct 01 '22 23:10
