Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to understand closed and label arguments in pandas resample method?

Based on the pandas documentation from here: Docs

And the examples:

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

After resampling:

>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15

In my thoughts, the bins should looks like these after resampling:

=========bin 01=========
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2

=========bin 02=========
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5

=========bin 03=========
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8

Am I right on this step??

So after .sum I thought it should be like this:

2000-01-01 00:02:00     3
2000-01-01 00:05:00    12
2000-01-01 00:08:00    21

I just do not understand how it comes out:

2000-01-01 00:00:00 0

(because label='right', 2000-01-01 00:00:00 cannot be any right edge of any bins in this case).

2000-01-01 00:09:00 15

(the label 2000-01-01 00:09:00 even does not exists in the original Series.

like image 447
mingchau Avatar asked Jan 19 '18 11:01

mingchau


People also ask

What does resample do Pandas?

Pandas Series: resample() function The resample() function is used to resample time-series data. Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

How does Python resample work?

As previously mentioned, resample() is a method of pandas dataframes that can be used to summarize data by date or time. The . sum() method will add up all values for each resampling period (e.g. for each day) to provide a summary output value for that period.

How do you convert hourly data to daily data in Python?

But assuming you have a start point it becomes simple, take the hours for each start point and divide by 24, you'll get the days and leftover hours, and then add them to your start date and this way you'll get date and time.


2 Answers

Short answer: If you use closed='left' and loffset='2T' then you'll get what you expected:

series.resample('3T', label='left', closed='left', loffset='2T').sum()

2000-01-01 00:02:00     3
2000-01-01 00:05:00    12
2000-01-01 00:08:00    21

Long answer: (or why the results you got were correct, given the arguments you used) This may not be clear from the documentation, but open and closed in this setting is about strict vs non-strict inequality (e.g. < vs <=).

An example should make this clear. Using an interior interval from your example, this is the difference from changing the value of closed:

closed='right' =>  ( 3:00, 6:00 ]  or  3:00 <  x <= 6:00
closed='left'  =>  [ 3:00, 6:00 )  or  3:00 <= x <  6:00

You can find an explanation of the interval notation (parentheses vs brackets) in many places like here, for example: https://en.wikipedia.org/wiki/Interval_(mathematics)

The label parameter merely controls whether the left (3:00) or right (6:00) side is displayed, but doesn't impact the results themselves.

Also note that you can change the starting point for the intervals with the loffset parameter (which should be entered as a time delta).

Back to the example, where we change only the labeling from 'right' to 'left':

series.resample('3T', label='right', closed='right').sum()

2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15

series.resample('3T', label='left', closed='right').sum()

1999-12-31 23:57:00     0
2000-01-01 00:00:00     6
2000-01-01 00:03:00    15
2000-01-01 00:06:00    15

As you can see, the results are the same, only the index label changes. Pandas only lets you display the right or left label, but if it showed both, then it would look like this (below I'm using standard index notation where ( on the left side means open and ] on the right side means closed):

( 1999-12-31 23:57:00, 2000-01-01 00:00:00 ]   0   # = 0
( 2000-01-01 00:00:00, 2000-01-01 00:03:00 ]   6   # = 1+2+3
( 2000-01-01 00:03:00, 2000-01-01 00:06:00 ]  15   # = 4+5+6
( 2000-01-01 00:06:00, 2000-01-01 00:09:00 ]  15   # =   7+8

Note that the first bin (23:57:00,00:00:00] is not empty, it's just that it contains a single row and the value in that single row is zero. If you change 'sum' to 'count' this becomes more obvious:

series.resample('3T', label='left', closed='right').count()

1999-12-31 23:57:00    1
2000-01-01 00:00:00    3
2000-01-01 00:03:00    3
2000-01-01 00:06:00    2
like image 112
JohnE Avatar answered Oct 14 '22 23:10

JohnE


Per JohnE's answer I put together a little helpful infographic which should settle this issue once and for all:

enter image description here

like image 25
Molecool Avatar answered Oct 14 '22 23:10

Molecool