How to understand the closed and label arguments in the pandas resample method?

Tags: python, pandas, dataframe, time-series

This is based on the pandas documentation (Docs) and the examples given there:

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

After resampling:

>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15

I expected the bins to look like this after resampling:

=========bin 01=========
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2

=========bin 02=========
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5

=========bin 03=========
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8

Am I right about this step?

So after .sum() I expected the result to look like this:

2000-01-01 00:02:00     3
2000-01-01 00:05:00    12
2000-01-01 00:08:00    21

I just do not understand where these two rows come from:

2000-01-01 00:00:00 0

(Because label='right', I would not expect 2000-01-01 00:00:00 to be the right edge of any bin in this case.)

2000-01-01 00:09:00 15

(The label 2000-01-01 00:09:00 does not even exist in the original Series.)

asked Jan 19 '18 11:01

mingchau

2 Answers

Short answer: If you use label='left', closed='left' and loffset='2T', you'll get what you expected (this works on pandas versions before 2.0, where loffset is still available):

series.resample('3T', label='left', closed='left', loffset='2T').sum()

2000-01-01 00:02:00     3
2000-01-01 00:05:00    12
2000-01-01 00:08:00    21
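
A note for readers on newer pandas: loffset was deprecated in pandas 1.1 and removed in 2.0, and the 'T' alias is deprecated in favour of 'min' as of pandas 2.2. A minimal sketch of the same result, assuming a recent pandas, is to resample without loffset and then shift the index by hand (the variable name result is just for illustration):

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Left-closed, left-labelled 3-minute bins: [00:00, 00:03), [00:03, 00:06), ...
result = series.resample("3min", label="left", closed="left").sum()

# Replacement for loffset='2T': move every label forward by 2 minutes.
# Only the labels move; the bins and the sums stay the same.
result.index = result.index + pd.Timedelta(minutes=2)

print(result)
```

The labels now sit on the last timestamp inside each bin (00:02, 00:05, 00:08), which matches the output the question expected.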

Long answer (or why the results you got were correct, given the arguments you used): This may not be clear from the documentation, but open and closed in this setting are about strict vs non-strict inequality (e.g. < vs <=).

An example should make this clear. Using an interior interval from your example, this is the difference from changing the value of closed:

closed='right' =>  ( 3:00, 6:00 ]  or  3:00 <  x <= 6:00
closed='left'  =>  [ 3:00, 6:00 )  or  3:00 <= x <  6:00

You can find an explanation of the interval notation (parentheses vs brackets) in many places, for example: https://en.wikipedia.org/wiki/Interval_(mathematics)
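
If the inequalities are easier to read as code, here is a rough sketch that selects the rows for the 3:00 to 6:00 interval by hand with boolean masks. This is not how resample is implemented internally, just the same membership rule spelled out; lo, hi and the in_* names are made up for the illustration:

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

lo = pd.Timestamp("2000-01-01 00:03")
hi = pd.Timestamp("2000-01-01 00:06")

# closed='right'  =>  ( lo, hi ]  i.e.  lo <  x <= hi
in_right_closed = series[(series.index > lo) & (series.index <= hi)]

# closed='left'   =>  [ lo, hi )  i.e.  lo <= x <  hi
in_left_closed = series[(series.index >= lo) & (series.index < hi)]

print(in_right_closed.sum())  # rows 00:04, 00:05, 00:06 -> 4 + 5 + 6
print(in_left_closed.sum())   # rows 00:03, 00:04, 00:05 -> 3 + 4 + 5
```

The row at 00:03 belongs to this bin only when closed='left', and the row at 00:06 only when closed='right'.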

The label parameter merely controls whether the left (3:00) or right (6:00) side is displayed, but doesn't impact the results themselves.

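Also note that you can shift the displayed labels with the loffset parameter (which should be entered as a time delta). It moves the labels only; the bin edges themselves stay where they are. (loffset was deprecated in pandas 1.1 and removed in 2.0; on newer versions, add the time delta to the resulting index instead.)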
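
To be precise about labels vs edges: a loffset-style shift only relabels the result, while moving the bin edges themselves is a different operation. Newer pandas (1.1 and later) has an offset argument on resample for that. A rough sketch contrasting the two, assuming pandas 1.1+ (relabelled and shifted are illustrative names):

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Shifting the labels after the fact: same bins, same sums, new index.
relabelled = series.resample("3min", closed="left", label="left").sum()
relabelled.index = relabelled.index + pd.Timedelta(minutes=1)

# Shifting the bin edges: edges now fall on :01, :04, :07, so the rows are
# grouped differently and the sums change.
shifted = series.resample("3min", closed="left", label="left", offset="1min").sum()

print(relabelled)
print(shifted)
```

In the first result the sums are still 3, 12, 21; in the second the row at 00:00 ends up alone in a bin that starts before midnight.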

Back to the example, where we change only the labeling from 'right' to 'left':

series.resample('3T', label='right', closed='right').sum()

2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15

series.resample('3T', label='left', closed='right').sum()

1999-12-31 23:57:00     0
2000-01-01 00:00:00     6
2000-01-01 00:03:00    15
2000-01-01 00:06:00    15

As you can see, the results are the same; only the index label changes. Pandas only lets you display the right or the left label, but if it showed both, it would look like this (below I'm using standard interval notation, where ( on the left side means open and ] on the right side means closed):

( 1999-12-31 23:57:00, 2000-01-01 00:00:00 ]   0   # = 0
( 2000-01-01 00:00:00, 2000-01-01 00:03:00 ]   6   # = 1+2+3
( 2000-01-01 00:03:00, 2000-01-01 00:06:00 ]  15   # = 4+5+6
( 2000-01-01 00:06:00, 2000-01-01 00:09:00 ]  15   # =   7+8
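
resample has no option to label bins with both edges, but you can rebuild that view yourself. A small sketch, assuming the 3-minute bin width from the example and using pd.IntervalIndex.from_arrays (sums, bins and both_edges are just illustrative names):

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

sums = series.resample("3min", label="right", closed="right").sum()

# Each right-hand label is the closed end of a 3-minute bin, so the open
# left-hand edge is simply 3 minutes earlier.
bins = pd.IntervalIndex.from_arrays(
    sums.index - pd.Timedelta(minutes=3), sums.index, closed="right"
)
both_edges = pd.Series(sums.to_numpy(), index=bins)

print(both_edges)
```

Each entry of bins is a pd.Interval, so a check like `pd.Timestamp("2000-01-01 00:03") in bins[1]` tells you directly which bin a timestamp falls into.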

Note that the first bin (23:57:00, 00:00:00] is not empty. It contains a single row, and the value in that row happens to be zero. If you change 'sum' to 'count', this becomes more obvious:

series.resample('3T', label='left', closed='right').count()

1999-12-31 23:57:00    1
2000-01-01 00:00:00    3
2000-01-01 00:03:00    3
2000-01-01 00:06:00    2

answered Oct 14 '22 23:10

JohnE

Per JohnE's answer I put together a little helpful infographic which should settle this issue once and for all:

(Infographic image: https://i.stack.imgur.com/nX6yv.png)

answered Oct 14 '22 23:10

Molecool
