Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas time series resample, binning seems off

I was answering another question here with something about pandas I thought to know, time series resampling, when I noticed this odd binning.

Let's say I have a dataframe with a daily date range index and a column I want to resample and sum on.

index = pd.date_range(start="1/1/2018", end="31/12/2018") 
df = pd.DataFrame(np.random.randint(100, size=len(index)), 
                  columns=["sales"], index=index)

>>> df.head()
            sales
2018-01-01     66
2018-01-02     18
2018-01-03     45
2018-01-04     92
2018-01-05     76

Now I resample by one month, everything looks fine:

>>>df.resample("1M").sum()

            sales
2018-01-31   1507
2018-02-28   1186
2018-03-31   1382
[...]
2018-11-30   1342
2018-12-31   1337

If I try to resample by more months though binning starts to look off. This is particularly evident with 6M

df.resample("6M").sum()                                                           
            sales
2018-01-31   1507
2018-07-31   8393
2019-01-31   7283

First bin spans just over one month, last bin goes one month to the future. Maybe I have to set closed="left" to get the proper limits:

df.resample("6M", closed="left").sum()                                            
            sales
2018-06-30   8090
2018-12-31   9054
2019-06-30     39

Now I have an extra bin in 2019 with data from 2018-12-31...

Is this working properly? am I missing any option I should set?

EDIT: here's the output I would expect resampling one year in six month intervals, first interval spanning from Jan 1st to Jun 30, second interval spanning from Jul 1st to Dec 31.

df.resample("6M", closed="left").sum()                                            
            sales
2018-06-30   8090
2018-12-31   9093 # 9054 + 39

Note that there's also some doubt here about what it's happening with June 30 data, does it go in the first bin like I would expect or the second? I mean with the last bin it's evident but the same is probably happening in all the bins.

like image 598
filippo Avatar asked Dec 10 '18 05:12

filippo


2 Answers

The M time offset alias implies month end frequency. What you need is 6MS which is an alias for month start frequency:

df.resample('6MS').sum()

resulting in

            sales
2018-01-01   8130
2018-07-01   9563
2019-01-01      0

Also df.groupby(pd.Grouper(freq='6MS')).sum() can be used interchangeably.


For extra clarity you can compare ranges directly:

>>> pd.date_range('2018-01-01', '2018-12-31', freq='6M')
DatetimeIndex(['2018-01-31', '2018-07-31'], dtype='datetime64[ns]', freq='6M')

>>> pd.date_range('2018-01-01', '2018-12-31', freq='6MS')
DatetimeIndex(['2018-01-01', '2018-07-01'], dtype='datetime64[ns]', freq='6MS')
like image 119
ayorgo Avatar answered Nov 20 '22 22:11

ayorgo


Adding np.random.seed(365) to check both our outputs.

print(df.resample("6M", kind='period').sum())

         sales
2018-01   8794
2018-07   9033

would this work for you?

like image 3
Zanshin Avatar answered Nov 20 '22 22:11

Zanshin