
How to resample a df with datetime index to exactly n equally sized periods?

Tags: python, pandas

I've got a large dataframe with a datetime index and need to resample data to exactly 10 equally sized periods.

So far, I've tried finding the first and last dates to determine the total number of days in the data, dividing that by 10 to get the size of each period, and then resampling with that number of days, e.g.:

first = df.reset_index().timesubmit.min()
last = df.reset_index().timesubmit.max()
periodsize = str((last-first).days // 10) + 'D'

df.resample(periodsize).sum()

This doesn't guarantee exactly 10 periods in the df after resampling, since the period size is an int that has been rounded down. Using a float doesn't work in the resampling. It seems that either there's something simple I'm missing here, or I'm attacking the problem all wrong.

j.k asked Jul 03 '15 17:07


2 Answers

import numpy as np
import pandas as pd

n = 10
nrows = 33
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
print(df)
#             0
# 2000-01-01  1
# 2000-01-02  1
# ...
# 2000-02-01  1
# 2000-02-02  1

first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')
secs = int((last-first).total_seconds()//n)
periodsize = '{:d}s'.format(secs)

result = df.resample(periodsize).sum()
print('\n{}'.format(result))
assert len(result) == n

yields

                     0
2000-01-01 00:00:00  4
2000-01-04 07:12:00  3
2000-01-07 14:24:00  3
2000-01-10 21:36:00  4
2000-01-14 04:48:00  3
2000-01-17 12:00:00  3
2000-01-20 19:12:00  4
2000-01-24 02:24:00  3
2000-01-27 09:36:00  3
2000-01-30 16:48:00  3

The values in column 0 give the number of rows aggregated into each period, since the original DataFrame was filled with ones. The pattern of 4s and 3s is about as even as you can get, since 33 rows cannot be divided evenly into 10 groups.


Explanation: Consider this simpler DataFrame:

n = 2
nrows = 5
index = pd.date_range('2000-1-1', periods=nrows, freq='D')
df = pd.DataFrame(np.ones(nrows), index=index)
#             0
# 2000-01-01  1
# 2000-01-02  1
# 2000-01-03  1
# 2000-01-04  1
# 2000-01-05  1

Using df.resample('2D').sum() gives the wrong number of groups:

In [366]: df.resample('2D').sum()
Out[366]: 
            0
2000-01-01  2
2000-01-03  2
2000-01-05  1

Using df.resample('3D').sum() gives the right number of groups, but the second group starts at 2000-01-04, which does not divide the DataFrame into two equally spaced groups:

In [367]: df.resample('3D').sum()
Out[367]: 
            0
2000-01-01  3
2000-01-04  2

To do better, we need to work at a finer time resolution than days. Since Timedeltas have a total_seconds method, let's work in seconds. For the example above, the desired frequency string would be '216000s':

In [374]: df.resample('216000s').sum()
Out[374]: 
                     0
2000-01-01 00:00:00  3
2000-01-03 12:00:00  2

since there are 216000*2 seconds in 5 days:

In [373]: (pd.Timedelta(days=5) / pd.Timedelta('1s'))/2
Out[373]: 216000.0

Okay, so now all we need is a way to generalize this. We'll want the minimum and maximum dates in the index:

first = df.index.min()
last = df.index.max() + pd.Timedelta('1D')

We add an extra day because it makes the difference in days come out right. In the example above, there are only 4 days between the Timestamps for 2000-01-05 and 2000-01-01:

In [377]: (pd.Timestamp('2000-01-05')-pd.Timestamp('2000-01-01')).days
Out[377]: 4

But as we can see in the worked example, the DataFrame has 5 rows representing 5 days. So it makes sense that we need to add an extra day.

Now we can compute the correct number of seconds in each equally-spaced group with:

secs = int((last-first).total_seconds()//n)
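
The steps above can be wrapped into a small helper. This is only a sketch: the function name resample_sum_n_periods and its step parameter are hypothetical, and it assumes pandas >= 1.1 so that origin='start' is available to anchor the bins at the first timestamp rather than at midnight of the first day.

```python
import numpy as np
import pandas as pd


def resample_sum_n_periods(df, n, step=pd.Timedelta('1D')):
    """Sum df into n equal-width time bins (hypothetical helper).

    step is the assumed spacing between rows; it plays the role of the
    extra day added to the last timestamp in the answer above.
    """
    first = df.index.min()
    last = df.index.max() + step
    secs = int((last - first).total_seconds() // n)
    # origin='start' anchors the first bin at the first timestamp
    return df.resample('{:d}s'.format(secs), origin='start').sum()


# daily data that does not start at midnight
index = pd.date_range('2000-01-01 06:00', periods=33, freq='D')
df = pd.DataFrame({'value': np.ones(33)}, index=index)
result = resample_sum_n_periods(df, 10)
print(result)
```

Because the period is rounded down to whole seconds, the n bins can fall a few seconds short of the full span; that is harmless as long as the spacing between rows is much larger than n seconds.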
unutbu answered Nov 01 '22 09:11


Here is one way to ensure equal-size sub-periods: use np.linspace() to compute equally spaced cut-off points over the time span, then assign each observation to a bin with pd.cut.

import pandas as pd
import numpy as np

# generate artificial data
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'],
                  index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8h'))

Out[87]: 
                          A       B
2015-01-01 00:00:00  1.7641  0.4002
2015-01-01 08:00:00  0.9787  2.2409
2015-01-01 16:00:00  1.8676 -0.9773
2015-01-02 00:00:00  0.9501 -0.1514
2015-01-02 08:00:00 -0.1032  0.4106
2015-01-02 16:00:00  0.1440  1.4543
2015-01-03 00:00:00  0.7610  0.1217
2015-01-03 08:00:00  0.4439  0.3337
2015-01-03 16:00:00  1.4941 -0.2052
2015-01-04 00:00:00  0.3131 -0.8541
2015-01-04 08:00:00 -2.5530  0.6536
2015-01-04 16:00:00  0.8644 -0.7422
2015-01-05 00:00:00  2.2698 -1.4544
2015-01-05 08:00:00  0.0458 -0.1872
2015-01-05 16:00:00  1.5328  1.4694
...                     ...     ...
2015-01-29 08:00:00  0.9209  0.3187
2015-01-29 16:00:00  0.8568 -0.6510
2015-01-30 00:00:00 -1.0342  0.6816
2015-01-30 08:00:00 -0.8034 -0.6895
2015-01-30 16:00:00 -0.4555  0.0175
2015-01-31 00:00:00 -0.3540 -1.3750
2015-01-31 08:00:00 -0.6436 -2.2234
2015-01-31 16:00:00  0.6252 -1.6021
2015-02-01 00:00:00 -1.1044  0.0522
2015-02-01 08:00:00 -0.7396  1.5430
2015-02-01 16:00:00 -1.2929  0.2671
2015-02-02 00:00:00 -0.0393 -1.1681
2015-02-02 08:00:00  0.5233 -0.1715
2015-02-02 16:00:00  0.7718  0.8235
2015-02-03 00:00:00  2.1632  1.3365

[100 rows x 2 columns]


# cut-off points: 10 equal-size groups require 11 points,
# measured in hours since the first timestamp
time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
n = 10
ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n+1)
# labels: the cut-off points converted back to timestamps
time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta)+'h') for time_delta in ts_cutoff])

# create a categorical reference variable labelled by the start of each period
df['start_time_index'] = pd.cut(time_delta_in_hours, bins=n, labels=time_index[:-1])
# for clarity, also label each row by the end of its period
df['end_time_index'] = pd.cut(time_delta_in_hours, bins=n, labels=time_index[1:])

Out[89]: 
                          A       B    start_time_index      end_time_index
2015-01-01 00:00:00  1.7641  0.4002 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 08:00:00  0.9787  2.2409 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-01 16:00:00  1.8676 -0.9773 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 00:00:00  0.9501 -0.1514 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 08:00:00 -0.1032  0.4106 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-02 16:00:00  0.1440  1.4543 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 00:00:00  0.7610  0.1217 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 08:00:00  0.4439  0.3337 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-03 16:00:00  1.4941 -0.2052 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 00:00:00  0.3131 -0.8541 2015-01-01 00:00:00 2015-01-04 07:12:00
2015-01-04 08:00:00 -2.5530  0.6536 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-04 16:00:00  0.8644 -0.7422 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 00:00:00  2.2698 -1.4544 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 08:00:00  0.0458 -0.1872 2015-01-04 07:12:00 2015-01-07 14:24:00
2015-01-05 16:00:00  1.5328  1.4694 2015-01-04 07:12:00 2015-01-07 14:24:00
...                     ...     ...                 ...                 ...
2015-01-29 08:00:00  0.9209  0.3187 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-29 16:00:00  0.8568 -0.6510 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 00:00:00 -1.0342  0.6816 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 08:00:00 -0.8034 -0.6895 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-30 16:00:00 -0.4555  0.0175 2015-01-27 09:36:00 2015-01-30 16:48:00
2015-01-31 00:00:00 -0.3540 -1.3750 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 08:00:00 -0.6436 -2.2234 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-01-31 16:00:00  0.6252 -1.6021 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 00:00:00 -1.1044  0.0522 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 08:00:00 -0.7396  1.5430 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-01 16:00:00 -1.2929  0.2671 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 00:00:00 -0.0393 -1.1681 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 08:00:00  0.5233 -0.1715 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-02 16:00:00  0.7718  0.8235 2015-01-30 16:48:00 2015-02-03 00:00:00
2015-02-03 00:00:00  2.1632  1.3365 2015-01-30 16:48:00 2015-02-03 00:00:00

[100 rows x 4 columns]

# sum only the numeric columns; the categorical label columns cannot be summed
df.groupby('start_time_index')[['A', 'B']].sum()

Out[90]: 
                          A       B
start_time_index                   
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -0.0902 -2.5178
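
The same binning can also be written with explicit bin edges instead of an integer bin count. The following is a minimal sketch, not the code used above: the names offsets, edges and codes are illustrative, and it passes the np.linspace cut-off points directly to pd.cut with include_lowest=True so that the first observation is kept in the first bin.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.date_range('2015-01-01', periods=100, freq='8h')
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=idx)

n = 10
# hours elapsed since the first observation, as a plain NumPy array
offsets = ((df.index - df.index[0]) / pd.Timedelta('1h')).to_numpy()
# n equal-width bins need n + 1 edges
edges = np.linspace(0, offsets[-1], n + 1)
# labels=False returns the integer bin number (0 .. n-1) for each row
codes = pd.cut(offsets, bins=edges, labels=False, include_lowest=True)

result = df.groupby(codes).sum()
# relabel each bin with the timestamp at which it starts
result.index = df.index[0] + pd.to_timedelta(edges[:-1], unit='h')
print(result)
```

Grouping by integer codes keeps the label columns out of the DataFrame, so there is nothing non-numeric to exclude when summing.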

Another, potentially shorter, way is to pass the period length itself as the resampling frequency. The problem, as shown below, is that this delivers 11 sub-samples instead of 10. I believe the reason is that resample uses half-open bins (left-inclusive/right-exclusive by default), so the very last observation at '2015-02-03 00:00:00' sits exactly on the right edge of the 10th bin and is placed in a group of its own. With pd.cut we control the edges ourselves: an integer number of bins widens the range slightly so that both end points are included, and with explicit bin edges include_lowest=True keeps the first observation, so we get exactly 10 sub-samples rather than 11.

n = 10
period = (df.index[-1] - df.index[0]) / n   # a pd.Timedelta of 3 days 07:12:00
# select the numeric columns so the categorical label columns are not summed
df[['A', 'B']].resample(period).sum()

Out[114]: 
                          A       B
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -2.2534 -3.8543
2015-02-03 00:00:00  2.1632  1.3365
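
The extra 11th group can be checked directly. The sketch below uses illustrative names (edges, counts, step, fixed) and shows that the last timestamp coincides with the right edge of the 10th bin; it then tries one possible workaround, borrowed from the other answer, of extending the span by one sampling step (assumed here to be 8 hours) before dividing by n.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.date_range('2015-01-01', periods=100, freq='8h')
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=idx)

n = 10
period = (df.index[-1] - df.index[0]) / n
# the n + 1 bin edges implied by this period
edges = pd.date_range(df.index[0], periods=n + 1, freq=period)
print(df.index[-1] == edges[-1])   # last row lies exactly on the final edge

counts = df.resample(period).size()
print(len(counts))                 # one more bin than requested

# workaround: treat the last row as covering one more sampling step
step = pd.Timedelta('8h')          # assumed spacing between rows
fixed = df.resample((df.index[-1] + step - df.index[0]) / n).size()
print(len(fixed))
```

Note that the workaround changes the bin width (80 hours instead of 79.2 hours here), so the group boundaries differ slightly from those produced by pd.cut.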
Jianxun Li answered Nov 01 '22 10:11