Is there a 'cookbook' way of resampling a DataFrame with (semi)irregular periods?
I have a dataset at a daily interval and want to resample it to what the scientific literature sometimes calls dekads. I don't think there is a widely used English term for it, but it basically means chopping a month into three ~ten-day parts, where the third is a remainder of anything between 8 and 11 days.
I came up with two solutions myself, a specific one for this case and a more general one for any irregular periods. But neither is really good, so I'm curious how others handle these kinds of situations.
Let's start by creating some sample data:
import datetime
import numpy as np
import pandas as pd

begin = pd.Timestamp(2013, 1, 1)
end = pd.Timestamp(2013, 2, 20)
dtrange = pd.date_range(begin, end)
p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10
df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)
The first thing I came up with is grouping by individual months (YYYYMM) and then slicing each group manually. Like:
def to_dec1(data, func):
    # start dates of the three ~10-day periods within this month
    idx1 = pd.Timestamp(data.index[0].year, data.index[0].month, 1)
    idx2 = idx1 + datetime.timedelta(days=10)
    idx3 = idx2 + datetime.timedelta(days=10)
    # slice each period and apply the aggregation function
    oneday = datetime.timedelta(days=1)
    fir = func(data.loc[:idx2 - oneday].values, axis=0)
    sec = func(data.loc[idx2:idx3 - oneday].values, axis=0)
    thi = func(data.loc[idx3:].values, axis=0)
    return pd.DataFrame([fir, sec, thi], index=[idx1, idx2, idx3],
                        columns=data.columns)
dfmean = df.groupby(lambda x: x.strftime('%Y%m'), group_keys=False).apply(to_dec1, np.mean)
Which results in:
print(dfmean)
p1 p2
2013-01-01 5.436778 10.409845
2013-01-11 5.534509 10.482231
2013-01-21 5.449058 10.454777
2013-02-01 5.685700 10.422697
2013-02-11 5.578137 10.532180
2013-02-21 NaN NaN
Note that you always get a full month of dekads in return; that's not a problem, and the extra rows are easy to remove if needed.
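For contrast, a more compact variant of the same grouping (my sketch, not part of the original question) maps each timestamp straight to the start date of its dekad and groups on that, skipping the manual slicing; `dekad_start` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def dekad_start(ts):
    # days 1-10 -> 1, 11-20 -> 11, 21-end -> 21
    day = 1 + 10 * min((ts.day - 1) // 10, 2)
    return pd.Timestamp(ts.year, ts.month, day)

dtrange = pd.date_range("2013-01-01", "2013-02-20")
df = pd.DataFrame({"p1": np.random.rand(len(dtrange)) + 5,
                   "p2": np.random.rand(len(dtrange)) + 10},
                  index=dtrange)

# groupby called with a function applies it to each index label
dfmean = df.groupby(dekad_start).mean()
```

Unlike to_dec1, this only returns dekads that actually contain data, so there is no trailing NaN row.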
The other solution works by providing a range of dates at which to chop up the DataFrame, performing a function on each segment. It's more flexible in terms of the periods you want.
def to_dec2(data, dts, func):
    # aggregate each [dts[n], dts[n+1]) segment with func
    chunks = []
    for n, start in enumerate(dts[:-1]):
        end = dts[n + 1] - datetime.timedelta(days=1)
        chunks.append(func(data.loc[start:end].values, axis=0))
    return pd.DataFrame(chunks, index=dts[:-1], columns=data.columns)
dfmean2 = to_dec2(df, dfmean.index, np.mean)
Note that I'm using the index of the previous result as the range of dates, to save some time 'building' it myself.
What would be the best way of handling these cases? Is there perhaps a more built-in method in pandas?
If you use numpy 1.7 or later, you can use datetime64 and timedelta64 arrays to do the calculation:
Create the sample data:
import numpy as np
import pandas as pd

begin = pd.Timestamp(2013, 1, 1)
end = pd.Timestamp(2013, 2, 20)
dtrange = pd.date_range(begin, end)
p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10
df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)
Calculate the dekad start date for each day:
d = df.index.day - np.clip((df.index.day-1) // 10, 0, 2)*10 - 1
date = df.index.values - np.array(d, dtype="timedelta64[D]")
df.groupby(date).mean()
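To see what that expression does, here is a quick check (my addition) of the offset d it yields for a few representative days of a month; d is the number of days elapsed since the start of the current dekad:

```python
import numpy as np

day = np.array([1, 10, 11, 20, 21, 28, 31])
# clip caps the dekad number at 2, so days past the 21st
# all fall into the third, variable-length dekad
d = day - np.clip((day - 1) // 10, 0, 2) * 10 - 1
print(d)  # -> [ 0  9  0  9  0  7 10]
```

Subtracting d days from each date therefore snaps it back to the 1st, 11th, or 21st of its month.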
The output is:
p1 p2
2013-01-01 5.413795 10.445640
2013-01-11 5.516063 10.491339
2013-01-21 5.539676 10.528745
2013-02-01 5.783467 10.478001
2013-02-11 5.358787 10.579149
Using HYRY's data and solution up to the computation of the d variable, we can also do the following in pandas 0.11-dev or later (regardless of numpy version):
In [18]: from datetime import timedelta
In [23]: pd.Series([ timedelta(int(i)) for i in d ])
Out[23]:
0 00:00:00
1 1 days, 00:00:00
2 2 days, 00:00:00
3 3 days, 00:00:00
4 4 days, 00:00:00
5 5 days, 00:00:00
6 6 days, 00:00:00
7 7 days, 00:00:00
8 8 days, 00:00:00
9 9 days, 00:00:00
10 00:00:00
...
47 6 days, 00:00:00
48 7 days, 00:00:00
49 8 days, 00:00:00
50 9 days, 00:00:00
Length: 51, dtype: timedelta64[ns]
The date is constructed similarly to the above:
date = pd.Series(df.index) - pd.Series([ timedelta(int(i)) for i in d ])
df.groupby(date.values).mean()
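On recent pandas versions the list comprehension is no longer needed: pd.to_timedelta builds the same offsets vectorised. A self-contained sketch (recreating the df and d from the answers above):

```python
import numpy as np
import pandas as pd

dtrange = pd.date_range("2013-01-01", "2013-02-20")
df = pd.DataFrame({"p1": np.random.rand(len(dtrange)) + 5,
                   "p2": np.random.rand(len(dtrange)) + 10},
                  index=dtrange)

# days elapsed since the start of each row's dekad
d = df.index.day - np.clip((df.index.day - 1) // 10, 0, 2) * 10 - 1
# snap every date back to its dekad start and average per dekad
date = df.index - pd.to_timedelta(d, unit="D")
dfmean = df.groupby(date).mean()
```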