
Resampling with custom periods

Tags: python, pandas

Is there a 'cookbook' way of resampling a DataFrame with (semi)irregular periods?

I have a daily dataset and want to resample it to what the scientific literature sometimes calls dekads. I don't think there is a proper English term for it, but it basically means chopping a month into three parts of roughly ten days each, where the third part is the remainder of the month and lasts anywhere between 8 and 11 days.
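
To make the definition concrete, here is a minimal sketch of mapping a single date to the start of its dekad. The helper name `dekad_start` is hypothetical and only used for illustration; it is not part of pandas.

```python
import pandas as pd

def dekad_start(ts):
    """Return the first day of the dekad containing ts (hypothetical helper)."""
    ts = pd.Timestamp(ts)
    # Days 1-10 start at day 1, days 11-20 at day 11, days 21 to month end at day 21.
    first_day = min((ts.day - 1) // 10, 2) * 10 + 1
    return ts.replace(day=first_day).normalize()

# The third dekad absorbs whatever is left of the month.
print(dekad_start("2013-01-31"))  # 2013-01-21 00:00:00
print(dekad_start("2013-02-15"))  # 2013-02-11 00:00:00
```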

I came up with two solutions myself: a specific one for this case and a more general one for any irregular periods. Neither is really good, so I'm curious how others handle this type of situation.

Let's start by creating some sample data:

import datetime

import numpy as np
import pandas as pd

begin = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2013, 2, 20)

# one row per day between begin and end (inclusive)
dtrange = pd.date_range(begin, end)

p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10

df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)

The first thing I came up with is grouping by individual months (YYYYMM) and then slicing each month manually, like this:

def to_dec1(data, func):

    # create the indexes, start of each ~10-day period
    idx1 = datetime.datetime(data.index[0].year, data.index[0].month, 1)
    idx2 = idx1 + datetime.timedelta(days=10)
    idx3 = idx2 + datetime.timedelta(days=10)

    # slice each period and perform the function
    # (label slices include their end point, hence the one-day step back)
    oneday = datetime.timedelta(days=1)
    fir = func(data.loc[:idx2 - oneday].values, axis=0)
    sec = func(data.loc[idx2:idx3 - oneday].values, axis=0)
    thi = func(data.loc[idx3:].values, axis=0)

    return pd.DataFrame([fir, sec, thi], index=[idx1, idx2, idx3], columns=data.columns)

dfmean = df.groupby(lambda x: x.strftime('%Y%m'), group_keys=False).apply(to_dec1, np.mean)

Which results in:

print(dfmean)

                  p1         p2
2013-01-01  5.436778  10.409845
2013-01-11  5.534509  10.482231
2013-01-21  5.449058  10.454777
2013-02-01  5.685700  10.422697
2013-02-11  5.578137  10.532180
2013-02-21       NaN        NaN

Note that you always get a full month of dekads in return. That's not a problem, and the empty ones are easy to remove if needed.
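
As a minimal sketch of that removal step, assuming the result has the same shape as the output above: rows in which every column is NaN can be dropped with `dropna(how="all")`. The values in the small frame below are rounded placeholders for illustration, not real results.

```python
import numpy as np
import pandas as pd

# Placeholder result for one month; the last dekad has no data.
idx = pd.to_datetime(["2013-02-01", "2013-02-11", "2013-02-21"])
result = pd.DataFrame(
    {"p1": [5.7, 5.6, np.nan], "p2": [10.4, 10.5, np.nan]},
    index=idx,
)

# Drop only the rows where all columns are missing.
trimmed = result.dropna(how="all")
print(trimmed)
```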

The other solution takes a range of dates at which to chop up the DataFrame and applies a function to each segment. It's more flexible in terms of the periods you want.

def to_dec2(data, dts, func):

    chunks = []
    for n, start in enumerate(dts[:-1]):

        # each segment runs up to the day before the next start date
        end = dts[n+1] - datetime.timedelta(days=1)
        chunks.append(func(data.loc[start:end].values, axis=0))

    return pd.DataFrame(chunks, index=dts[:-1], columns=data.columns)

dfmean2 = to_dec2(df, dfmean.index, np.mean)

Note that I'm using the index of the previous result as the range of dates, to save some time 'building' it myself.

What would be the best way of handling these cases? Is there perhaps a more built-in method in pandas?
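
For comparison, the same "chop at given dates" idea can be sketched without an explicit loop by assigning every row to a bin with `np.searchsorted` and then grouping. This is an alternative illustration, not the code above: the function name `resample_by_edges` and the half-open bin convention are assumptions, and rows on or after the last edge are dropped, just as the loop above stops at `dts[:-1]`.

```python
import numpy as np
import pandas as pd

def resample_by_edges(data, edges, func="mean"):
    """Aggregate rows into half-open bins [edges[i], edges[i+1]) (hypothetical helper)."""
    edges = pd.DatetimeIndex(edges)
    # Index of the bin each timestamp falls into; -1 means before the first edge.
    pos = np.searchsorted(edges.values, data.index.values, side="right") - 1
    # Keep only rows that fall inside one of the bins.
    inside = (pos >= 0) & (pos < len(edges) - 1)
    labels = edges[pos[inside]]
    return data[inside].groupby(labels).agg(func)

rng = pd.date_range("2013-01-01", "2013-02-20")
df = pd.DataFrame({"p1": np.arange(len(rng), dtype=float)}, index=rng)

edges = pd.to_datetime(["2013-01-01", "2013-01-11", "2013-01-21", "2013-02-01"])
out = resample_by_edges(df, edges)
print(out)
```

Because the bin edges are arbitrary, the same call works for any irregular periods, as long as the edges are sorted.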

Rutger Kassies asked Mar 14 '13 11:03


2 Answers

If you use NumPy 1.7, you can use datetime64 and timedelta64 arrays to do the calculation.

Create the sample data:

import pandas as pd
import numpy as np

begin = pd.Timestamp(2013, 1, 1)
end = pd.Timestamp(2013, 2, 20)

dtrange = pd.date_range(begin, end)

p1 = np.random.rand(len(dtrange)) + 5
p2 = np.random.rand(len(dtrange)) + 10

df = pd.DataFrame({'p1': p1, 'p2': p2}, index=dtrange)

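Calculate the start date of each dekad:

# d is the number of days between each date and the first day of its dekad
d = df.index.day - np.clip((df.index.day-1) // 10, 0, 2)*10 - 1
# step back by that offset so every date maps to the start of its dekad
date = df.index.values - np.array(d, dtype="timedelta64[D]")
df.groupby(date).mean()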

The output is:

                 p1         p2
2013-01-01  5.413795  10.445640
2013-01-11  5.516063  10.491339
2013-01-21  5.539676  10.528745
2013-02-01  5.783467  10.478001
2013-02-11  5.358787  10.579149
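
To see why the offset formula works, here is a small worked check on a handful of day-of-month values. It uses plain NumPy and the same expression as above; the chosen days are only illustrative.

```python
import numpy as np

day = np.array([1, 10, 11, 20, 21, 30, 31])

# Offset in days from each date back to the first day of its dekad.
# np.clip caps the dekad number at 2, so day 31 stays in the third dekad.
d = day - np.clip((day - 1) // 10, 0, 2) * 10 - 1

print(d.tolist())          # [0, 9, 0, 9, 0, 9, 10]
print((day - d).tolist())  # [1, 1, 11, 11, 21, 21, 21]
```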

HYRY answered Oct 19 '22 18:10


Using HYRY's data and solution up to the computation of the d variable, we can also do the following in pandas 0.11-dev or later (regardless of numpy version):

In [18]: from datetime import timedelta

In [23]: pd.Series([ timedelta(int(i)) for i in d ])
Out[23]: 
0             00:00:00
1     1 days, 00:00:00
2     2 days, 00:00:00
3     3 days, 00:00:00
4     4 days, 00:00:00
5     5 days, 00:00:00
6     6 days, 00:00:00
7     7 days, 00:00:00
8     8 days, 00:00:00
9     9 days, 00:00:00
10            00:00:00

47    6 days, 00:00:00
48    7 days, 00:00:00
49    8 days, 00:00:00
50    9 days, 00:00:00
Length: 51, dtype: timedelta64[ns]

The date is then constructed similarly to the above:

date = pd.Series(df.index) - pd.Series([ timedelta(int(i)) for i in d ])
df.groupby(date.values).mean()
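
In current pandas releases the list comprehension can probably be replaced by `pd.to_timedelta`, which converts the whole offset array at once. The sketch below assumes a recent pandas version (it makes no claim about 0.11-dev) and uses a column of ones so that the group sizes are easy to check.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2013-01-01", "2013-02-20")
df = pd.DataFrame({"p1": np.ones(len(rng))}, index=rng)

# Same offset as the d variable above, computed on a plain integer array.
day = np.asarray(df.index.day)
d = day - np.clip((day - 1) // 10, 0, 2) * 10 - 1

# Subtract the offsets as a TimedeltaIndex to get each row's dekad start.
date = df.index - pd.to_timedelta(d, unit="D")
counts = df.groupby(date).size()
print(counts)
```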

Jeff answered Oct 19 '22 18:10