Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

daily data, resample every 3 days, calculate over trailing 5 days efficiently

consider the df

tidx = pd.date_range('2012-12-31', periods=11, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
df

I want to calculate the sum over a trailing 5 days, every 3 days.

I expect something that looks like this

enter image description here

this was edited
what I had was incorrect. @ivan_pozdeev and @boud noticed this was a centered window and that was not my intention. Appologies for the confusion.
everyone's solutions capture much of what I was after.


criteria

  • I'm looking for smart efficient solutions that can be scaled to large data sets.

  • I'll be timing solutions and also considering elegance.

  • Solutions should also be generalizable for a variety of sample and look back frequencies.


from comments

  • I want a solution that generalizes to handle a look back of a specified frequency and grab anything that falls within that look back.
    • for the sample above, the look back is 5D and there may be 4 or 50 observations that fall within that look back.
  • I want the timestamp to be the last observed timestamp within the look back period.
like image 414
piRSquared Avatar asked Oct 24 '16 01:10

piRSquared


2 Answers

the df you gave us is :

             A
2012-12-31   0
2013-01-01   1
2013-01-02   2
2013-01-03   3
2013-01-04   4
2013-01-05   5
2013-01-06   6
2013-01-07   7
2013-01-08   8
2013-01-09   9
2013-01-10  10

you could create your rolling 5-day sum series and then resample it. I can't think of a more efficient way than this. overall this should be relatively time efficient.

df.rolling(5,min_periods=5).sum().dropna().resample('3D').first()
Out[36]: 
                 A
2013-01-04 10.0000
2013-01-07 25.0000
2013-01-10 40.0000
like image 131
Steven G Avatar answered Oct 24 '22 09:10

Steven G


Listed here are two three few NumPy based solutions using bin based summing covering basically three scenarios.

Scenario #1 : Multiple entries per date, but no missing dates

Approach #1 :

# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app1(df):
    # Extract the index names and values
    vals = df.A.values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()
    search_id = np.hstack((0,np.arange(2,date_id[-1],3),date_id[-1]+1))
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)

    # Perform bin based summing and subtract the repeated ones
    IDsums = np.bincount(id_arr,vals)
    allsums = IDsums[:-1] + IDsums[1:]
    allsums[1:] -= np.bincount(date_id,vals)[search_id[1:-2]]

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])

Approach #2 :

# For now hard-coded to use Window size of 5 and stride length of 3
def vectorized_app2(df):
    # Extract the index names and values
    indx = df.index.values

    # Extract IDs for bin based summing
    mask = np.append(False,indx[1:] > indx[:-1])
    date_id = mask.cumsum()

    # Generate IDs at which shifts are to happen for a (2,3,5,8..) patttern    
    # Pad with 0 and length of array at either ends as we use diff later on
    shiftIDs = (np.arange(2,date_id[-1],3)[:,None] + np.arange(2)).ravel()
    search_id = np.hstack((0,shiftIDs,date_id[-1]+1))

    # Find the start of those shifting indices    
    # Generate ID based on shifts and do bin based summing of dataframe
    shifts = np.searchsorted(date_id,search_id)
    reps = shifts[1:] - shifts[:-1]
    id_arr = np.repeat(np.arange(len(reps)),reps)    
    IDsums = np.bincount(id_arr,df.A.values)

    # Sum each group of 3 elems with a stride of 2, make dataframe if needed
    allsums = IDsums[:-1:2] + IDsums[1::2] + IDsums[2::2]    

    # Convert to pandas dataframe if needed
    out_index = indx[np.nonzero(mask)[0][3::3]] # Use last date of group
    return pd.DataFrame(allsums,index=out_index,columns=['A'])

Approach #3 :

def vectorized_app3(df, S=3, W=5):
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])

We could replace the convolution part with direct sliced summation for a modified version of it -

def vectorized_app3_v2(df, S=3, W=5):  
    dt = df.index.values
    shifts = np.append(False,dt[1:] > dt[:-1])
    c = np.bincount(shifts.cumsum(),df.A.values)
    f = c.size+S-W
    out = c[:f:S].copy()
    for i in range(1,W):
        out += c[i:f+i:S]
    out_index = dt[np.nonzero(shifts)[0][W-2::S]]
    return pd.DataFrame(out,index=out_index,columns=['A'])

Scenario #2 : Multiple entries per date and missing dates

Approach #4 :

def vectorized_app4(df, S=3, W=5):
    dt = df.index.values
    indx = np.append(0,((dt[1:] - dt[:-1])//86400000000000).astype(int)).cumsum()
    WL = ((indx[-1]+1)//S)
    c = np.bincount(indx,df.A.values,minlength=S*WL+(W-S))
    out = np.convolve(c,np.ones(W,dtype=int),'valid')[::S]
    grp0_lastdate = dt[0] + np.timedelta64(W-1,'D')
    freq_str = str(S)+'D'
    grp_last_dt = pd.date_range(grp0_lastdate, periods=WL, freq=freq_str).values
    out_index = dt[dt.searchsorted(grp_last_dt,'right')-1]
    return pd.DataFrame(out,index=out_index,columns=['A'])

Scenario #3 : Consecutive dates and exactly one entry per date

Approach #5 :

def vectorized_app5(df, S=3, W=5):
    vals = df.A.values
    N = (df.shape[0]-W+2*S-1)//S
    n = vals.strides[0]
    out = np.lib.stride_tricks.as_strided(vals,shape=(N,W),\
                                        strides=(S*n,n)).sum(1)
    index_idx = (W-1)+S*np.arange(N)
    out_index = df.index[index_idx]
    return pd.DataFrame(out,index=out_index,columns=['A'])

Suggestions for creating test-data

Scenario #1 :

# Setup input for multiple dates, but no missing dates
S = 4 # Stride length (Could be edited)
W = 7 # Window length (Could be edited)
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
start_df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
reps = np.random.randint(1,4,(len(start_df)))
idx0 = np.repeat(start_df.index,reps)
df_data = np.random.randint(0,9,(len(idx0)))
df = pd.DataFrame(df_data,index=idx0,columns=['A'])

Scenario #2 :

To create setup for multiple dates and with missing dates, we could just edit the df_data creation step, like so -

df_data = np.random.randint(0,9,(len(idx0)))

Scenario #3 :

# Setup input for exactly one entry per date
S = 4 # Could be edited
W = 7
datasize = 3  # Decides datasize
tidx = pd.date_range('2012-12-31', periods=datasize*S + W-S, freq='D')
df = pd.DataFrame(dict(A=np.arange(len(tidx))), tidx)
like image 26
Divakar Avatar answered Oct 24 '22 11:10

Divakar