Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Rolling Computations on Sliding Windows (Unevenly spaced)

Tags:

python

pandas

Consider you've got some unevenly time series data:

import pandas as pd
import random as randy
ts = pd.Series(range(1000),index=randy.sample(pd.date_range('2013-02-01 09:00:00.000000',periods=1e6,freq='U'),1000)).sort_index()
print ts.head()


2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499
2013-02-01 09:00:00.003838    797
2013-02-01 09:00:00.004727    295
2013-02-01 09:00:00.006287    253

Let's say I wanted to do the rolling sum over a 1ms window to get this:

2013-02-01 09:00:00.002895    995
2013-02-01 09:00:00.003765    499 + 995
2013-02-01 09:00:00.003838    797 + 499 + 995
2013-02-01 09:00:00.004727    295 + 797 + 499
2013-02-01 09:00:00.006287    253

Currently, I cast everything back to longs and do this in cython, but is this possible in pure pandas? I'm aware that you can do something like .asfreq('U') and then fill and use the traditional functions but this doesn't scale once you've got more than a toy # of rows.

As a point of reference, here's a hackish, not fast Cython version:

%%cython
import numpy as np
cimport cython
cimport numpy as np

ctypedef np.double_t DTYPE_t

def rolling_sum_cython(np.ndarray[long,ndim=1] times, np.ndarray[double,ndim=1] to_add, long window_size):
    cdef long t_len = times.shape[0], s_len = to_add.shape[0], i =0, win_size = window_size, t_diff, j, window_start
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.zeros(t_len, dtype=np.double)
    assert(t_len==s_len)
    for i in range(0,t_len):
        window_start = times[i] - win_size
        j = i
        while times[j]>= window_start and j>=0:
            res[i] += to_add[j]
            j-=1
    return res   

Demonstrating this on a slightly larger series:

ts = pd.Series(range(100000),index=randy.sample(pd.date_range('2013-02-01 09:00:00.000000',periods=1e8,freq='U'),100000)).sort_index()

%%timeit
res2 = rolling_sum_cython(ts.index.astype(int64),ts.values.astype(double),long(1e6))
1000 loops, best of 3: 1.56 ms per loop
like image 329
radikalus Avatar asked Jan 31 '13 16:01

radikalus


3 Answers

You can solve most problems of this sort with cumsum and binary search.

from datetime import timedelta

def msum(s, lag_in_ms):
    lag = s.index - timedelta(milliseconds=lag_in_ms)
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = s.cumsum()
    return pd.Series(cs.values - cs[inds].values + s[inds].values, index=s.index)

res = msum(ts, 100)
print pd.DataFrame({'a': ts, 'a_msum_100': res})


                            a  a_msum_100
2013-02-01 09:00:00.073479  5           5
2013-02-01 09:00:00.083717  8          13
2013-02-01 09:00:00.162707  1          14
2013-02-01 09:00:00.171809  6          20
2013-02-01 09:00:00.240111  7          14
2013-02-01 09:00:00.258455  0          14
2013-02-01 09:00:00.336564  2           9
2013-02-01 09:00:00.536416  3           3
2013-02-01 09:00:00.632439  4           7
2013-02-01 09:00:00.789746  9           9

[10 rows x 2 columns]

You need a way of handling NaNs and depending on your application, you may need the prevailing value asof the lagged time or not (ie difference between using kdb+ bin vs np.searchsorted).

Hope this helps.

like image 176
signalseeker Avatar answered Oct 22 '22 09:10

signalseeker


This is an old question, but for those who stumble upon this from google: in pandas 0.19 this is built-in as the function

http://pandas.pydata.org/pandas-docs/stable/computation.html#time-aware-rolling

So to get 1 ms windows it looks like you get a Rolling object by doing

dft.rolling('1ms')

and the sum would be

dft.rolling('1ms').sum()
like image 41
Kevin Wang Avatar answered Oct 22 '22 09:10

Kevin Wang


Perhaps it makes more sense to use rolling_sum:

pd.rolling_sum(ts, window=1, freq='1ms')
like image 1
Andy Hayden Avatar answered Oct 22 '22 08:10

Andy Hayden