Say you've got some unevenly spaced time series data:
import pandas as pd
import random as randy

# 1,000 points at random microsecond timestamps, sorted by time
ts = pd.Series(range(1000),
               index=randy.sample(list(pd.date_range('2013-02-01 09:00:00.000000',
                                                     periods=int(1e6), freq='U')),
                                  1000)).sort_index()
print(ts.head())
2013-02-01 09:00:00.002895 995
2013-02-01 09:00:00.003765 499
2013-02-01 09:00:00.003838 797
2013-02-01 09:00:00.004727 295
2013-02-01 09:00:00.006287 253
Let's say I wanted to do the rolling sum over a 1ms window to get this:
2013-02-01 09:00:00.002895 995
2013-02-01 09:00:00.003765 499 + 995
2013-02-01 09:00:00.003838 797 + 499 + 995
2013-02-01 09:00:00.004727 295 + 797 + 499
2013-02-01 09:00:00.006287 253
Currently, I cast everything back to longs and do this in Cython, but is this possible in pure pandas? I'm aware that you can do something like .asfreq('U'), fill, and then use the traditional rolling functions, but this doesn't scale once you've got more than a toy number of rows.
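For concreteness, that upsample-and-fill approach would look something like the sketch below (assuming the modern .rolling API rather than the old rolling_* functions); the dense microsecond grid is exactly what makes it fall over:

# upsample to one row per microsecond and zero-fill the gaps --
# ~1e6 rows even for this toy series, hence the scaling problem
dense = ts.asfreq('U').fillna(0)
# 1001 microsecond rows ~= a 1 ms window closed on both ends
rolled = dense.rolling(1001, min_periods=1).sum()
rolled = rolled.loc[ts.index]   # back to the original irregular timestamps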
As a point of reference, here's a hackish, not-especially-fast Cython version:
%%cython
import numpy as np
cimport cython
cimport numpy as np

ctypedef np.double_t DTYPE_t

def rolling_sum_cython(np.ndarray[long, ndim=1] times,
                       np.ndarray[double, ndim=1] to_add,
                       long window_size):
    cdef long t_len = times.shape[0], s_len = to_add.shape[0], i, j, win_size = window_size, window_start
    cdef np.ndarray[DTYPE_t, ndim=1] res = np.zeros(t_len, dtype=np.double)
    assert t_len == s_len
    for i in range(t_len):
        window_start = times[i] - win_size
        # walk backwards, accumulating everything in [window_start, times[i]];
        # check j >= 0 first so we never wrap around and read times[-1]
        j = i
        while j >= 0 and times[j] >= window_start:
            res[i] += to_add[j]
            j -= 1
    return res
Demonstrating this on a slightly larger series:
# sample 100,000 grid positions rather than materializing 10^8 Timestamps
grid = pd.date_range('2013-02-01 09:00:00.000000', periods=int(1e8), freq='U')
ts = pd.Series(range(100000),
               index=grid[sorted(randy.sample(range(int(1e8)), 100000))])
%%timeit
res2 = rolling_sum_cython(ts.index.values.astype(np.int64),  # timestamps as ns since the epoch
                          ts.values.astype(np.double),
                          int(1e6))                           # 1 ms window = 1e6 ns

1000 loops, best of 3: 1.56 ms per loop
You can solve most problems of this sort with cumsum and binary search.
import numpy as np
from datetime import timedelta

def msum(s, lag_in_ms):
    # earliest timestamp still inside each row's window
    lag = s.index - timedelta(milliseconds=lag_in_ms)
    # binary search: first position whose time is >= the lagged time
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = s.cumsum()
    # sum over s[inds[i]..i] == cs[i] - cs[inds[i]] + s[inds[i]]
    return pd.Series(cs.values - cs.values[inds] + s.values[inds], index=s.index)
Applying this (the output shown here is from a small 10-element example series):

res = msum(ts, 100)
print(pd.DataFrame({'a': ts, 'a_msum_100': res}))
a a_msum_100
2013-02-01 09:00:00.073479 5 5
2013-02-01 09:00:00.083717 8 13
2013-02-01 09:00:00.162707 1 14
2013-02-01 09:00:00.171809 6 20
2013-02-01 09:00:00.240111 7 14
2013-02-01 09:00:00.258455 0 14
2013-02-01 09:00:00.336564 2 9
2013-02-01 09:00:00.536416 3 3
2013-02-01 09:00:00.632439 4 7
2013-02-01 09:00:00.789746 9 9
[10 rows x 2 columns]
You need a way of handling NaNs, and, depending on your application, you may or may not want the prevailing value as of the lagged time (i.e. the difference between using kdb+'s bin and np.searchsorted).
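For the NaN part, one option (a sketch only; msum_nan is a hypothetical variant, and zero-filling is just one convention) is to treat NaNs as contributing zero, so a single NaN doesn't propagate through the cumsum into every later window:

def msum_nan(s, lag_in_ms):
    # same cumsum-and-searchsorted idea as msum, with NaNs zeroed out
    filled = s.fillna(0.0)
    lag = s.index - timedelta(milliseconds=lag_in_ms)
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = filled.cumsum()
    return pd.Series(cs.values - cs.values[inds] + filled.values[inds], index=s.index)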
Hope this helps.
This is an old question, but for those who stumble upon it from Google: in pandas 0.19 this is built in as time-aware rolling; see http://pandas.pydata.org/pandas-docs/stable/computation.html#time-aware-rolling
So to get 1 ms windows, you get a Rolling object by doing

ts.rolling('1ms')

and the sum would be

ts.rolling('1ms').sum()
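Applied to the series from the question, that looks like the sketch below. One caveat worth flagging: for offset-based windows the default is closed='right', i.e. the window is (t - 1ms, t], so results can differ from the closed-both-ends Cython version above at exact 1 ms boundaries:

# time-aware rolling sum over the irregular index (pandas >= 0.19)
res = ts.rolling('1ms').sum()
print(pd.DataFrame({'a': ts, 'rolling_1ms': res}).head())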
Perhaps it makes more sense to use rolling_sum:

pd.rolling_sum(ts, window=1, freq='1ms')
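Note that the pd.rolling_* functions were deprecated in pandas 0.18 and later removed. The freq argument conformed the data to a regular 1 ms grid before computing, so on modern pandas the closest analogue is something like the sketch below (an approximation; the original's exact binning convention may have differed):

# conform the irregular series to a regular 1 ms grid, then take the
# row-count window sum; with window=1 this reduces to the binned sums
res = ts.resample('1ms').sum().rolling(window=1).sum()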