
Pandas - Rolling window - uneven interval

I'm having an issue working out a rolling count of transactions applicable to each individual buyer in this dataset structured as follows:

userID  itemID      transaction_ts
3229    4493320     2016-01-02 14:55:00
3229    4492492     2016-01-02 14:57:02
3229    4496756     2016-01-04 09:01:18
3229    4493673     2016-01-04 09:11:10
3229    4497531     2016-01-04 11:05:25
3229    4495006     2016-01-05 07:25:11
4330    4500695     2016-01-02 09:17:21
4330    4500656     2016-01-03 09:19:28
4330    4503087     2016-01-04 07:42:15
4330    4501846     2016-01-04 08:55:24
4330    4504105     2016-01-04 09:59:35

Ideally, it would look like the below for a rolling transaction count window of e.g. 24 hours:

userID  itemID      transaction_ts        rolling_count
3229    4493320     2016-01-02 14:55:00         1
3229    4492492     2016-01-02 14:57:02         2
3229    4496756     2016-01-04 09:01:18         1
3229    4493673     2016-01-04 09:11:10         2
3229    4497531     2016-01-04 11:05:25         3
3229    4495006     2016-01-05 07:25:11         4
4330    4500695     2016-01-02 09:17:21         1
4330    4500656     2016-01-03 09:19:28         1
4330    4503087     2016-01-04 07:42:15         2 
4330    4501846     2016-01-04 08:55:24         3
4330    4504105     2016-01-04 09:59:35         3

There is an excellent answer to a similar problem here: pandas rolling sum of last five minutes

However, that answer depends solely on the timestamp field, unlike the above, where the rolling count must reset to 1 upon encountering a transaction from a different user than the row above. It is possible to find a solution via slicing, but given the size of this dataset (potentially 1m+ rows) that is not feasible.

Crucially, the window should reflect the 24-hour period prior to the transaction_ts of the respective row, which is why I think a custom df.apply or rolling window method is appropriate; I just can't figure out how to make it conditional on the userID.

tompiler asked May 13 '16 18:05

1 Answer

Part of the solution (a rolling cumulative sum) may already be here. (I only changed the type of lag):

from datetime import timedelta

import numpy as np
import pandas as pd

def msum(s, lag):
    # s: Series with a sorted DatetimeIndex; lag: window length in days.
    # Position of the first observation inside each row's look-back window:
    cutoff = s.index - timedelta(days=lag)
    inds = np.searchsorted(s.index.astype(np.int64), cutoff.astype(np.int64))
    # Windowed sum = cumsum at row - cumsum just before the window start
    cs = s.cumsum()
    return pd.Series(cs.values - cs.iloc[inds].values + s.iloc[inds].values,
                     index=s.index)

The function requires the index to be of datetime type. Moreover, the index within each userID group must already be sorted (as it is in your example).

df = df.set_index('transaction_ts')
df['rolling_count'] = 1
df['rolling_count'] = df.groupby('userID', sort=False)['rolling_count'].transform(lambda x: msum(x, 1))
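Putting the pieces together, here is a self-contained run on the question's sample data (the msum helper is repeated so the snippet stands alone), which reproduces the expected counts:

```python
from datetime import timedelta

import numpy as np
import pandas as pd

def msum(s, lag):
    # s: Series with a sorted DatetimeIndex; lag: window length in days
    cutoff = s.index - timedelta(days=lag)
    inds = np.searchsorted(s.index.astype(np.int64), cutoff.astype(np.int64))
    cs = s.cumsum()
    return pd.Series(cs.values - cs.iloc[inds].values + s.iloc[inds].values,
                     index=s.index)

# Sample data from the question
df = pd.DataFrame({
    'userID': [3229] * 6 + [4330] * 5,
    'itemID': [4493320, 4492492, 4496756, 4493673, 4497531, 4495006,
               4500695, 4500656, 4503087, 4501846, 4504105],
    'transaction_ts': pd.to_datetime([
        '2016-01-02 14:55:00', '2016-01-02 14:57:02',
        '2016-01-04 09:01:18', '2016-01-04 09:11:10',
        '2016-01-04 11:05:25', '2016-01-05 07:25:11',
        '2016-01-02 09:17:21', '2016-01-03 09:19:28',
        '2016-01-04 07:42:15', '2016-01-04 08:55:24',
        '2016-01-04 09:59:35']),
})

df = df.set_index('transaction_ts')
df['rolling_count'] = 1
df['rolling_count'] = (df.groupby('userID', sort=False)['rolling_count']
                         .transform(lambda x: msum(x, 1)))
print(df['rolling_count'].tolist())  # [1, 2, 1, 2, 3, 4, 1, 1, 2, 3, 3]
```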

The groupby option sort=False may give some speedup, since it skips sorting the group keys.
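For what it's worth, newer pandas versions (0.19+) support time-based rolling windows directly, which can replace the manual searchsorted bookkeeping. A sketch; note the default window is (t - 24h, t], which differs from the searchsorted approach only when two transactions are exactly 24 hours apart:

```python
import pandas as pd

df = pd.DataFrame({
    'userID': [3229] * 6 + [4330] * 5,
    'itemID': [4493320, 4492492, 4496756, 4493673, 4497531, 4495006,
               4500695, 4500656, 4503087, 4501846, 4504105],
    'transaction_ts': pd.to_datetime([
        '2016-01-02 14:55:00', '2016-01-02 14:57:02',
        '2016-01-04 09:01:18', '2016-01-04 09:11:10',
        '2016-01-04 11:05:25', '2016-01-05 07:25:11',
        '2016-01-02 09:17:21', '2016-01-03 09:19:28',
        '2016-01-04 07:42:15', '2016-01-04 08:55:24',
        '2016-01-04 09:59:35']),
})

# Rows must be sorted by user and time so the result aligns positionally
df = df.sort_values(['userID', 'transaction_ts'])
rolled = (df.set_index('transaction_ts')
            .groupby('userID')['itemID']
            .rolling('24h')   # look back 24 hours from each row
            .count())
df['rolling_count'] = rolled.astype(int).values
print(df['rolling_count'].tolist())  # [1, 2, 1, 2, 3, 4, 1, 1, 2, 3, 3]
```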

ptrj answered Oct 13 '22 23:10