I'm having trouble working out a rolling count of transactions per individual buyer in a dataset structured as follows:
userID itemID transaction_ts
3229 4493320 2016-01-02 14:55:00
3229 4492492 2016-01-02 14:57:02
3229 4496756 2016-01-04 09:01:18
3229 4493673 2016-01-04 09:11:10
3229 4497531 2016-01-04 11:05:25
3229 4495006 2016-01-05 07:25:11
4330 4500695 2016-01-02 09:17:21
4330 4500656 2016-01-03 09:19:28
4330 4503087 2016-01-04 07:42:15
4330 4501846 2016-01-04 08:55:24
4330 4504105 2016-01-04 09:59:35
Ideally, the output would look like the below for a rolling transaction-count window of e.g. 24 hours:
userID itemID transaction_ts rolling_count
3229 4493320 2016-01-02 14:55:00 1
3229 4492492 2016-01-02 14:57:02 2
3229 4496756 2016-01-04 09:01:18 1
3229 4493673 2016-01-04 09:11:10 2
3229 4497531 2016-01-04 11:05:25 3
3229 4495006 2016-01-05 07:25:11 4
4330 4500695 2016-01-02 09:17:21 1
4330 4500656 2016-01-03 09:19:28 1
4330 4503087 2016-01-04 07:42:15 2
4330 4501846 2016-01-04 08:55:24 3
4330 4504105 2016-01-04 09:59:35 3
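For reference, a minimal construction of the sample frame above (values taken from the table):

```python
import pandas as pd

df = pd.DataFrame({
    'userID': [3229] * 6 + [4330] * 5,
    'itemID': [4493320, 4492492, 4496756, 4493673, 4497531, 4495006,
               4500695, 4500656, 4503087, 4501846, 4504105],
    'transaction_ts': pd.to_datetime([
        '2016-01-02 14:55:00', '2016-01-02 14:57:02',
        '2016-01-04 09:01:18', '2016-01-04 09:11:10',
        '2016-01-04 11:05:25', '2016-01-05 07:25:11',
        '2016-01-02 09:17:21', '2016-01-03 09:19:28',
        '2016-01-04 07:42:15', '2016-01-04 08:55:24',
        '2016-01-04 09:59:35']),
})
```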
There is an excellent answer to a similar problem here: pandas rolling sum of last five minutes
However, that answer depends solely on the timestamp field, whereas here the rolling count must also reset to 1 when a transaction belongs to a different user from the row above. A solution via slicing is possible, but given the size of this dataset (potentially 1m+ rows) it is not feasible.
Crucially, the window should reflect the 24-hour period prior to the transaction_ts of the respective row, which is why I think a custom df.apply or rolling-window method is appropriate; I just can't figure out how to make it conditional on the userID.
A part of the solution (a rolling cumsum) may already be here. (I only changed the type of lag):
from datetime import timedelta

import numpy as np
import pandas as pd

def msum(s, lag):
    # For each timestamp, locate the first observation that falls
    # inside the trailing window of `lag` days.
    lag = s.index - timedelta(days=lag)
    inds = np.searchsorted(s.index.astype(np.int64), lag.astype(np.int64))
    cs = s.cumsum()
    # Windowed sum = running total at the current row, minus the running
    # total just before the window start (s.values[inds] adds back the
    # first in-window element, which cs.values[inds] removed).
    return pd.Series(cs.values - cs.values[inds] + s.values[inds], index=s.index)
The function requires the index to be of datetime type. Moreover, within each userID group the index should already be sorted (as in your example).
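To see what the np.searchsorted step inside msum does, here is a toy run on the first three timestamps of user 3229 from the example: for each timestamp it returns the position of the earliest observation not more than one day older.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2016-01-02 14:55:00',
                      '2016-01-02 14:57:02',
                      '2016-01-04 09:01:18'])
lag = idx - pd.Timedelta(days=1)
# Position of the first index value >= (t - 1 day), per timestamp
positions = np.searchsorted(idx.astype(np.int64), lag.astype(np.int64))
# → [0 0 2]: the first two rows see the whole history so far,
#   the third row's window starts at itself.
```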
df = df.set_index('transaction_ts')
# A column of ones: summing them over the trailing window yields the count.
df['rolling_count'] = 1
df['rolling_count'] = df.groupby('userID', sort=False)['rolling_count'].transform(lambda x: msum(x, 1))
The groupby option sort=False may give some speedup, since it skips sorting the group keys.
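As an aside, more recent pandas versions (0.19+) support time-based rolling windows directly, so the whole computation can be sketched without a custom function. This assumes the frame has the three columns shown in the question; rows must be sorted by user and time so the flattened result lines up with the original row order:

```python
import pandas as pd

df = pd.DataFrame({
    'userID': [3229] * 6 + [4330] * 5,
    'itemID': [4493320, 4492492, 4496756, 4493673, 4497531, 4495006,
               4500695, 4500656, 4503087, 4501846, 4504105],
    'transaction_ts': pd.to_datetime([
        '2016-01-02 14:55:00', '2016-01-02 14:57:02',
        '2016-01-04 09:01:18', '2016-01-04 09:11:10',
        '2016-01-04 11:05:25', '2016-01-05 07:25:11',
        '2016-01-02 09:17:21', '2016-01-03 09:19:28',
        '2016-01-04 07:42:15', '2016-01-04 08:55:24',
        '2016-01-04 09:59:35']),
})

df = df.sort_values(['userID', 'transaction_ts'])
counts = (df.set_index('transaction_ts')
            .groupby('userID')['itemID']
            .rolling('24h')      # trailing window (t - 24h, t]
            .count())
# counts comes back with a (userID, transaction_ts) MultiIndex in the
# same row order as the sorted frame, so the raw values line up.
df['rolling_count'] = counts.values.astype(int)
```

Note that the time-based window is closed on the right by default, so a transaction exactly 24 hours earlier falls outside the window; this matches the expected output in the question.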