How can I calculate the time (number of days) between "events" in a Pandas time series? For example, given the time series below, I'd like to know, for each day, how many days have passed since the last True:
event
2010-01-01 False
2010-01-02 True
2010-01-03 False
2010-01-04 False
2010-01-05 True
2010-01-06 False
The way I've done it seems overcomplicated, so I'm hoping for something more elegant. Obviously a for loop iterating over the rows would work, but ideally I'm looking for a vectorized (scalable) solution. My current attempt is below:
import pandas as pd

date_range = pd.date_range('2010-01-01', '2010-01-06')
df = pd.DataFrame([False, True, False, False, True, False], index=date_range, columns=['event'])
event_dates = df.index[df['event']]
df2 = pd.DataFrame(event_dates, index=event_dates, columns=['max_event_date'])
df = df.join(df2)
# the default skipna=True keeps the leading NaT as NaT while carrying the running
# max forward; skipna=False would instead propagate the NaT through every later row
df['max_event_date'] = df['max_event_date'].cummax()
df['days_since_event'] = df.index - df['max_event_date']
event max_event_date days_since_event
2010-01-01 False NaT NaT
2010-01-02 True 2010-01-02 0 days
2010-01-03 False 2010-01-02 1 days
2010-01-04 False 2010-01-02 2 days
2010-01-05 True 2010-01-05 0 days
2010-01-06 False 2010-01-05 1 days
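An alternative vectorized sketch (not from the original post; it assumes a daily index): each True starts a new group under `cumsum`, and `groupby(...).cumcount()` then numbers the rows within each group, which is exactly "days since the last True".

```python
import pandas as pd

date_range = pd.date_range('2010-01-01', '2010-01-06')
df = pd.DataFrame({'event': [False, True, False, False, True, False]},
                  index=date_range)

# Each True begins a new group id; cumcount counts positions within each group.
group_id = df['event'].cumsum()
days_since = df.groupby(group_id).cumcount()

# Rows before the first event fall in group 0 and should stay undefined.
days_since = days_since.where(group_id > 0)
```

Note this counts rows rather than calendar days, so it matches the desired output only because the index here has daily frequency.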
Continuing to improve on this answer, and hoping that someone comes in with 'the' pythonic way. Until then, I think this final update works best.
import numpy as np
import pandas as pd

last = pd.to_datetime(np.nan)  # NaT until the first event is seen

def elapsed(row):
    # declare the global up front; Python 3 rejects a `global` statement
    # that appears after the name has already been used in the function
    global last
    if row.event:
        last = row.name
    return row.name - last

df['elapsed'] = df.apply(elapsed, axis=1)
df
event elapsed
2010-01-01 False NaT
2010-01-02 True 0 days
2010-01-03 False 1 days
2010-01-04 False 2 days
2010-01-05 True 0 days
2010-01-06 False 1 days
:::::::::::::
Leaving previous answers below although they are sub-optimal
:::::::::
Instead of making multiple passes through, it seems easier to just loop through the rows. Note that with a DatetimeIndex, label arithmetic like df['elapsed'][i-1] doesn't work, so use positional indexing instead:

df['elapsed'] = 0
col = df.columns.get_loc('elapsed')
for i in range(1, len(df)):
    if not df['event'].iloc[i]:
        df.iloc[i, col] = df.iloc[i - 1, col] + 1
::::::::::::
Let's say 'Trues' are your event of interest.

trues = df[df['event']].copy()
trues['Elapsed'] = trues.index.to_series().diff()  # need .to_series() because .diff() doesn't work on the index; attribute assignment like trues.Dates = ... wouldn't create a column
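Since the snippet above depends on a `df` defined earlier, here is a self-contained version of the same idea (the `Elapsed` column name is just illustrative); it gives the gap between consecutive events rather than a value for every day:

```python
import pandas as pd

date_range = pd.date_range('2010-01-01', '2010-01-06')
df = pd.DataFrame({'event': [False, True, False, False, True, False]},
                  index=date_range)

trues = df[df['event']].copy()
# .diff() is not defined on a DatetimeIndex directly, so convert to a Series first.
trues['Elapsed'] = trues.index.to_series().diff()
```

The first event has no predecessor, so its `Elapsed` is NaT.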
A one-pass solution would certainly be ideal, but here's a multi-pass solution using only (presumably) cythonized pandas functions:
import numpy as np
import pandas as pd

def get_delay(ds):
    x1 = (~ds).cumsum()                # running count of non-event days
    x2 = x1.where(ds, np.nan).ffill()  # that count frozen at each event, carried forward
    return x1 - x2                     # non-event days since the most recent event
date_range = pd.date_range('2010-01-01', '2010-01-06')
ds = pd.Series([False, True, False, False, True, False], index=date_range)
pd.concat({'event': ds, 'delay': get_delay(ds)}, axis=1)

            event  delay
2010-01-01  False    NaN
2010-01-02   True    0.0
2010-01-03  False    1.0
2010-01-04  False    2.0
2010-01-05   True    0.0
2010-01-06  False    1.0
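One caveat: get_delay counts rows, which equals days only because the index has daily frequency. For an irregular DatetimeIndex, a sketch of the same mask-and-ffill idea applied to the timestamps themselves works (the dates below are made up for illustration):

```python
import pandas as pd

# Irregular dates: note the gap between Jan 2 and Jan 5.
idx = pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-05',
                      '2010-01-06', '2010-01-09'])
ds = pd.Series([False, True, False, False, True], index=idx)

# Keep each event's own timestamp, forward-fill it over the gaps,
# then subtract from the current timestamp.
last_event = ds.index.to_series().where(ds).ffill()
delay = ds.index.to_series() - last_event
```

This yields true calendar deltas (e.g. 3 days between Jan 2 and Jan 5), with NaT before the first event.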
And interestingly it seems to perform a little better in some quick benchmarks, possibly due to avoiding row-wise operations:
%%timeit -n 1000
def get_delay(ds):
    x1 = (~ds).cumsum()
    x2 = x1.where(ds, np.nan).ffill()
    return x1 - x2
n = 100
events = np.random.choice([True, False], size=n)
date_range = pd.date_range('2010-01-01', periods=n)
df = pd.DataFrame(events, index=date_range, columns=['event'])
get_delay(df['event'])
1000 loops, best of 3: 1.09 ms per loop
Versus the single loop approach with a global:
%%timeit -n 1000
last = pd.to_datetime(np.nan)

def elapsed(row):
    global last
    if row.event:
        last = row.name
    return row.name - last
n = 100
events = np.random.choice([True, False], size=n)
date_range = pd.date_range('2010-01-01', periods=n)
df = pd.DataFrame(events, index=date_range, columns=['event'])
df.apply(elapsed, axis=1)
1000 loops, best of 3: 2.4 ms per loop
Perhaps there's some nuance in that comparison that makes it unfair, but either way the version with no custom row-wise function certainly isn't slower — if anything, it's a bit faster.