Pandas time series time between events

Tags:

python

pandas

How can I calculate the time (number of days) between "events" in a Pandas time series? For example, given the time series below, I'd like to know, on each day in the series, how many days have passed since the last True:

            event
2010-01-01  False
2010-01-02   True
2010-01-03  False
2010-01-04  False
2010-01-05   True
2010-01-06  False

The way I've done it seems overcomplicated, so I'm hoping for something more elegant. Obviously a for loop iterating over the rows would work, but ideally I'm looking for a vectorized (scalable) solution. My current attempt is below:

import pandas as pd

date_range = pd.date_range('2010-01-01', '2010-01-06')
df = pd.DataFrame([False, True, False, False, True, False], index=date_range, columns=['event'])

# put each event's date in a column aligned on the event dates
event_dates = df.index[df['event']]
df2 = pd.DataFrame(event_dates, index=event_dates, columns=['max_event_date'])
df = df.join(df2)

# running maximum carries the most recent event date forward
df['max_event_date'] = df['max_event_date'].cummax()
df['days_since_event'] = df.index - df['max_event_date']

            event max_event_date  days_since_event
2010-01-01  False            NaT               NaT
2010-01-02   True     2010-01-02            0 days
2010-01-03  False     2010-01-02            1 days
2010-01-04  False     2010-01-02            2 days
2010-01-05   True     2010-01-05            0 days
2010-01-06  False     2010-01-05            1 days
asked Mar 17 '15 by user1507844

2 Answers

I'm continuing to improve on this answer, hoping that someone comes along with 'the' pythonic way. Until then, I think this final update works best.

import numpy as np
import pandas as pd

last = pd.to_datetime(np.nan)  # NaT; reset this before re-running

def elapsed(row):
    global last  # must be declared before 'last' is used anywhere in the function
    if not row.event:
        return row.name - last
    else:
        last = row.name
        return row.name - last  # 0 days on event rows

df['elapsed'] = df.apply(elapsed, axis=1)

df
            event  elapsed
2010-01-01  False      NaT
2010-01-02   True   0 days
2010-01-03  False   1 days
2010-01-04  False   2 days
2010-01-05   True   0 days
2010-01-06  False   1 days
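
For what it's worth, here is one fully vectorized sketch of the same computation (my own variation, assuming the df from the question): forward-fill the date of the most recent event, then subtract it from each row's date.

dates = pd.Series(df.index, index=df.index)    # each row's own date
last_event = dates.where(df['event']).ffill()  # most recent True date; NaT before the first
df['elapsed'] = dates - last_event             # Timedelta per row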

::::::::::::

I'm leaving my previous answers below, although they are sub-optimal.

::::::::::::

Instead of making multiple passes through the data, it seems easier to just loop through the indexes:

df['elapsed'] = 0
for i in range(1, len(df)):
    if not df['event'].iloc[i]:
        # days since the last event = previous row's count + 1
        df.loc[df.index[i], 'elapsed'] = df['elapsed'].iloc[i - 1] + 1

::::::::::::

Let's say 'Trues' are your event of interest.

trues = df[df['event']].copy()  # .copy() avoids SettingWithCopy warnings
trues['Dates'] = trues.index    # need a column because .diff() works on a Series, not the index
trues['Elapsed'] = trues['Dates'].diff()
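
Note that this only yields one row per event, giving the spacing between consecutive Trues rather than a per-day count. With the sample data the result would look like this (illustrative output):

            event      Dates Elapsed
2010-01-02   True 2010-01-02     NaT
2010-01-05   True 2010-01-05  3 days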
answered Oct 12 '22 by EngineeredE


A one-pass solution would certainly be ideal, but here's a multi-pass solution using only (presumably) cythonized pandas functions:

import numpy as np
import pandas as pd

def get_delay(ds):
    x1 = (~ds).cumsum()                # running count of non-event rows
    x2 = x1.where(ds, np.nan).ffill()  # that count frozen at the most recent event
    return x1 - x2                     # rows elapsed since the last event

date_range = pd.date_range('2010-01-01', '2010-01-06')
ds = pd.Series([False, True, False, False, True, False], index=date_range)
pd.concat([ds, get_delay(ds)], axis=1)


            Event   Last
2010-01-01  False   NaN
2010-01-02  True    0
2010-01-03  False   1
2010-01-04  False   2
2010-01-05  True    0
2010-01-06  False   1

And interestingly it seems to perform a little better in some quick benchmarks, possibly due to avoiding row-wise operations:

%%timeit -n 1000

def get_delay(ds):
    x1 = (~ds).cumsum()
    x2 = x1.where(ds, np.nan).ffill()
    return x1 - x2

n = 100
events = np.random.choice([True, False], size=n)
date_range = pd.date_range('2010-01-01', periods=n)
df = pd.DataFrame(events, index=date_range, columns=['event'])
get_delay(df['event'])

1000 loops, best of 3: 1.09 ms per loop

Versus the single loop approach with a global:

%%timeit -n 1000

last = pd.to_datetime(np.nan)  # NaT

def elapsed(row):
    global last
    if not row.event:
        return row.name - last
    else:
        last = row.name
        return row.name - last


n = 100
events = np.random.choice([True, False], size=n)
date_range = pd.date_range('2010-01-01', periods=n)
df = pd.DataFrame(events, index=date_range, columns=['event'])
df.apply(elapsed, axis=1)

1000 loops, best of 3: 2.4 ms per loop

Perhaps there's some nuance in that comparison that makes it unfair, but either way, the version without custom row-wise functions is certainly no slower here; in this quick benchmark it's more than twice as fast.
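
One caveat: get_delay returns row counts rather than Timedeltas. Assuming a daily DatetimeIndex like the question's (an assumption on my part), the counts map back to days directly:

# row counts -> timedeltas (NaN becomes NaT); assumes exactly one row per day
delay_days = get_delay(df['event']) * pd.Timedelta('1D')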

answered Oct 12 '22 by Eric Czech