Performance issue turning rows with start/end dates into a dataframe with a TimeIndex

I have a large dataset where each line represents the value of a certain type (think a sensor) for a time interval (between start and end). It looks like this:

    start       end    type value
2015-01-01  2015-01-05  1   3
2015-01-06  2015-01-08  1   2
2015-01-05  2015-01-08  3   3
2015-01-13  2015-01-16  2   1

I want to turn it into a daily time-indexed frame like this:

day       type  value
2015-01-01  1   3
2015-01-02  1   3
2015-01-03  1   3
2015-01-04  1   3
2015-01-05  1   3
2015-01-06  1   2
2015-01-07  1   2
2015-01-08  1   2
2015-01-05  3   3
2015-01-06  3   3
2015-01-07  3   3
2015-01-08  3   3
2015-01-13  2   1
2015-01-14  2   1
2015-01-15  2   1
2015-01-16  2   1

(Note that we cannot make any assumption regarding the intervals: they should be contiguous and non-overlapping, but we cannot guarantee that.)

Based on these Stack Overflow answers [1] (DataFrame resample on date ranges) and [2] (pandas: Aggregate based on start/end date), there seem to be two methods: one built around itertuples and one built around melt ([2] uses stack/unstack, but it is similar to melt). Let us compare them for performance.

import datetime as dt

import numpy as np
import pandas as pd

# Creating a big enough dataframe
date_range = pd.date_range(start=dt.datetime(2015, 1, 1), end=dt.datetime(2019, 12, 31), freq='4D')
to_concat = []
for val in range(1, 50):
    frame_tmp = pd.DataFrame()
    frame_tmp['start'] = date_range
    frame_tmp['end'] = frame_tmp['start'] + dt.timedelta(3)
    frame_tmp['type'] = val
    frame_tmp['value'] = np.random.randint(1, 6, frame_tmp.shape[0])
    to_concat.append(frame_tmp)
df = pd.concat(to_concat, ignore_index=True)

# Method 1: expand each row with itertuples + date_range, then join back
def method_1(df):
    # one Series per row: index = the daily dates, values = the original row index
    df1 = (pd.concat([pd.Series(r.Index,
                                pd.date_range(r.start,
                                              r.end,
                                              freq='D'))
                      for r in df.itertuples()])) \
        .reset_index()
    df1.columns = ['start_2', 'idx']

    # join the daily rows back to the original columns via the row index
    df2 = df1.set_index('idx').join(df).reset_index(drop=True)

    return df2.set_index('start_2')

df_method_1 = df.groupby(['type']).apply(method_1)

# Method 2: melt start/end into one column, then resample each original row daily
df_tmp = df.reset_index()
df1 = (df_tmp.melt(df_tmp.columns.difference(['start', 'end']),
                   ['start', 'end'],
                   value_name='current_time')
       )
df_method_2 = (df1.set_index('current_time')
               .groupby('index', group_keys=False)
               .resample('D').ffill())

With %%timeit in Jupyter, method 1 takes ~8s and method 2 takes ~25s on the example dataframe defined above. This is way too slow, as the real dataset I am dealing with is much bigger. On that dataset, method 1 takes ~20 minutes.
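
(For reference, a minimal timing sketch that also works outside a notebook; the method_2 wrapper below is just my packaging of the method-2 snippet above, and %%timeit in a Jupyter cell gives comparable numbers.)

import timeit

def method_2(df):
    # same steps as the method-2 snippet above, wrapped so it can be timed
    df_tmp = df.reset_index()
    df1 = df_tmp.melt(df_tmp.columns.difference(['start', 'end']),
                      ['start', 'end'],
                      value_name='current_time')
    return (df1.set_index('current_time')
            .groupby('index', group_keys=False)
            .resample('D').ffill())

# one run of each is enough to compare orders of magnitude
print('method 1:', timeit.timeit(lambda: df.groupby(['type']).apply(method_1), number=1))
print('method 2:', timeit.timeit(lambda: method_2(df), number=1))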

Do you have any idea on how to make this faster?

asked Jun 07 '18 by Phik


1 Answer

This is about 1.7 times faster than your method_1 and a little tidier:

# expand each (start, end) interval into one record per day
df_expand = pd.DataFrame.from_records(
    (
        (d, r.type, r.value)
        for r in df.itertuples()
        for d in pd.date_range(start=r.start, end=r.end, freq='D')
    ),
    columns=['day', 'type', 'value']
)
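
(Not part of the original answer: if you want the daily time index from your question rather than a plain day column, a one-line follow-up on the frame above is enough, and the same applies to every variant below.)

# optional: index by day, as in the question's target frame
df_expand = df_expand.set_index('day')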

You can get about 7 times faster by creating your own date range instead of calling pd.date_range():

one_day = dt.timedelta(1)
df_expand = pd.DataFrame.from_records(
    (
        (r.start + i * one_day, r.type, r.value)
        for r in df.itertuples()
        for i in range(int((r.end - r.start) / one_day) + 1)
    ),
    columns=['day', 'type', 'value']
)

Or you can get up to 24 times faster by using numpy's arange function to generate the dates:

one_day = dt.timedelta(1)
df_expand = pd.DataFrame.from_records(
    (
        (d, r.type, r.value)
        for r in df.itertuples()
        for d in np.arange(r.start.date(), r.end.date() + one_day, dtype='datetime64[D]')
    ),
    columns=['day', 'type', 'value']
)
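
(A caveat not from the original answer: depending on your pandas version, a day column built from numpy datetime64[D] scalars may not come out as datetime64[ns]; normalizing it explicitly is cheap.)

# make sure 'day' is a regular pandas datetime column
# (dtype inference for datetime64[D] scalars can differ between pandas versions)
df_expand['day'] = pd.to_datetime(df_expand['day'])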

I couldn't resist adding one more that is a little more than twice as fast as the last one. Unfortunately, it's a lot harder to read. This groups the readings based on how many days they span ('dur'), then uses vectorized numpy operations to expand each group in a single batch.

def expand_group(g):
    dur = g.dur.iloc[0] # how many days for each reading in this group?
    return pd.DataFrame({
        'day': (g.start.values[:,None] + np.timedelta64(1, 'D') * np.arange(dur)).ravel(),
        'type': np.repeat(g.type.values, dur),
        'value': np.repeat(g.value.values, dur),
    })
# take all readings with the same duration and process them together using vectorized code
df_expand = (
    df.assign(dur=(df['end']-df['start']).dt.days + 1)
    .groupby('dur').apply(expand_group)
    .reset_index('dur', drop=True)
)

Update: Responding to your comment, below is a simplified version of the vectorized approach, which is faster and easier to read. Instead of using the groupby step, this makes a single matrix as wide as the longest reading, then filters out the unneeded entries. This should be pretty efficient unless the maximum duration for your readings is much longer than the average. With the test dataframe (all readings lasting 4 days), this is about 15x faster than the groupby solution and about 700x faster than method_1.

# maximum duration (in days) across all readings
dur = (df['end'] - df['start']).max().days + 1
# build one row of candidate days per reading, dur columns wide
df_expand = pd.DataFrame({
    'day': (
        df['start'].values[:, None] + np.timedelta64(1, 'D') * np.arange(dur)
    ).ravel(),
    'type': np.repeat(df['type'].values, dur),
    'value': np.repeat(df['value'].values, dur),
    'end': np.repeat(df['end'].values, dur),
})
# keep only the days that fall inside each reading's actual interval
df_expand = df_expand.loc[df_expand['day'] <= df_expand['end'], 'day':'value']
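
(A quick sanity check, not from the original answer, assuming df and df_expand as defined above: every input interval should expand into exactly (end - start).days + 1 daily rows.)

# each interval contributes (end - start).days + 1 daily rows
expected_rows = ((df['end'] - df['start']).dt.days + 1).sum()
assert len(df_expand) == expected_rows
# and every expanded day lies within the overall date span of the input
assert df_expand['day'].between(df['start'].min(), df['end'].max()).all()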
answered Oct 18 '22 by Matthias Fripp