Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas dataframe join on overlapped time ranges

I have two dataframes. Each one has a timestamp index representing the start time and a duration value (in seconds) which could be used to calculate the end time. The time interval and duration is different for each dataframe, and could vary within each dataframe as well.

                     duration   param1
Start Time (UTC) 
2017-10-14 02:00:31   60         95
2017-10-14 02:01:31   60         34
2017-10-14 02:02:31   60         10
2017-10-14 02:03:31   60         44
2017-10-14 02:04:31   60         63
2017-10-14 02:05:31   60         52
...

                     duration   param2
Start Time (UTC)
2017-10-14 02:00:00   300        93
2017-10-14 02:05:00   300        95
2017-10-14 02:10:00   300        91
...

I want to join these two dataframes such the index and columns of the first are maintained but the parameter values from the second are copied to it using the following scheme:

For each row in the first dataframe, assign the param2 value from the first row in the (sorted) second dataframe which contains 50% or more of the time range.

Example output below:

                     duration   param1   param2
Start Time (UTC) 
2017-10-14 02:00:31   60         95        93
2017-10-14 02:01:31   60         34        93
2017-10-14 02:02:31   60         10        93
2017-10-14 02:03:31   60         44        93
2017-10-14 02:04:31   60         63        95
2017-10-14 02:05:31   60         52        95
...
like image 240
davegravy Avatar asked Nov 07 '22 14:11

davegravy


1 Answers

Here's an approach which mostly solves this, but with some simplifications to the problem. The code can probably be extended to solve those too, as described. This solution is also robust to timeseries gaps (skipped index values) as well as timeseries blanks (intentional NaNs).

import numpy as np
import pandas as pd

def merge_nearest(df_left, df_right):
    """
    Assumptions:
        1. constant duration in df_left # could be solved
           with a `df_left.groupby('duration')` which calls
           this function on each group
        2. which is always less than or equal to the variable
           duration of df_right # could probably just
           programatically get the min
    """
    df_left = df_left.sort_index()
    df_right = df_right.sort_index()
    min_duration = df_left['duration'].min() # seconds

    # merge nearest start times together, still blank df_right
    # values for the rest of each interval's duration
    matched = pd.merge_asof(df_left, df_right, left_index=True,
                            right_index=True, suffixes=('_left', '_right'),
                            tolerance=pd.Timedelta(min_duration / 2, unit='s'),
                            direction='nearest')

    # fancy forward fill that uses a variable timedelta-based limit
    righteous_cols = [col + '_right' if col in df_left.columns else col \
                      for col in df_right.columns]
    store_index = matched.index
    duration_string = f'{int(np.round(min_duration))}s'
    index_gaps_to_blanks = pd.date_range(start=matched.index.min().round(duration_string),
                                         end=matched.index.max().round(duration_string),
                                         freq=duration_string)
    rounded = matched.index.round(duration_string)
    tolerances = matched.index - rounded
    matched.index = rounded
    matched = matched.reindex(index=index_gaps_to_blanks)
    # this ffill is just to group properly
    grouped = matched.fillna(method='ffill').groupby('duration_right', sort=False)
    for duration, index_group in grouped.groups.items():
        fill_limit = int(np.round(duration / min_duration)) - 1
        if fill_limit > 0:
            matched.loc[index_group, righteous_cols] = \
            matched.loc[index_group, righteous_cols].fillna(method='ffill',
                                                            limit=fill_limit)
    matched = matched.reindex(index=store_index, method='nearest', tolerance=np.abs(tolerances))
    return matched

Testing it out:

# sample data

# 1 minute timeseries with 1 day gap
arr = np.linspace(25, 55, 100)
sotime = pd.date_range(start='2017-10-14 02:00:31', freq='1min', 
                       periods=100, name='Start Time (UTC)')
sotime = sotime[:27].append(sotime[27:] + pd.Timedelta(1, unit='day'))
sodf = pd.DataFrame(dict(level=arr.round(2), duration=[60.0] * 100), index=sotime)

# an offset 5, 10, 1 minute timeseries also with an offset 1 day gap
arr = np.linspace(0, 2.5, 29)
turtime1 = pd.date_range(start='2017-10-14 02:10:00', freq='5min', 
                         periods=6, name='Start Time (UTC)')
turtime2 = pd.date_range(start='2017-10-14 02:40:00', freq='10min', 
                         periods=3, name='Start Time (UTC)')
turtime3 = pd.date_range(start='2017-10-14 03:10:00', freq='1min', 
                         periods=20, name='Start Time (UTC)')
turtime = turtime1.append(turtime2).append(turtime3)
turtime = turtime[:4].append(turtime[4:] + pd.Timedelta(1, unit='day'))
turdf = pd.DataFrame(dict(power=arr.round(2), 
                          duration=[300] * 6 + [600] * 3 + [60] * 20), index=turtime)

merge_nearest(sodf, turdf)
like image 147
user2561747 Avatar answered Nov 14 '22 20:11

user2561747