I have two dataframes. Each one has a timestamp index representing the start time and a duration value (in seconds) which could be used to calculate the end time. The time interval and duration is different for each dataframe, and could vary within each dataframe as well.
duration param1
Start Time (UTC)
2017-10-14 02:00:31 60 95
2017-10-14 02:01:31 60 34
2017-10-14 02:02:31 60 10
2017-10-14 02:03:31 60 44
2017-10-14 02:04:31 60 63
2017-10-14 02:05:31 60 52
...
duration param2
Start Time (UTC)
2017-10-14 02:00:00 300 93
2017-10-14 02:05:00 300 95
2017-10-14 02:10:00 300 91
...
I want to join these two dataframes such the index and columns of the first are maintained but the parameter values from the second are copied to it using the following scheme:
For each row in the first dataframe, assign the param2 value from the first row in the (sorted) second dataframe which contains 50% or more of the time range.
Example output below:
duration param1 param2
Start Time (UTC)
2017-10-14 02:00:31 60 95 93
2017-10-14 02:01:31 60 34 93
2017-10-14 02:02:31 60 10 93
2017-10-14 02:03:31 60 44 93
2017-10-14 02:04:31 60 63 95
2017-10-14 02:05:31 60 52 95
...
Here's an approach which mostly solves this, but with some simplifications to the problem. The code can probably be extended to solve those too, as described. This solution is also robust to timeseries gaps (skipped index values) as well as timeseries blanks (intentional NaNs).
import numpy as np
import pandas as pd
def merge_nearest(df_left, df_right):
"""
Assumptions:
1. constant duration in df_left # could be solved
with a `df_left.groupby('duration')` which calls
this function on each group
2. which is always less than or equal to the variable
duration of df_right # could probably just
programatically get the min
"""
df_left = df_left.sort_index()
df_right = df_right.sort_index()
min_duration = df_left['duration'].min() # seconds
# merge nearest start times together, still blank df_right
# values for the rest of each interval's duration
matched = pd.merge_asof(df_left, df_right, left_index=True,
right_index=True, suffixes=('_left', '_right'),
tolerance=pd.Timedelta(min_duration / 2, unit='s'),
direction='nearest')
# fancy forward fill that uses a variable timedelta-based limit
righteous_cols = [col + '_right' if col in df_left.columns else col \
for col in df_right.columns]
store_index = matched.index
duration_string = f'{int(np.round(min_duration))}s'
index_gaps_to_blanks = pd.date_range(start=matched.index.min().round(duration_string),
end=matched.index.max().round(duration_string),
freq=duration_string)
rounded = matched.index.round(duration_string)
tolerances = matched.index - rounded
matched.index = rounded
matched = matched.reindex(index=index_gaps_to_blanks)
# this ffill is just to group properly
grouped = matched.fillna(method='ffill').groupby('duration_right', sort=False)
for duration, index_group in grouped.groups.items():
fill_limit = int(np.round(duration / min_duration)) - 1
if fill_limit > 0:
matched.loc[index_group, righteous_cols] = \
matched.loc[index_group, righteous_cols].fillna(method='ffill',
limit=fill_limit)
matched = matched.reindex(index=store_index, method='nearest', tolerance=np.abs(tolerances))
return matched
Testing it out:
# sample data
# 1 minute timeseries with 1 day gap
arr = np.linspace(25, 55, 100)
sotime = pd.date_range(start='2017-10-14 02:00:31', freq='1min',
periods=100, name='Start Time (UTC)')
sotime = sotime[:27].append(sotime[27:] + pd.Timedelta(1, unit='day'))
sodf = pd.DataFrame(dict(level=arr.round(2), duration=[60.0] * 100), index=sotime)
# an offset 5, 10, 1 minute timeseries also with an offset 1 day gap
arr = np.linspace(0, 2.5, 29)
turtime1 = pd.date_range(start='2017-10-14 02:10:00', freq='5min',
periods=6, name='Start Time (UTC)')
turtime2 = pd.date_range(start='2017-10-14 02:40:00', freq='10min',
periods=3, name='Start Time (UTC)')
turtime3 = pd.date_range(start='2017-10-14 03:10:00', freq='1min',
periods=20, name='Start Time (UTC)')
turtime = turtime1.append(turtime2).append(turtime3)
turtime = turtime[:4].append(turtime[4:] + pd.Timedelta(1, unit='day'))
turdf = pd.DataFrame(dict(power=arr.round(2),
duration=[300] * 6 + [600] * 3 + [60] * 20), index=turtime)
merge_nearest(sodf, turdf)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With