pandas.DataFrame.interpolate
allows to fill missing data by interpolating neighboring values. Among the arguments it accepts, two of them seem relevant for this question: method
and limit
.
method
: among other possible values, accepts "linear"
and "time"
. The difference between them is that "linear"
assumes equidistant rows and ignores the index, while "time"
interpolates taking into account the time intervals defined by the index>>> df = pd.DataFrame({'vals': [11, np.nan, np.nan, 12, np.nan, 22]},
... index=pd.to_datetime(['2020-02-02 11:00', '2020-02-02 11:06', '2020-02-02 11:30',
... '2020-02-02 12:00', '2020-02-02 16:00', '2020-02-02 22:00']))
>>> df.assign(interp_linear=df.interpolate(method='linear'),
... interp_time=df.interpolate(method='time'))
vals interp_linear interp_time
2020-02-02 11:00:00 11.0 11.000000 11.0
2020-02-02 11:06:00 NaN 11.333333 11.1
2020-02-02 11:30:00 NaN 11.666667 11.5
2020-02-02 12:00:00 12.0 12.000000 12.0
2020-02-02 16:00:00 NaN 17.000000 16.0
2020-02-02 22:00:00 22.0 22.000000 22.0
limit
: allows you to limit the length of the gaps to be filled with interpolated values in terms of the number of missing values. It expects integer values and is defined as the maximum number of consecutive NaNs to fill. While this behavior is absolutely reasonable for the method="linear"
case, it seems limited for the method="time"
case.When filling missing values with interpolation, is there any way to limit the length of the gap to be filled by specifying the maximum temporal gap rather than the number of missing values?
Specifically, I would like a behaviour equivalent to interpolate with method='time'
but filling only missing values that are not further away than a given timedelta from the last and the next non-missing values. An alternative criteria could also be filling all missing values surrounded by non-missing values that are not further away than a given timedelta. Subtly different options, but both of them good.
Something equivalent to this:
>>> df.interpolate(method='time', limit='2h')
vals
2020-02-02 11:00:00 11.0
2020-02-02 11:06:00 11.1
2020-02-02 11:30:00 11.5
2020-02-02 12:00:00 12.0
2020-02-02 16:00:00 NaN
2020-02-02 22:00:00 22.0
One can create and use a mask to pick off the values required:
df = pd.DataFrame({'vals': [11, np.nan, np.nan, 12, np.nan, 22]},
index=pd.to_datetime(['2020-02-02 11:00', '2020-02-02 11:06', '2020-02-02 11:30',
'2020-02-02 12:00', '2020-02-02 16:00', '2020-02-02 22:00']))
df['ts'] = df.index
mask = df['ts'].diff() < datetime.timedelta(hours=2)
df["masked_interp"] = df["vals"].interpolate(method='time')[mask]
df["vals"] = df["vals"].combine_first(df["masked_interp"])
Giving:
vals
2020-02-02 11:00:00 11.0
2020-02-02 11:06:00 11.1
2020-02-02 11:30:00 11.5
2020-02-02 12:00:00 12.0
2020-02-02 16:00:00 NaN
2020-02-02 22:00:00 22.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With