Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to apply a time-aware limit when doing use time interpolation in pandas?

Tags:

python

pandas

pandas.DataFrame.interpolate allows to fill missing data by interpolating neighboring values. Among the arguments it accepts, two of them seem relevant for this question: method and limit.

  • method: among other possible values, accepts "linear" and "time". The difference between them is that "linear" assumes equidistant rows and ignores the index, while "time" interpolates taking into account the time intervals defined by the index
>>> df = pd.DataFrame({'vals': [11, np.nan, np.nan, 12, np.nan, 22]},
...                   index=pd.to_datetime(['2020-02-02 11:00', '2020-02-02 11:06', '2020-02-02 11:30',
...                                         '2020-02-02 12:00', '2020-02-02 16:00', '2020-02-02 22:00']))
>>> df.assign(interp_linear=df.interpolate(method='linear'),
...           interp_time=df.interpolate(method='time'))

                     vals  interp_linear  interp_time
2020-02-02 11:00:00  11.0      11.000000         11.0
2020-02-02 11:06:00   NaN      11.333333         11.1
2020-02-02 11:30:00   NaN      11.666667         11.5
2020-02-02 12:00:00  12.0      12.000000         12.0
2020-02-02 16:00:00   NaN      17.000000         16.0
2020-02-02 22:00:00  22.0      22.000000         22.0
  • limit: allows you to limit the length of the gaps to be filled with interpolated values in terms of the number of missing values. It expects integer values and is defined as the maximum number of consecutive NaNs to fill. While this behavior is absolutely reasonable for the method="linear" case, it seems limited for the method="time" case.

Question

When filling missing values with interpolation, is there any way to limit the length of the gap to be filled by specifying the maximum temporal gap rather than the number of missing values?

Specifically, I would like a behaviour equivalent to interpolate with method='time' but filling only missing values that are not further away than a given timedelta from the last and the next non-missing values. An alternative criteria could also be filling all missing values surrounded by non-missing values that are not further away than a given timedelta. Subtly different options, but both of them good.

Something equivalent to this:

>>> df.interpolate(method='time', limit='2h')

                     vals
2020-02-02 11:00:00  11.0
2020-02-02 11:06:00  11.1
2020-02-02 11:30:00  11.5
2020-02-02 12:00:00  12.0
2020-02-02 16:00:00   NaN
2020-02-02 22:00:00  22.0
like image 477
mgab Avatar asked Nov 16 '22 09:11

mgab


1 Answers

One can create and use a mask to pick off the values required:

df = pd.DataFrame({'vals': [11, np.nan, np.nan, 12, np.nan, 22]},
                   index=pd.to_datetime(['2020-02-02 11:00', '2020-02-02 11:06', '2020-02-02 11:30',
                                         '2020-02-02 12:00', '2020-02-02 16:00', '2020-02-02 22:00']))
df['ts'] = df.index
mask = df['ts'].diff() < datetime.timedelta(hours=2)
df["masked_interp"] = df["vals"].interpolate(method='time')[mask]
df["vals"] = df["vals"].combine_first(df["masked_interp"])

Giving:

                    vals
2020-02-02 11:00:00 11.0
2020-02-02 11:06:00 11.1
2020-02-02 11:30:00 11.5
2020-02-02 12:00:00 12.0
2020-02-02 16:00:00 NaN
2020-02-02 22:00:00 22.0
like image 141
Chris Seeling Avatar answered Dec 31 '22 21:12

Chris Seeling