My general problem is that I have a dataframe whose columns correspond to feature values, plus a date column. Each feature column may have missing NaN values. I want to fill a column with some fill logic such as "fill_mean" or "fill_zero".
But I do not want to just apply the fill logic to the whole column, because if one of the earlier values is a NaN, I do not want the mean I fill in for that specific NaN to be tainted by what the average was later on, which the model should have no knowledge of. Essentially it's the common problem of not leaking information about the future to your model, here while trying to fill my time series.
Anyway, I have reduced the problem to a few lines of code. This is my simplified attempt at the general problem above:
import numpy as np

# Assume ts_values is a time series where the first value in the list is the
# oldest and the last value is the most recent.
ts_values = [17.0, np.nan, 12.0, np.nan, 18.0]

# np.argwhere returns a 2-D array of indices, hence nan_ind[0] below.
nan_inds = np.argwhere(np.isnan(ts_values))
for nan_ind in nan_inds:
    nan_ind_value = nan_ind[0]
    # Fill each NaN with the mean of everything before it, including values
    # filled on earlier iterations, so no future information leaks in.
    ts_values[nan_ind_value] = np.mean(ts_values[0:nan_ind_value])
The output of the above script is:
[17.0, 17.0, 12.0, 15.333333333333334, 18.0]
which is exactly what I would expect.
My only issue with this is that it runs in linear time with respect to the number of NaNs in the data set. Is there a way to do this in constant or log time, where I don't iterate through the NaN indices?
If you want NaN values replaced with a rolling mean over the full window (i.e. an expanding mean) on a pandas series s, noting from WeNYoBen that this does not continue the mean calculation during the fill (so your 15.33 becomes 14.5):

s.fillna(s.expanding(1).mean())
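As a quick sanity check, here is that one-liner run on the example series (just the same values wrapped in a pd.Series):

import numpy as np
import pandas as pd

s = pd.Series([17.0, np.nan, 12.0, np.nan, 18.0])
print(s.fillna(s.expanding(1).mean()).tolist())
# [17.0, 17.0, 12.0, 14.5, 18.0]
# The second NaN becomes 14.5, not 15.33, because the expanding mean
# skips NaNs rather than using the values filled earlier in the pass.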
If you would like the mean to update as NaNs are filled, this in-place numba solution may help:
import numpy as np
from numba import jit

@jit(nopython=True)
def rolling_fill(a):
    # Fill each NaN in place with the mean of everything before it,
    # including values filled on earlier iterations.
    for i, e in enumerate(a):
        if np.isnan(e):
            a[i] = np.mean(a[:i])

ts_values = np.array([17.0, np.nan, 12.0, np.nan, 18.0])
rolling_fill(ts_values)
print(ts_values)
which gives
[17. 17. 12. 15.33333333 18. ]
You could probably improve this by keeping a running sum and count rather than calling np.mean every time; a sketch follows.
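A minimal sketch of that running-sum variant (rolling_fill_sum is a hypothetical name, and like the original it assumes the first element is not NaN):

import numpy as np
from numba import jit

@jit(nopython=True)
def rolling_fill_sum(a):
    # Maintain a running sum and count so each fill is O(1), instead of
    # recomputing np.mean over the whole prefix (which makes the loop O(n^2)).
    total = 0.0
    count = 0
    for i in range(a.shape[0]):
        if np.isnan(a[i]):
            a[i] = total / count  # assumes a[0] is not NaN, so count > 0 here
        total += a[i]
        count += 1

ts_values = np.array([17.0, np.nan, 12.0, np.nan, 18.0])
rolling_fill_sum(ts_values)
print(ts_values)  # [17. 17. 12. 15.33333333 18.]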
Time Complexity

This is not log or constant time: you must fill at most n - 2 missing items in an array of length n, and you have to read every element at least once to know where the NaNs are, so O(n) is also a lower bound. You cannot do theoretically better, but the above should be plenty optimized (by avoiding iteration in native Python), and lower-level implementations like this make it dramatically faster in practice.
EDIT: I originally misread and thought you were asking about interpolation.

If you would like to interpolate the series, pandas supports this directly:
>>> s = pd.Series([0, 1, np.nan, 5])
>>> s
0 0.0
1 1.0
2 NaN
3 5.0
dtype: float64
>>> s.interpolate()
0 0.0
1 1.0
2 3.0
3 5.0
dtype: float64
Or if you do not want to use pandas because your example is an ndarray, use numpy.interp accordingly.
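For example, one common way to apply it (a sketch; the mask-and-index idiom here is an assumption, while np.interp itself just takes the x positions to evaluate, the known x positions, and the known values):

import numpy as np

a = np.array([0.0, 1.0, np.nan, 5.0])
mask = np.isnan(a)
# Linearly interpolate the NaN positions from the non-NaN samples,
# using array indices as the x coordinates.
a[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask), a[~mask])
print(a)  # [0. 1. 3. 5.]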