 

How to efficiently fill a time series?

My general problem is that I have a dataframe where columns correspond to feature values. There is also a date column in the dataframe. Each feature column may contain missing (NaN) values. I want to fill a column with some fill logic such as "fill_mean" or "fill_zero".

But I do not want to just apply the fill logic to the whole column, because if one of the earlier values is a NaN, I do not want the average I fill in for that specific NaN to be tainted by values that came later, which the model should have no knowledge of. Essentially it's the common problem of not leaking information about the future to your model, here specifically when filling my time series.

Anyway, I have simplified my problem to a few lines of code. This is my simplified attempt at the above general problem:

# Assume ts_values is a time series where the first value in the list is the
# oldest value and the last value is the most recent.
import numpy as np

ts_values = [17.0, np.nan, 12.0, np.nan, 18.0]
nan_inds = np.argwhere(np.isnan(ts_values))
for nan_ind in nan_inds:
    nan_ind_value = nan_ind[0]
    # Fill each NaN with the mean of everything before it; earlier fills
    # are included in later means because we move left to right.
    ts_values[nan_ind_value] = np.mean(ts_values[0:nan_ind_value])

The output of the above script is:

[17.0, 17.0, 12.0, 15.333333333333334, 18.0]

which is exactly what I would expect.

My only issue with this is that it is linear time with respect to the number of NaNs in the data set (worse, in fact, since each np.mean call scans a prefix of the list). Is there a way to do this in constant or log time, without iterating through the NaN indices?

asked May 13 '19 by sometimesiwritecode



1 Answer

If you want NaN values replaced with an expanding (full-window) mean of a pandas Series s, you can do it in one line, noting (from WeNYoBen) that this does not continue the expanding-mean calculation through the values it fills, so the 15.33 in your example becomes a 14.5:

s.fillna(s.expanding(1).mean())
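
For example, a quick check on the series from your question (the variable name s is just illustrative):

import numpy as np
import pandas as pd

s = pd.Series([17.0, np.nan, 12.0, np.nan, 18.0])
# The expanding mean skips NaNs; filled values are not fed back in.
print(s.fillna(s.expanding(1).mean()).tolist())
# [17.0, 17.0, 12.0, 14.5, 18.0]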

If you would like the mean to update as NaNs are filled (so earlier fills feed into later means), this in-place numba solution may help:

import numpy as np
from numba import jit


@jit(nopython=True)
def rolling_fill(a):
    # Fill each NaN with the mean of everything before it; because we scan
    # left to right, earlier fills are included in later means.
    # (Assumes a[0] is not NaN.)
    for i, e in enumerate(a):
        if np.isnan(e):
            a[i] = np.mean(a[:i])

ts_values = np.array([17.0, np.nan, 12.0, np.nan, 18.0])
rolling_fill(ts_values)
print(ts_values)

which gives

[17.         17.         12.         15.33333333 18.        ]

You could probably improve this further by keeping a running sum instead of calling np.mean every time, as in the sketch below.
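
A minimal sketch of that idea (the function name rolling_fill_fast is just illustrative; like the version above, it assumes the first element is not NaN):

import numpy as np
from numba import jit


@jit(nopython=True)
def rolling_fill_fast(a):
    # Maintain a running sum and count of everything seen so far, so each
    # fill is O(1) and the whole pass is O(n).
    total = 0.0
    count = 0
    for i in range(len(a)):
        if np.isnan(a[i]):
            a[i] = total / count  # assumes a[0] is not NaN
        total += a[i]
        count += 1

ts_values = np.array([17.0, np.nan, 12.0, np.nan, 18.0])
rolling_fill_fast(ts_values)
print(ts_values)  # [17. 17. 12. 15.33333333 18.]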

Time Complexity

This is not log or constant time: you must fill up to n - 2 missing items in an array of length n, so any correct solution is at least O(n), and you cannot do theoretically better. But the version above is already well optimized (it avoids iterating in native Python), and lower-level implementations like this make it dramatically faster in practice.


EDIT: I originally misread the question and thought you were asking about interpolation.

If you would like to interpolate the series instead, pandas supports this directly:

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([0, 1, np.nan, 5])
>>> s
0    0.0
1    1.0
2    NaN
3    5.0
dtype: float64
>>> s.interpolate()
0    0.0
1    1.0
2    3.0
3    5.0
dtype: float64

Or, if you do not want to use pandas because your example is an ndarray, use numpy.interp accordingly.
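
A minimal sketch of the numpy route, on the array from your question (note this interpolates linearly between neighbors rather than filling with a mean):

import numpy as np

ts_values = np.array([17.0, np.nan, 12.0, np.nan, 18.0])
mask = np.isnan(ts_values)
x = np.arange(len(ts_values))
# Interpolate the missing points from the known (x, value) pairs.
ts_values[mask] = np.interp(x[mask], x[~mask], ts_values[~mask])
print(ts_values)  # [17.  14.5 12.  15.  18. ]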

answered Oct 18 '22 by modesitt