 

How to efficiently fill a time series?

My general problem is that I have a dataframe where columns correspond to feature values. There is also a date column in the dataframe. Each feature column may contain missing (NaN) values. I want to fill a column with some fill logic such as "fill_mean" or "fill_zero".

But I do not want to just apply the fill logic to the whole column, because if one of the earlier values is a NaN, I do not want the average I fill in for that specific NaN to be tainted by values that came later, which the model should have no knowledge of. Essentially it's the common problem of not leaking information about the future to your model, here specifically when filling my time series.

Anyway, I have simplified my problem to a few lines of code. This is my simplified attempt at the above general problem:

# Assume ts_values is a time series where the first value in the list is the
# oldest value and the last value is the most recent.
import numpy as np

ts_values = [17.0, np.nan, 12.0, np.nan, 18.0]
nan_inds = np.argwhere(np.isnan(ts_values))
for nan_ind in nan_inds:
    nan_ind_value = nan_ind[0]
    # Fill each NaN with the mean of everything before it; earlier fills
    # are included in later means because we move left to right.
    ts_values[nan_ind_value] = np.mean(ts_values[0:nan_ind_value])

The output of the above script is:

[17.0, 17.0, 12.0, 15.333333333333334, 18.0]

which is exactly what I would expect.

My only issue with this is that it is linear time with respect to the number of NaNs in the data set (worse, in fact, since each np.mean call scans a prefix of the list). Is there a way to do this in constant or log time, without iterating through the NaN indices?

asked May 13 '19 by sometimesiwritecode



1 Answer

If you want NaN values replaced with an expanding (full-window) mean of a pandas Series s, you can do it in one line, noting (from WeNYoBen) that this does not continue the expanding-mean calculation through the values it fills, so the 15.33 in your example becomes a 14.5:

s.fillna(s.expanding(1).mean())
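
For example, a quick check on the series from your question (the variable name s is just illustrative):

import numpy as np
import pandas as pd

s = pd.Series([17.0, np.nan, 12.0, np.nan, 18.0])
# The expanding mean skips NaNs; filled values are not fed back in.
print(s.fillna(s.expanding(1).mean()).tolist())
# [17.0, 17.0, 12.0, 14.5, 18.0]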

If you would like the mean to update as NaNs are filled (so earlier fills feed into later means), this in-place numba solution may help:

import numpy as np
from numba import jit


@jit(nopython=True)
def rolling_fill(a):
    # Fill each NaN with the mean of everything before it; because we scan
    # left to right, earlier fills are included in later means.
    # (Assumes a[0] is not NaN.)
    for i, e in enumerate(a):
        if np.isnan(e):
            a[i] = np.mean(a[:i])

ts_values = np.array([17.0, np.nan, 12.0, np.nan, 18.0])
rolling_fill(ts_values)
print(ts_values)

which gives

[17.         17.         12.         15.33333333 18.        ]

You could probably improve this further by keeping a running sum instead of calling np.mean every time, as in the sketch below.
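
A minimal sketch of that idea (the function name rolling_fill_fast is just illustrative; like the version above, it assumes the first element is not NaN):

import numpy as np
from numba import jit


@jit(nopython=True)
def rolling_fill_fast(a):
    # Maintain a running sum and count of everything seen so far, so each
    # fill is O(1) and the whole pass is O(n).
    total = 0.0
    count = 0
    for i in range(len(a)):
        if np.isnan(a[i]):
            a[i] = total / count  # assumes a[0] is not NaN
        total += a[i]
        count += 1

ts_values = np.array([17.0, np.nan, 12.0, np.nan, 18.0])
rolling_fill_fast(ts_values)
print(ts_values)  # [17. 17. 12. 15.33333333 18.]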

Time Complexity

This is not log or constant time: you must fill up to n - 2 missing items in an array of length n, so any correct solution is at least O(n), and you cannot do theoretically better. But the version above is already well optimized (it avoids iterating in native Python), and lower-level implementations like this make it dramatically faster in practice.


EDIT: I originally misread the question and thought you were asking about interpolation.

If you would like to interpolate the series instead, pandas supports this directly:

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([0, 1, np.nan, 5])
>>> s
0    0.0
1    1.0
2    NaN
3    5.0
dtype: float64
>>> s.interpolate()
0    0.0
1    1.0
2    3.0
3    5.0
dtype: float64

Or, if you do not want to use pandas because your example is an ndarray, use numpy.interp accordingly.
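
A minimal sketch of the numpy route, on the array from your question (note this interpolates linearly between neighbors rather than filling with a mean):

import numpy as np

ts_values = np.array([17.0, np.nan, 12.0, np.nan, 18.0])
mask = np.isnan(ts_values)
x = np.arange(len(ts_values))
# Interpolate the missing points from the known (x, value) pairs.
ts_values[mask] = np.interp(x[mask], x[~mask], ts_values[~mask])
print(ts_values)  # [17.  14.5 12.  15.  18. ]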

answered Oct 18 '22 by modesitt