Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Interpolate (or extrapolate) only small gaps in pandas dataframe

I have a pandas DataFrame with time as index (1 min Freq) and several columns worth of data. Sometimes the data contains NaN. If so, I want to interpolate only if the gap is not longer than 5 Minutes. In this case this would be a maximum of 5 consecutive NaNs. The data may look like this (several test cases, which show the problems):

import numpy as np
import pandas as pd
from datetime import datetime

start = datetime(2014,2,21,14,50)
data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
                         data={'a': [123.5, np.NaN, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
                               'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
                               'c': [123.5, 132.3, 136.3, 164.3] + [np.NaN]*4,
                               'd': [np.NaN]*8,
                               'e': [np.NaN]*7 + [2330.3],
                               'f': [np.NaN]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
                               'g': [2330.3] + [np.NaN]*7,
                               'h': [2330.3] + [np.NaN]*6 + [2777.7]})

It reads like this:

In [147]: data
Out[147]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7

I am aware of data.interpolate() but it has several flaws, as it produces this result, which is good for the columns a-e, but for the columns f-h it fails for different reasons::

                         a      b      c   d       e       f       g  \
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3   
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0  2330.3   
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3  2330.3   
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3  2330.3   
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3  2330.3   

                               h  
2014-02-21 14:50:00  2330.300000  
2014-02-21 14:51:00  2394.214286  
2014-02-21 14:52:00  2458.128571  
2014-02-21 14:53:00  2522.042857  
2014-02-21 14:54:00  2585.957143  
2014-02-21 14:55:00  2649.871429  
2014-02-21 14:56:00  2713.785714  
2014-02-21 14:57:00  2777.700000 

f) The gap consists of 4 minutes worth of NaNs in the beginning, they should be replaced by that value 2763.0 (i.e. extrapolating backwards in time)

g) The gap is longer than 5 minutes but still it gets extrapolated

h) The gap is longer than 5 minutes but still the gap is interpolated.

I understand those reasons, of course I nowhere specified that it should not interpolate longer gaps than 5 minutes. I understand that interpolate only extrapolates forward in time, but I want it to also extrapolate backward in time. Is there any known methods I can use for my problem, without reinventing the wheel?

Edit: The method data.interpolate accepts the input parameter limit, which defines the maximum number of consecutive NaNs to be substituted by interpolation. But this still interpolates up to the limit, but I want to go on with all NaNs in that case.

like image 369
Nras Avatar asked May 29 '15 15:05

Nras


People also ask

How do pandas interpolate missing values?

You can interpolate missing values ( NaN ) in pandas. DataFrame and Series with interpolate() . This article describes the following contents. Use dropna() and fillna() to remove missing values NaN or to fill them with a specific value.

How does interpolate work in pandas?

interpolate() function is basically used to fill NA values in the dataframe or series. But, this is a very powerful function to fill the missing values. It uses various interpolation technique to fill the missing values rather than hard-coding the value.

What are types of interpolation to fill missing values in series data?

We will now look at three different methods of interpolating the missing read values: forward-filling, backward-filling and interpolating.


1 Answers

So here is a mask that ought to solve the problem. Just interpolate and then apply the mask to reset appropriate values to NaN. Honestly, this was a bit more work than I realized it would be because I had to loop through each column but then groupby didn't quite work without me providing some dummy columns like 'ones'.

Anyway, I can explain if anything is unclear but really only a couple of the lines are somewhat hard to understand. See here for a little bit more of an explanation of the trick on the df['new'] line or just print out individual lines to better see what is going on.

mask = data.copy()
for i in list('abcdefgh'):
    df = pd.DataFrame( data[i] )
    df['new'] = ((df.notnull() != df.shift().notnull()).cumsum())
    df['ones'] = 1
    mask[i] = (df.groupby('new')['ones'].transform('count') < 5) | data[i].notnull()

In [7]: data
Out[7]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7

In [8]: mask
Out[8]: 
                        a     b     c      d      e     f      g      h
2014-02-21 14:50:00  True  True  True  False  False  True   True   True
2014-02-21 14:51:00  True  True  True  False  False  True  False  False
2014-02-21 14:52:00  True  True  True  False  False  True  False  False
2014-02-21 14:53:00  True  True  True  False  False  True  False  False
2014-02-21 14:54:00  True  True  True  False  False  True  False  False
2014-02-21 14:55:00  True  True  True  False  False  True  False  False
2014-02-21 14:56:00  True  True  True  False  False  True  False  False
2014-02-21 14:57:00  True  True  True  False   True  True  False   True

It's easy from there if you don't do anything fancier with respect to extrapolation:

In [9]: data.interpolate().bfill()[mask]
Out[9]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN  2763.0  2330.3  2330.3
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3     NaN  2777.7

Edit to add: Here's a faster (about 2x on this sample data) and slightly simpler way, by moving some stuff outside of the loop:

mask = data.copy()
grp = ((mask.notnull() != mask.shift().notnull()).cumsum())
grp['ones'] = 1
for i in list('abcdefgh'):
    mask[i] = (grp.groupby(i)['ones'].transform('count') < 5) | data[i].notnull()
like image 52
JohnE Avatar answered Oct 10 '22 05:10

JohnE