
Interpolate and fill pandas dataframe with datetime index

Tags: python, pandas

Hi, I'm trying to interpolate a DataFrame that has a DatetimeIndex.

Here's how I load the data:

query = "SELECT DATETIME,VALUE FROM {} WHERE DATETIME > ? AND DATETIME < ?".format(table)
rows = cursor.execute(query, [start, end]).fetchall()
res = pd.DataFrame(rows, columns=['date', 'value'])
res.set_index('date', inplace=True)

which produces

2013-01-31 00:00:00   517  
2012-12-31 00:00:00   263  
2012-11-30 00:00:00  1917  
2012-10-31 00:00:00   391  
2012-09-30 00:00:00   782  
2012-08-31 00:00:00   700  
2012-07-31 00:00:00   799  
2012-06-30 00:00:00   914  
2012-05-31 00:00:00   141  
2012-04-30 00:00:00   342  
2012-03-31 00:00:00   199  
2012-02-29 00:00:00   533  
2012-01-31 00:00:00  1393  
2011-12-31 00:00:00   497  
2011-11-30 00:00:00  1457  
2011-10-31 00:00:00   997  
2011-09-30 00:00:00   533  
2011-08-31 00:00:00   626  
2011-07-31 00:00:00  1933  
2011-06-30 00:00:00  4248  
2011-05-31 00:00:00  1248  
2011-04-30 00:00:00   904  
2011-03-31 00:00:00  3280  
2011-02-28 00:00:00   390  
2011-01-31 00:00:00   601  
2010-12-31 00:00:00   423  
2010-11-30 00:00:00   748  
2010-10-31 00:00:00   433  
2010-09-30 00:00:00   734  
2010-08-31 00:00:00   845  
2010-07-31 00:00:00  1693  
2010-06-30 00:00:00  2742  
2010-05-31 00:00:00   669  

The dates are not contiguous: there is only one value per month. I want a daily value, so I need to fill in the missing days using some kind of interpolation.

I first tried reindexing to a daily index and then interpolating:

from datetime import date

new_index = pd.date_range(date(2010,1,1), date(2014,1,31), freq='D')
df2 = res.reindex(new_index)  # every value comes back as NaN
df2.interpolate('cubic')      # fails with TypeError: Cannot interpolate with all NaNs.

What I hope to get back is a DataFrame with one row for every date between 2010 and 2014, each holding an interpolated value calculated from the points surrounding it.

It seems like there is probably a simple way to do this, but I'm not sure what it is.

Asked by Delta_Fore on May 05 '15 at 14:05




2 Answers

Here's one way to do it.

First, reindex to a daily index that spans the earliest to the latest date in df.index:

In [152]: df_reindexed = df.reindex(pd.date_range(start=df.index.min(),
                                                  end=df.index.max(),
                                                  freq='1D'))                  
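If reindex gives back nothing but NaN, as it does in the question, one possible cause is that the dates came out of the database as strings rather than timestamps. That is an assumption on my part, since the question does not show the index dtype. A minimal sketch of the symptom and the fix, using a hypothetical two-row stand-in for the query result:

```python
import pandas as pd

# Hypothetical stand-in for the SQL result: the dates arrive as plain strings.
res = pd.DataFrame(
    [("2010-06-30 00:00:00", 2742), ("2010-05-31 00:00:00", 669)],
    columns=["date", "value"],
).set_index("date")

new_index = pd.date_range("2010-05-31", "2010-06-30", freq="D")

# String labels never match Timestamp labels, so every row comes back NaN.
all_nan = res.reindex(new_index)

# Converting the index to datetimes first lets reindex line the rows up.
res.index = pd.to_datetime(res.index)
aligned = res.sort_index().reindex(new_index)

print(all_nan["value"].isna().all())    # True: nothing matched
print(aligned["value"].notna().sum())   # 2: both known points survive
```

Checking res.index.dtype before reindexing is a quick way to tell whether this applies to your data.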

Then call interpolate(method='linear') on the reindexed frame to fill in the missing values.

In [153]: df_reindexed.interpolate(method='linear')                                                                      
Out[153]:                                                                                                                
                  Value                                                                                                  
2010-05-31   669.000000                                                                                                  
2010-06-01   738.100000                                                                                                  
2010-06-02   807.200000                                                                                                  
2010-06-03   876.300000                                                                                                  
2010-06-04   945.400000                                                                                                  
2010-06-05  1014.500000                                                                                                  
...                                                                                                  
2013-01-25   467.838710                                                                                                  
2013-01-26   476.032258                                                                                                  
2013-01-27   484.225806                                                                                                  
2013-01-28   492.419355                                                                                                  
2013-01-29   500.612903                                                                                                  
2013-01-30   508.806452                                                                                                  
2013-01-31   517.000000                                                                                                  

[977 rows x 1 columns]                                                                                                   
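To see where the first few numbers come from, here is a small sketch restricted to the first two points of the question's data (2010-05-31 and 2010-06-30). The two points are 30 days apart, so linear interpolation adds (2742 - 669) / 30 = 69.1 per day. The variable names are only for illustration.

```python
import pandas as pd

# Two consecutive month-end points taken from the question's data.
df = pd.DataFrame(
    {"value": [669, 2742]},
    index=pd.to_datetime(["2010-05-31", "2010-06-30"]),
)

# Reindex to a daily index; the 29 days in between become NaN.
daily = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="D"))

# Fill the gaps with evenly spaced values between the two known points.
filled = daily.interpolate(method="linear")

# Each daily step equals the total change divided by the 30 intervals.
step = (2742 - 669) / 30
print(step)
print(filled.loc[pd.Timestamp("2010-06-01"), "value"])
```

Note that method='linear' treats rows as equally spaced and ignores the index values. On a regular daily index that gives the same result as spacing by time.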
Answered by Zero on Oct 11 '22 at 01:10


Just as an add-on to @JohnGalt's answer, you could also use resample, which is slightly more convenient than reindex here:

df.resample('D').interpolate('cubic')

                  value
date                   
2010-05-31   669.000000
2010-06-01   830.400272
2010-06-02   983.988431
2010-06-03  1129.919466
2010-06-04  1268.348368
2010-06-05  1399.430127
2010-06-06  1523.319734

...

2010-06-25  2716.850752
2010-06-26  2729.445324
2010-06-27  2738.102544
2010-06-28  2742.977403
2010-06-29  2744.224892
2010-06-30  2742.000000
2010-07-01  2736.454249
2010-07-02  2727.725284
2010-07-03  2715.947277
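A self-contained sketch of the same idea, using three points from the question and assuming the index is already a DatetimeIndex. It uses 'linear' so it runs without extra dependencies; 'cubic' is handed off to SciPy, so that variant needs SciPy installed.

```python
import pandas as pd

# Three month-end points from the question, left in descending order on purpose.
df = pd.DataFrame(
    {"value": [1693, 2742, 669]},
    index=pd.to_datetime(["2010-07-31", "2010-06-30", "2010-05-31"]),
)
df.index.name = "date"

# Sorting first keeps the result independent of the input row order.
# resample('D') creates one row per day; interpolate fills the new rows.
daily = df.sort_index().resample("D").interpolate("linear")

print(len(daily))      # one row per day from 2010-05-31 through 2010-07-31
print(daily.head())
```

Compared with reindex, resample works out the daily range from the data itself, so there is no need to build the date_range by hand.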
Answered by JohnE on Oct 11 '22 at 02:10