Hi, I'm trying to interpolate a DataFrame that has a DatetimeIndex.
Here's the data:
query = "SELECT DATETIME,VALUE FROM {} WHERE DATETIME > ? AND DATETIME < ?".format(table)
rows = cursor.execute(query, [start, end]).fetchall()
res = pd.DataFrame(rows, columns=['date', 'value'])
res.set_index('date', inplace=True)
which produces
2013-01-31 00:00:00 517
2012-12-31 00:00:00 263
2012-11-30 00:00:00 1917
2012-10-31 00:00:00 391
2012-09-30 00:00:00 782
2012-08-31 00:00:00 700
2012-07-31 00:00:00 799
2012-06-30 00:00:00 914
2012-05-31 00:00:00 141
2012-04-30 00:00:00 342
2012-03-31 00:00:00 199
2012-02-29 00:00:00 533
2012-01-31 00:00:00 1393
2011-12-31 00:00:00 497
2011-11-30 00:00:00 1457
2011-10-31 00:00:00 997
2011-09-30 00:00:00 533
2011-08-31 00:00:00 626
2011-07-31 00:00:00 1933
2011-06-30 00:00:00 4248
2011-05-31 00:00:00 1248
2011-04-30 00:00:00 904
2011-03-31 00:00:00 3280
2011-02-28 00:00:00 390
2011-01-31 00:00:00 601
2010-12-31 00:00:00 423
2010-11-30 00:00:00 748
2010-10-31 00:00:00 433
2010-09-30 00:00:00 734
2010-08-31 00:00:00 845
2010-07-31 00:00:00 1693
2010-06-30 00:00:00 2742
2010-05-31 00:00:00 669
The dates are not contiguous: there is only one value per month. I want a daily value, so I'd like to fill in the missing days using some kind of interpolation.
I first tried to set a new daily index and then interpolate:
from datetime import date
new_index = pd.date_range(date(2010,1,1),date(2014,1,31),freq='D')
df2 = res.reindex(new_index) # Every value comes back as NaN
df2.interpolate('cubic') # Fails with TypeError: Cannot interpolate with all NaNs.
What I hope to get back is a DataFrame with a row for every date between 2010 and 2014, each holding an interpolated value calculated from the points surrounding it.
There is probably a simple way to do this, but I'm not sure what it is.
Interpolation is one method of filling in data: it estimates unknown data points that lie between known ones. It is mostly used to impute missing values in a DataFrame or Series while pre-processing data.
You can interpolate missing values (NaN) in a pandas DataFrame or Series with interpolate(). Alternatively, use dropna() to remove missing values or fillna() to fill them with a specific value.
DatetimeIndex: an immutable ndarray of datetime64 data, represented internally as int64, whose elements can be boxed to Timestamp objects that are subclasses of datetime and carry metadata such as frequency information.
Linear interpolation in the forward direction: the 'linear' method ignores the index and treats the values as equally spaced, filling each gap on a straight line between the known values on either side. Filling runs forward by default, so a missing value at the first position is left as NaN.
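To make the "ignores the index" point concrete, here is a minimal sketch comparing method='linear' with method='time' on unevenly spaced dates. The series and its values are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Illustrative data: the gaps are 1 day and then 3 days.
idx = pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-05"])
s = pd.Series([0.0, np.nan, 40.0], index=idx)

# 'linear' treats the rows as equally spaced, so the gap gets the midpoint.
linear = s.interpolate(method="linear")

# 'time' weights by the actual time distance between index labels.
by_time = s.interpolate(method="time")

# A leading NaN has no earlier point to start from and is left as NaN.
leading = pd.Series([np.nan, 1.0, np.nan, 3.0]).interpolate()

print(linear.iloc[1], by_time.iloc[1])
print(leading.tolist())
```

Once the data has been reindexed to a regular daily grid, the two methods agree, because equal row spacing and equal time spacing then coincide.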
Here's one way to do it.
First, build a new daily index spanning the min and max dates of df.index:
In [152]: df_reindexed = df.reindex(pd.date_range(start=df.index.min(),
                                                  end=df.index.max(),
                                                  freq='1D'))
Then use interpolate(method='linear') on the reindexed frame to fill in the values:
In [153]: df_reindexed.interpolate(method='linear')
Out[153]:
Value
2010-05-31 669.000000
2010-06-01 738.100000
2010-06-02 807.200000
2010-06-03 876.300000
2010-06-04 945.400000
2010-06-05 1014.500000
...
2013-01-25 467.838710
2013-01-26 476.032258
2013-01-27 484.225806
2013-01-28 492.419355
2013-01-29 500.612903
2013-01-30 508.806452
2013-01-31 517.000000
[977 rows x 1 columns]
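As a self-contained sketch of the steps above: the rows below are a hypothetical stand-in for the SQL result, using three values from the question. It also assumes the dates come back from the database as strings, which is one plausible reason the reindex in the question produced only NaN: string labels never match the Timestamps in a date_range.

```python
import pandas as pd

# Hypothetical stand-in for the query result: string dates, newest first.
rows = [
    ("2010-07-31 00:00:00", 1693),
    ("2010-06-30 00:00:00", 2742),
    ("2010-05-31 00:00:00", 669),
]
res = pd.DataFrame(rows, columns=["date", "value"])
daily_index = pd.date_range("2010-05-31", "2010-07-31", freq="D")

# With a string index, no label matches the daily Timestamps: all NaN.
broken = res.set_index("date").reindex(daily_index)
all_nan = bool(broken["value"].isna().all())

# Convert to Timestamps and sort, then reindex and interpolate.
res["date"] = pd.to_datetime(res["date"])
res = res.set_index("date").sort_index()
daily = res.reindex(daily_index).interpolate(method="linear")

print(all_nan)
print(daily.head())
```

If the index is already a DatetimeIndex, the to_datetime step is harmless and the rest works unchanged.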
Just as an add-on to @JohnGalt's answer, you could also use resample, which is slightly more convenient than reindex here:
df.resample('D').interpolate('cubic')
value
date
2010-05-31 669.000000
2010-06-01 830.400272
2010-06-02 983.988431
2010-06-03 1129.919466
2010-06-04 1268.348368
2010-06-05 1399.430127
2010-06-06 1523.319734
...
2010-06-25 2716.850752
2010-06-26 2729.445324
2010-06-27 2738.102544
2010-06-28 2742.977403
2010-06-29 2744.224892
2010-06-30 2742.000000
2010-07-01 2736.454249
2010-07-02 2727.725284
2010-07-03 2715.947277
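The following sketch shows that resample('D') builds the same daily grid as the reindex approach. The three rows are taken from the question; the variable names are illustrative. It uses 'linear' because 'cubic' is delegated to SciPy and so needs SciPy installed.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [669.0, 2742.0, 1693.0]},
    index=pd.to_datetime(["2010-05-31", "2010-06-30", "2010-07-31"]),
)
df.index.name = "date"

# resample('D').asfreq() inserts the missing days as NaN rows,
# which is the same grid as reindexing onto a daily date_range.
via_resample = df.resample("D").asfreq()
via_reindex = df.reindex(
    pd.date_range(df.index.min(), df.index.max(), freq="D")
)
same_grid = via_resample.index.equals(via_reindex.index)

# Upsample and fill in one step; swap in 'cubic' if SciPy is available.
daily = df.resample("D").interpolate("linear")

print(same_grid)
print(daily.head())
```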