I have data sampled every 15 seconds, but some values are missing. The missing values are not marked with NaN; the rows are simply absent. How can I fill in those values?
I have tried to resample, but that also shifts my original data. So, why doesn't this work:
import pandas as pd
a = pd.Series([1., 3., 4., 3., 5.], ['2016-05-25 00:00:35', '2016-05-25 00:00:50', '2016-05-25 00:01:05', '2016-05-25 00:01:35', '2016-05-25 00:02:05'])
a.index = pd.to_datetime(a.index)
a.resample('15S').mean()
In [368]: a
Out[368]:
2016-05-25 00:00:35 1.0
2016-05-25 00:00:50 3.0
2016-05-25 00:01:05 4.0
2016-05-25 00:01:35 3.0
2016-05-25 00:02:05 5.0
dtype: float64
The resample call shows me this:
2016-05-25 00:00:30 1.0
2016-05-25 00:00:45 3.0
2016-05-25 00:01:00 4.0
2016-05-25 00:01:15 NaN
2016-05-25 00:01:30 3.0
2016-05-25 00:01:45 NaN
2016-05-25 00:02:00 5.0
Freq: 15S, dtype: float64
So, I no longer have a value at 00:35 or 00:50.
For my original, larger data set, I also end up seeing many NaN values in large groups at the end of the resampled data.
What I would like to do is resample my 15 s data to 15 s, so that whenever there is no data present for a particular time, the mean of the surrounding values is used to fill it in. Is there a way to do that?
Also, why does the time basis change when I resample? My original data starts at 00:00:35, and after resampling it starts at 00:00:30. It seems like it got shifted by 5 seconds.
In my example data, all it should have done is create an additional data entry at 00:01:50.
Edit
I realized that my data is slightly more complex than I had thought. The 'base' actually changes partway through it. If I use the solution below, it works for part of the data, but then the values stop changing. For example:
a = pd.Series([1.,3.,4.,3.,5.,6.,7.,8.], ['2016-05-25 00:00:35','2016-05-25 00:00:50','2016-05-25 00:01:05','2016-05-25 00:01:35','2016-05-25 00:02:05','2016-05-25 00:03:00','2016-05-25 00:04:00','2016-05-25 00:06:00'])
In [79]: a
Out[79]:
2016-05-25 00:00:35 1.0
2016-05-25 00:00:50 3.0
2016-05-25 00:01:05 4.0
2016-05-25 00:01:35 3.0
2016-05-25 00:02:05 5.0
2016-05-25 00:03:00 6.0
2016-05-25 00:04:00 7.0
2016-05-25 00:06:00 8.0
dtype: float64
In [80]: a.index = pd.to_datetime(a.index)
In [81]: a.resample('15S', base=5).interpolate()
Out[81]:
2016-05-25 00:00:35 1.0
2016-05-25 00:00:50 3.0
2016-05-25 00:01:05 4.0
2016-05-25 00:01:20 3.5
2016-05-25 00:01:35 3.0
2016-05-25 00:01:50 4.0
2016-05-25 00:02:05 5.0
2016-05-25 00:02:20 5.0
2016-05-25 00:02:35 5.0
2016-05-25 00:02:50 5.0
2016-05-25 00:03:05 5.0
2016-05-25 00:03:20 5.0
2016-05-25 00:03:35 5.0
2016-05-25 00:03:50 5.0
2016-05-25 00:04:05 5.0
2016-05-25 00:04:20 5.0
2016-05-25 00:04:35 5.0
2016-05-25 00:04:50 5.0
2016-05-25 00:05:05 5.0
2016-05-25 00:05:20 5.0
2016-05-25 00:05:35 5.0
2016-05-25 00:05:50 5.0
Freq: 15S, dtype: float64
As you can see, it stops interpolating after 2:05 and seems to ignore the data at 3:00, 4:00 and 6:00.
Both @IanS and @piRSquared address the shifting of the base. As for filling NaNs: pandas has methods for forward-filling (.ffill()/.pad()) and backward-filling (.bfill()/.backfill()), but not for taking the mean. A quick way of doing it is by taking the mean manually:
b = a.resample('15S', base=5)
(b.ffill() + b.bfill()) / 2
Output:
2016-05-25 00:00:35 1.0
2016-05-25 00:00:50 3.0
2016-05-25 00:01:05 4.0
2016-05-25 00:01:20 3.5
2016-05-25 00:01:35 3.0
2016-05-25 00:01:50 4.0
2016-05-25 00:02:05 5.0
Freq: 15S, dtype: float64
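A note for newer pandas: as far as I know, the base argument was deprecated in pandas 1.1 and removed in 2.0, with offset (or origin) taking its place. Here is a minimal sketch of the same ffill/bfill average, assuming a recent pandas where resample accepts offset; origin='start' should anchor the bins at the first sample in the same way.

```python
import pandas as pd

a = pd.Series(
    [1., 3., 4., 3., 5.],
    index=pd.to_datetime([
        '2016-05-25 00:00:35', '2016-05-25 00:00:50', '2016-05-25 00:01:05',
        '2016-05-25 00:01:35', '2016-05-25 00:02:05',
    ]),
)

# offset='5s' moves the bin edges from :00/:15/:30/:45 to :05/:20/:35/:50,
# which is where the original samples sit, so nothing appears shifted.
b = a.resample('15s', offset='5s')

# Average of the previous and the next known value. Rows that already
# have a value are unchanged, because ffill and bfill agree there.
filled = (b.ffill() + b.bfill()) / 2
print(filled)
```

This is the mean of the two nearest neighbours, which only matches linear interpolation when a single sample is missing between them.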
EDIT: I stand corrected: there is a built-in method: .interpolate().
a.resample('15S', base=5).interpolate()
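Regarding the data in the question's edit, where the spacing changes partway through: my understanding is that, at least in the pandas version used there, resample(...).interpolate() first conforms the series to the 15-second grid and only then interpolates. Samples that do not fall on the grid (03:00, 04:00, 06:00) are dropped, so nothing is left to interpolate towards after 02:05. One possible workaround, sketched below under that assumption, is to interpolate on the union of the original timestamps and the target grid, then keep only the grid points. The names grid, combined and on_grid are mine, not from the question.

```python
import pandas as pd

a = pd.Series(
    [1., 3., 4., 3., 5., 6., 7., 8.],
    index=pd.to_datetime([
        '2016-05-25 00:00:35', '2016-05-25 00:00:50', '2016-05-25 00:01:05',
        '2016-05-25 00:01:35', '2016-05-25 00:02:05', '2016-05-25 00:03:00',
        '2016-05-25 00:04:00', '2016-05-25 00:06:00',
    ]),
)

# Regular 15-second grid anchored at the first sample (00:00:35).
grid = pd.date_range(a.index[0], a.index[-1], freq='15s')

# Add the grid timestamps as NaN rows next to the original samples,
# so the off-grid samples still take part in the interpolation.
combined = a.reindex(a.index.union(grid))

# method='time' weights by the actual time gaps between samples;
# the final reindex keeps only the points on the regular grid.
on_grid = combined.interpolate(method='time').reindex(grid)
print(on_grid)
```

With this, the values keep changing after 02:05, for example the point at 00:03:05 lies between the samples at 03:00 and 04:00. The off-grid samples themselves do not appear in the result; they only shape the interpolated values.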