
Fill in time data in pandas

Tags: python, pandas

I have data sampled every 15 seconds, but some values are missing. The gaps are not marked with NaN; the rows are simply not present. How can I fill in those values?
I have tried resampling, but that also shifts my original data. Why doesn't this work?

import pandas as pd

a = pd.Series([1., 3., 4., 3., 5.],
              ['2016-05-25 00:00:35', '2016-05-25 00:00:50', '2016-05-25 00:01:05',
               '2016-05-25 00:01:35', '2016-05-25 00:02:05'])
a.index = pd.to_datetime(a.index)
a.resample('15S').mean()

In [368]: a
Out[368]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:35    3.0
2016-05-25 00:02:05    5.0
dtype: float64

It shows me this:

2016-05-25 00:00:30    1.0
2016-05-25 00:00:45    3.0
2016-05-25 00:01:00    4.0
2016-05-25 00:01:15    NaN
2016-05-25 00:01:30    3.0
2016-05-25 00:01:45    NaN
2016-05-25 00:02:00    5.0
Freq: 15S, dtype: float64

So I no longer have values at 00:00:35 or 00:00:50.
In my original, larger data set, I also end up with many NaN values in large groups at the end of the resampled data.
What I would like to do is resample my 15-second data to 15 seconds, so that whenever no data is present for a particular time, the gap is filled with the mean of the values around it. Is there a way to do that?
Also, why does the time basis change when I resample? My original data starts at 00:00:35, but after resampling it starts at 00:00:30. It seems to have been shifted by 5 seconds.
In my example data, all it should have done is create additional entries at 00:01:20 and 00:01:50.


Edit

I realized that my data is slightly more complex than I had thought. The 'base' actually changes partway through it. If I use the solution below, it works for part of the data, but then the values stop changing. For example:

a = pd.Series([1., 3., 4., 3., 5., 6., 7., 8.],
              ['2016-05-25 00:00:35', '2016-05-25 00:00:50', '2016-05-25 00:01:05',
               '2016-05-25 00:01:35', '2016-05-25 00:02:05', '2016-05-25 00:03:00',
               '2016-05-25 00:04:00', '2016-05-25 00:06:00'])

In [79]: a
Out[79]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:35    3.0
2016-05-25 00:02:05    5.0
2016-05-25 00:03:00    6.0
2016-05-25 00:04:00    7.0
2016-05-25 00:06:00    8.0
dtype: float64

In [80]: a.index = pd.to_datetime(a.index)

In [81]: a.resample('15S', base=5).interpolate()
Out[81]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:20    3.5
2016-05-25 00:01:35    3.0
2016-05-25 00:01:50    4.0
2016-05-25 00:02:05    5.0
2016-05-25 00:02:20    5.0
2016-05-25 00:02:35    5.0
2016-05-25 00:02:50    5.0
2016-05-25 00:03:05    5.0
2016-05-25 00:03:20    5.0
2016-05-25 00:03:35    5.0
2016-05-25 00:03:50    5.0
2016-05-25 00:04:05    5.0
2016-05-25 00:04:20    5.0
2016-05-25 00:04:35    5.0
2016-05-25 00:04:50    5.0
2016-05-25 00:05:05    5.0
2016-05-25 00:05:20    5.0
2016-05-25 00:05:35    5.0
2016-05-25 00:05:50    5.0
Freq: 15S, dtype: float64

As you can see, it stops interpolating after 00:02:05 and seems to ignore the data at 00:03:00, 00:04:00 and 00:06:00.

Adam asked Dec 25 '22 01:12


1 Answer

Both @IanS and @piRSquared address the shifting of the base. As for filling NaNs: pandas has methods for forward-filling (.ffill()/.pad()) and backward-filling (.bfill()/.backfill()), but not for taking the mean. A quick way of doing it is by taking the mean manually:

b = a.resample('15S', base=5)
(b.ffill() + b.bfill()) / 2

Output:

2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:20    3.5
2016-05-25 00:01:35    3.0
2016-05-25 00:01:50    4.0
2016-05-25 00:02:05    5.0
Freq: 15S, dtype: float64
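
The snippet above relies on base=5 to move the bin edges from multiples of 15 seconds past midnight (00:00:30, 00:00:45, ...) onto the timestamps actually present in the data. As far as I know, the base argument was deprecated in pandas 1.1 and removed in 2.0, with offset='5s' as its replacement. The sketch below avoids resample altogether and gets the same result by reindexing onto a grid that starts at the first timestamp; it assumes the sample series from the question and that every reading sits exactly on that grid. The names grid, b and filled are only illustrative.

```python
import pandas as pd

# Sample data from the question: readings every 15 s, two of them missing.
a = pd.Series(
    [1.0, 3.0, 4.0, 3.0, 5.0],
    index=pd.to_datetime([
        "2016-05-25 00:00:35",
        "2016-05-25 00:00:50",
        "2016-05-25 00:01:05",
        "2016-05-25 00:01:35",
        "2016-05-25 00:02:05",
    ]),
)

# Regular 15-second grid anchored at the first timestamp (00:00:35),
# so no reading is relabelled to a different time.
grid = pd.date_range(a.index[0], a.index[-1], freq="15s")

# Reindexing inserts NaN at the grid points that have no reading.
b = a.reindex(grid)

# Mean of the previous and next valid values; existing readings are unchanged
# because ffill and bfill both return the reading itself there.
filled = (b.ffill() + b.bfill()) / 2
print(filled)
```

For a gap of a single slot this matches linear interpolation. For longer gaps every missing slot gets the same midpoint value rather than a ramp, which may or may not be what you want.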

EDIT: I stand corrected: there is a built-in method: .interpolate().

a.resample('15S', base=5).interpolate()
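
Regarding the Edit in the question: a likely explanation for the flat run of 5.0 is that resample(...).interpolate() first places the data on the regular grid, so readings that do not coincide with a grid point (00:03:00, 00:04:00, 00:06:00) are dropped before interpolation, and the trailing NaNs are then filled with the last value that survived. This is my reading of the output shown, and the behaviour may differ between pandas versions. One way around it, sketched below under the assumption that timestamps are unique and sorted, is to interpolate over the union of the original index and the target grid, weighting by elapsed time, and only then keep the grid points. The helper name interpolate_onto_grid is hypothetical, not a pandas function.

```python
import pandas as pd


def interpolate_onto_grid(series, freq="15s"):
    """Time-weighted interpolation of `series` onto a regular grid.

    Hypothetical helper: the grid starts at the first timestamp and
    readings that fall between grid points still influence the result.
    """
    grid = pd.date_range(series.index[0], series.index[-1], freq=freq)
    # Keep every original reading alongside the grid points.
    combined = series.index.union(grid)
    # method="time" weights by the actual spacing between timestamps.
    dense = series.reindex(combined).interpolate(method="time")
    # Return only the regular grid.
    return dense.reindex(grid)


# Data from the Edit: the spacing changes after 00:02:05.
a = pd.Series(
    [1.0, 3.0, 4.0, 3.0, 5.0, 6.0, 7.0, 8.0],
    index=pd.to_datetime([
        "2016-05-25 00:00:35",
        "2016-05-25 00:00:50",
        "2016-05-25 00:01:05",
        "2016-05-25 00:01:35",
        "2016-05-25 00:02:05",
        "2016-05-25 00:03:00",
        "2016-05-25 00:04:00",
        "2016-05-25 00:06:00",
    ]),
)

result = interpolate_onto_grid(a)
print(result)
```

With this approach the values keep rising after 00:02:05 instead of staying at 5.0, because the readings at 00:03:00, 00:04:00 and 00:06:00 take part in the interpolation even though they are not grid points themselves.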

A. Garcia-Raboso answered Dec 26 '22 13:12