Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Resampling and filling missing data in pandas

I have a raw dataset that looks like this:

df = pd.DataFrame({'speed': [66.8,67,67.1,70,69],
                   'time': ['2017-08-09T05:41:30.168Z', '2017-08-09T05:41:31.136Z', '2017-08-09T05:41:31.386Z', '2017-08-09T05:41:31.103Z','2017-08-09T05:41:35.563Z' ]})

I could do some processing on it to make it look like (removed microseconds):

df['time']= pd.to_datetime(df.time)
df['time'] = df['time'].apply(lambda x: x.replace(microsecond=0))

>>> df
   speed                time
0   66.8 2017-08-09 05:41:30
1   67.0 2017-08-09 05:41:31
2   67.1 2017-08-09 05:41:31
3   70.0 2017-08-09 05:41:31
4   69.0 2017-08-09 05:41:35

I need to now resample the data so that any entries that arrived at the same timestamp are averaged together, and for the timestamps that did not receive any data, use the last available value. Like:

   speed                time
0   66.80 2017-08-09 05:41:30
1   68.03 2017-08-09 05:41:31
2   70.00 2017-08-09 05:41:32
3   70.00 2017-08-09 05:41:33
4   70.00 2017-08-09 05:41:34
5   69.00 2017-08-09 05:41:35

I understand this might involve the use of groupby and resample, but being a beginner I find myself struggling with these. Any ideas on how to proceed?

I have tried this but I am getting wrong results:

df.groupby( [df["time"].dt.second]).mean()
          speed
time           
30    66.800000
31    68.033333
35    69.000000
like image 934
CaramelFix Avatar asked Aug 09 '17 09:08

CaramelFix


People also ask

How do I resample data in pandas?

Resample Pandas time-series data. The resample() function is used to resample time-series data. Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.

How do you backfill in pandas?

bfill() is used to backward fill the missing values in the dataset. It will backward fill the NaN values that are present in the pandas dataframe.


1 Answers

In [279]: df.resample('1S', on='time').mean().ffill()
Out[279]:
                         speed
time
2017-08-09 05:41:30  66.800000
2017-08-09 05:41:31  68.033333
2017-08-09 05:41:32  68.033333
2017-08-09 05:41:33  68.033333
2017-08-09 05:41:34  68.033333
2017-08-09 05:41:35  69.000000
like image 90
MaxU - stop WAR against UA Avatar answered Sep 23 '22 01:09

MaxU - stop WAR against UA