Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python pandas time series interpolation and regularization

I am using Python Pandas for the first time. I have 5-min lag traffic data in csv format:

...
2015-01-04 08:29:05,271238
2015-01-04 08:34:05,329285
2015-01-04 08:39:05,-1
2015-01-04 08:44:05,260260
2015-01-04 08:49:05,263711
...

There are several issues:

  • for some timestamps there's missing data (-1)
  • missing entries (also 2/3 consecutive hours)
  • the frequency of the observations is not exactly 5 minutes, but actually loses some seconds once in a while

I would like to obtain a regular time series, so with entries every (exactly) 5 minutes (and no missing valus). I have successfully interpolated the time series with the following code to approximate the -1 values with this code:

ts = pd.TimeSeries(values, index=timestamps)
ts.interpolate(method='cubic', downcast='infer')

How can I both interpolate and regularize the frequency of the observations? Thank you all for the help.

like image 932
riccamini Avatar asked May 29 '15 12:05

riccamini


People also ask

Is pandas good for time series?

Pandas' time series tools are very useful when data is timestamped. Timestamp is the pandas equivalent of python's Datetime. It's the type used for the entries that make up a DatetimeIndex, and other timeseries-oriented data structures in pandas.

How does interpolate work in pandas?

Interpolation is one such method of filling data. Interpolation is a technique in Python used to estimate unknown data points between two known data points. Interpolation is mostly used to impute missing values in the dataframe or series while pre-processing data.

What is interpolation time series?

Interpolation is mostly used while working with time-series data because in time-series data we like to fill missing values with previous one or two values. for example, suppose temperature, now we would always prefer to fill today's temperature with the mean of the last 2 days, not with the mean of the month.


1 Answers

Change the -1s to NaNs:

ts[ts==-1] = np.nan

Then resample the data to have a 5 minute frequency.

ts = ts.resample('5T')

Note that, by default, if two measurements fall within the same 5 minute period, resample averages the values together.

Finally, you could linearly interpolate the time series according to the time:

ts = ts.interpolate(method='time')

Since it looks like your data already has roughly a 5-minute frequency, you might need to resample at a shorter frequency so cubic or spline interpolation can smooth out the curve:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

values = [271238, 329285, -1, 260260, 263711]
timestamps = pd.to_datetime(['2015-01-04 08:29:05',
                             '2015-01-04 08:34:05',
                             '2015-01-04 08:39:05',
                             '2015-01-04 08:44:05',
                             '2015-01-04 08:49:05'])

ts = pd.Series(values, index=timestamps)
ts[ts==-1] = np.nan
ts = ts.resample('T').mean()

ts.interpolate(method='spline', order=3).plot()
ts.interpolate(method='time').plot()
lines, labels = plt.gca().get_legend_handles_labels()
labels = ['spline', 'time']
plt.legend(lines, labels, loc='best')
plt.show()

enter image description here

like image 108
unutbu Avatar answered Sep 22 '22 13:09

unutbu