I have timestamped sensor data. Because of technical details, I get data from the sensors at approximately one minute intervals. The data may look like this:
tstamp val
0 2016-09-01 00:00:00 57
1 2016-09-01 00:01:00 57
2 2016-09-01 00:02:23 57
3 2016-09-01 00:03:04 57
4 2016-09-01 00:03:58 58
5 2016-09-01 00:05:00 60
Now, essentially, I would be extremely happy if I got all data at the exact minute, but I don't. The only way to conserve the distribution and have data at each minute is to interpolate. For example, between row indexes 1 and 2 there are 83 seconds, and the natural way to get a value at the exact minute is to interpolate between the two rows of data (in this case, it is 57, but that is not the case everywhere).
Right now, my approach is to do the following:
date = pd.to_datetime(df['measurement_tstamp'].iloc[0].date())
ts_d = df['measurement_tstamp'].dt.hour * 60 * 60 +\
df['measurement_tstamp'].dt.minute * 60 +\
df['measurement_tstamp'].dt.second
ts_r = np.arange(0, 24*60*60, 60)
data = scipy.interpolate.interp1d(x=ts_d, y=df['speed'].values)(ts_r)
req = pd.Series(data, index=pd.to_timedelta(ts_r, unit='s'))
req.index = date + req.index
But this feels rather drawn out and long to me. There are excellent pandas methods that do resampling, rounding, etc. I have been reading them all day, but it turns out that nothing does interpolation just the way I want it. resample
works like a groupby
and averages time points that fall together. fillna
does interpolation, but not after resample
has already altered the data by averaging.
Am I missing something, or is my approach the best there is?
For simplicity, assume that I group the data by day, and by sensor, so only a 24 hour period from one sensor is interpolated at a time.
Resampling is used to either increase the sample rate (make the image larger) or decrease it (make the image smaller). Interpolation is the process of calculating values between sample points. So, if you resample an image you can use interpolation to do it.
Linear interpolation works the best when we have many points.
Upsampling: Where you increase the frequency of the samples, such as from minutes to seconds. Downsampling: Where you decrease the frequency of the samples, such as from days to months.
d = df.set_index('tstamp')
t = d.index
r = pd.date_range(t.min().date(), periods=24*60, freq='T')
d.reindex(t.union(r)).interpolate('index').ix[r]
Note, periods=24*60
works on daily data, not on the sample provided in the question. For that sample, periods=6
will work.
5 years later, pandas has changed a bit (mostly the ix
function is deprecated). Anyway, I've rewritten piRSquared's answer to work with current pandas versions and also improved on the date range issue that the answer had:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({"tstamp": [
datetime(2016, 9, 1, 0, 0, 0),
datetime(2016, 9, 1, 0, 1, 0),
datetime(2016, 9, 1, 0, 2, 23),
datetime(2016, 9, 1, 0, 3, 4),
datetime(2016, 9, 1, 0, 3, 58),
datetime(2016, 9, 1, 0, 5, 0)],
"val": [57, 57, 57, 57, 58, 60]})
d = df.set_index('tstamp')
t = d.index
r = pd.date_range(t.min(), t.max(), freq='T')
d = d.reindex(t.union(r)).interpolate('index').loc[r]
d:
val
2016-09-01 00:00:00 57.000000
2016-09-01 00:01:00 57.000000
2016-09-01 00:02:00 57.000000
2016-09-01 00:03:00 57.000000
2016-09-01 00:04:00 58.064516
2016-09-01 00:05:00 60.000000
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With