Update: I've provided a brief analysis of the three answers at the bottom of the question text and explained my choices.
My Question: What is the most efficient method of building a fixed interval dataset from a random interval dataset using stale data?
Some background: The above is a common problem in statistics. Frequently, one has a sequence of observations occurring at random times. Call it Input
. But one wants a sequence of observations occurring say, every 5 minutes. Call it Output
. One of the most common methods to build this dataset is using stale data, i.e. set each observation in Output
equal to the most recently occurring observation in Input
.
So, here is some code to build example datasets:
TInput = 100;
TOutput = 50;
InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
Input = [InputTimeStamp, randn(TInput, 1)];
OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)';
Output = [OutputTimeStamp, NaN(TOutput, 1)];
Both datasets start at close to midnight at the turn of the millennium. However, the timestamps in Input
occur at random intervals while the timestamps in Output
occur at fixed intervals. For simplicity, I have ensured that the first observation in Input
always occurs before the first observation in Output
. Feel free to make this assumption in any answers.
Currently, I solve the problem like this:
sMax = size(Output, 1);
tMax = size(Input, 1);
s = 1;
t = 2;
%#Loop over input data
while t <= tMax
if Input(t, 1) > Output(s, 1)
%#If current obs in Input occurs after current obs in output then set current obs in output equal to previous obs in input
Output(s, 2:end) = Input(t-1, 2:end);
s = s + 1;
%#Check if we've filled out all observations in output
if s > sMax
break
end
%#This step is necessary in case we need to use the same input observation twice in a row
t = t - 1;
end
t = t + 1;
if t > tMax
%#If all remaining observations in output occur after last observation in input, then use last obs in input for all remaining obs in output
Output(s:end, 2:end) = Input(end, 2:end);
break
end
end
Surely there is a more efficient, or at least, more elegant way to solve this problem? As I mentioned, this is a common problem in statistics. Perhaps Matlab has some in-built function I'm not aware of? Any help would be much appreciated as I use this routine a LOT for some large datasets.
THE ANSWERS: Hi all, I've analyzed the three answers, and as they stand, Angainor's is the best.
ChthonicDaemon's answer, while clearly the easiest to implement, is really slow. This is true even when the conversion to a timeseries
object is done outside of the speed test. I'm guessing the resample
function has a lot of overhead at the moment. I am running 2011b, so it is possible Mathworks have improved it in the intervening time. Also, this method needs an additional line for the case where Output
ends more than one observation after Input
.
Rody's answer runs only slightly slower than Angainor's (unsurprising given they both employ the histc
approach), however, it seems to have some problems. First, the method of assigning the last observation in Output
is not robust to the last observation in Input
occurring after the last observation in Output
. This is an easy fix. But there is a second problem which I think stems from having InputTimeStamp
as the first input to histc
instead of the OutputTimeStamp
adopted by Angainor. The problem emerges if you change OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)';
to OutputTimeStamp = 730486.002 + (0:0.0001:TOutput * 0.0001 - 0.0001)';
when setting up the example inputs.
Angainor's appears robust to everything I threw at it, plus it was the fastest.
I did a lot of speed tests for different input specifications - the following numbers are fairly representative:
My naive loop: Elapsed time is 8.579535 seconds.
Angainor: Elapsed time is 0.661756 seconds.
Rody: Elapsed time is 0.913304 seconds.
ChthonicDaemon: Elapsed time is 22.916844 seconds.
I'm +1-ing Angainor's solution and marking the question solved.
This "stale data" approach is known as a zero order hold in signal and timeseries fields. Searching for this quickly brings up many solutions. If you have Matlab 2012b, this is all built in to the timeseries
class by using the resample
function, so you would simply do
TInput = 100;
TOutput = 50;
InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
InputData = randn(TInput, 1);
InputTimeSeries = timeseries(InputData, InputTimeStamp);
OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001);
OutputTimeSeries = resample(InputTimeSeries, OutputTimeStamp, 'zoh'); % zoh stands for zero order hold
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With