Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Build fixed interval dataset from random interval dataset using stale data

Update: I've provided a brief analysis of the three answers at the bottom of the question text and explained my choices.

My Question: What is the most efficient method of building a fixed interval dataset from a random interval dataset using stale data?

Some background: The above is a common problem in statistics. Frequently, one has a sequence of observations occurring at random times. Call it Input. But one wants a sequence of observations occurring say, every 5 minutes. Call it Output. One of the most common methods to build this dataset is using stale data, i.e. set each observation in Output equal to the most recently occurring observation in Input.

So, here is some code to build example datasets:

TInput = 100;
TOutput = 50;

InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
Input = [InputTimeStamp, randn(TInput, 1)];

OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)';
Output = [OutputTimeStamp, NaN(TOutput, 1)];

Both datasets start at close to midnight at the turn of the millennium. However, the timestamps in Input occur at random intervals while the timestamps in Output occur at fixed intervals. For simplicity, I have ensured that the first observation in Input always occurs before the first observation in Output. Feel free to make this assumption in any answers.

Currently, I solve the problem like this:

sMax = size(Output, 1);
tMax = size(Input, 1);
s = 1;
t = 2;
%#Loop over input data
while t <= tMax
    if Input(t, 1) > Output(s, 1)
        %#If current obs in Input occurs after current obs in output then set current obs in output equal to previous obs in input
        Output(s, 2:end) = Input(t-1, 2:end);
        s = s + 1;
        %#Check if we've filled out all observations in output
        if s > sMax
            break
        end
        %#This step is necessary in case we need to use the same input observation twice in a row
        t = t - 1;
    end
    t = t + 1;
    if t > tMax
        %#If all remaining observations in output occur after last observation in input, then use last obs in input for all remaining obs in output 
        Output(s:end, 2:end) = Input(end, 2:end);
        break
    end
end

Surely there is a more efficient, or at least, more elegant way to solve this problem? As I mentioned, this is a common problem in statistics. Perhaps Matlab has some in-built function I'm not aware of? Any help would be much appreciated as I use this routine a LOT for some large datasets.

THE ANSWERS: Hi all, I've analyzed the three answers, and as they stand, Angainor's is the best.

ChthonicDaemon's answer, while clearly the easiest to implement, is really slow. This is true even when the conversion to a timeseries object is done outside of the speed test. I'm guessing the resample function has a lot of overhead at the moment. I am running 2011b, so it is possible Mathworks have improved it in the intervening time. Also, this method needs an additional line for the case where Output ends more than one observation after Input.

Rody's answer runs only slightly slower than Angainor's (unsurprising given they both employ the histc approach), however, it seems to have some problems. First, the method of assigning the last observation in Output is not robust to the last observation in Input occurring after the last observation in Output. This is an easy fix. But there is a second problem which I think stems from having InputTimeStamp as the first input to histc instead of the OutputTimeStamp adopted by Angainor. The problem emerges if you change OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001)'; to OutputTimeStamp = 730486.002 + (0:0.0001:TOutput * 0.0001 - 0.0001)'; when setting up the example inputs.

Angainor's appears robust to everything I threw at it, plus it was the fastest.

I did a lot of speed tests for different input specifications - the following numbers are fairly representative:

My naive loop: Elapsed time is 8.579535 seconds.

Angainor: Elapsed time is 0.661756 seconds.

Rody: Elapsed time is 0.913304 seconds.

ChthonicDaemon: Elapsed time is 22.916844 seconds.

I'm +1-ing Angainor's solution and marking the question solved.

like image 700
Colin T Bowers Avatar asked Oct 03 '12 04:10

Colin T Bowers


Video Answer


1 Answers

This "stale data" approach is known as a zero order hold in signal and timeseries fields. Searching for this quickly brings up many solutions. If you have Matlab 2012b, this is all built in to the timeseries class by using the resample function, so you would simply do

TInput = 100;
TOutput = 50;

InputTimeStamp = 730486 + cumsum(0.001 * rand(TInput, 1));
InputData = randn(TInput, 1);
InputTimeSeries = timeseries(InputData, InputTimeStamp);

OutputTimeStamp = 730486.002 + (0:0.001:TOutput * 0.001 - 0.001);
OutputTimeSeries = resample(InputTimeSeries, OutputTimeStamp, 'zoh'); % zoh stands for zero order hold
like image 133
chthonicdaemon Avatar answered Oct 10 '22 21:10

chthonicdaemon