Time series aggregation efficiency

Tags:

matlab

time-series

I often need to summarize an irregularly timed time series with a given aggregation function (e.g., sum, average). However, my current solution seems inefficient and slow.

Take the aggregation function:

function aggArray = aggregate(array, groupIndex, collapseFn)

groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));

for iGr = 1:size(groups,1)
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    for iSer = 1:size(array, 2)
      aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
    end
end

end

Note that both array and groupIndex can be 2D. Every column in array is an independent series to be aggregated, but the columns of groupIndex should be taken together (as a row) to specify a period.
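To make the grouping rule concrete, here is a toy sketch of the same loop applied to a 2D groupIndex. The input values (two series, four observations, [year week] labels) are made up for illustration and are not part of the original test data.

```matlab
% Toy input (assumed values): two series in columns, four observations.
array      = [1 10; 2 20; 3 30; 4 40];
% Each ROW of groupIndex labels a period: [year week].
groupIndex = [2015 1; 2015 1; 2015 2; 2016 1];

groups   = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));
for iGr = 1:size(groups, 1)
    % Rows whose full [year week] label matches the current period.
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    for iSer = 1:size(array, 2)
        aggArray(iGr, iSer) = sum(array(grIdx, iSer));
    end
end
disp(aggArray)   % one row per period, one column per series
```

Rows 1 and 2 share the label [2015 1], so they are summed together; the other two rows each form their own period.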

Then when we bring an irregular time series to it (note the second period is one base period longer), the timing results are poor:

a = rand(20006,10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

tic; aggregate(a, b, @sum); toc
Elapsed time is 1.370001 seconds.

Using the profiler, we can see that the grIdx line takes about 1/4 of the execution time (0.28 s) and the iSer loop takes about 3/4 (1.17 s) of the total (1.48 s).

Compare this with the period-indifferent case:

tic; cumsum(a); toc
Elapsed time is 0.000930 seconds.

Is there a more efficient way to aggregate this data?

Timing Results

Taking each response and putting it in a separate function, here are the timing results (in seconds) I get with timeit with MATLAB 2015b on Windows 7 with an Intel i7:

    original | 1.32451
      felix1 | 0.35446
      felix2 | 0.16432
    divakar1 | 0.41905
    divakar2 | 0.30509
    divakar3 | 0.16738
matthewGunn1 | 0.02678
matthewGunn2 | 0.01977
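For anyone wanting to reproduce measurements of this kind, a minimal sketch of a timeit call follows. The two timed operations (cumsum and unique) are stand-ins chosen so the snippet runs on its own; they are not the functions from the table above, and the variable names tBaseline and tUnique are only illustrative.

```matlab
% Test data from the question.
a = rand(20006, 10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

% timeit expects a zero-argument function handle and returns a run time
% in seconds, based on several repeated calls.
tBaseline = timeit(@() cumsum(a));
tUnique   = timeit(@() unique(b, 'rows'));

fprintf('cumsum: %.5f s, unique: %.5f s\n', tBaseline, tUnique);
```

Because timeit repeats the call, it tends to be less noisy than a single tic/toc pair for operations this short.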

Clarification on `groupIndex`

An example of a 2D groupIndex would be where both the year number and week number are specified for a set of daily data covering 1980-2015:

a2 = rand(36*52*5, 10);
b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];

Thus a "year-week" period is uniquely identified by a row of groupIndex. This is effectively handled by calling unique(groupIndex, 'rows') and taking the third output, so feel free to disregard this portion of the question.
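As a small sketch of what that third output looks like, the snippet below reduces a toy [year week] index to one integer label per row. The data are made up, and the accumarray call is only an illustration of what the labels can be used for; it is not one of the methods timed above.

```matlab
% Toy 2D groupIndex (assumed values): [year week] for five observations.
groupIndex = [1980 1; 1980 1; 1980 2; 1981 1; 1981 1];
values     = [1; 2; 3; 4; 5];

% groups holds the distinct rows; idx maps every row of groupIndex
% to its row in groups, i.e. a 1D group label.
[groups, ~, idx] = unique(groupIndex, 'rows');

% Illustration only: sum the values that share a label.
periodSums = accumarray(idx, values);
```

Once every row has a scalar label, a 2D groupIndex can be treated exactly like a 1D one.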


asked Nov 10 '15 17:11

David Kelley

2 Answers

Method #1

You can create the masks corresponding to grIdx for all groups in one go with bsxfun(@eq,..). Then, for collapseFn as @sum, you can use matrix multiplication and thus get a completely vectorized approach, like so -

M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
aggArray = M.'*array;

For collapseFn as @mean, you need to do a bit more work, as shown here -

M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
aggArray = bsxfun(@rdivide,M,sum(M,1)).'*array;
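A quick way to convince yourself that the mask-and-multiply idea works is to run it on a tiny input. The sketch below uses made-up data, defines groups explicitly with unique(groupIndex, 'rows') as in the question, and casts the mask with double() purely to make the arithmetic explicit.

```matlab
% Toy input (assumed values): five observations, two series, two periods.
array      = [1 10; 2 20; 3 30; 4 40; 5 50];
groupIndex = [2015 1; 2015 1; 2015 2; 2015 2; 2015 2];
groups     = unique(groupIndex, 'rows');

% M(k, g) is true when row k of groupIndex equals row g of groups.
M = squeeze(all(bsxfun(@eq, groupIndex, permute(groups, [3 2 1])), 2));

% Sum: each output row adds up the rows of array selected by one column of M.
sumAgg  = double(M).' * array;

% Mean: scale each column of M by its group size before multiplying.
meanAgg = bsxfun(@rdivide, double(M), sum(M, 1)).' * array;
```

Here the first period holds two rows and the second holds three, so the sums are [3 30; 12 120] and the means are [1.5 15; 4 40] up to rounding.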

Method #2

If you are working with a generic collapseFn, you can use the 2D mask M created with the previous method to index into the rows of array and apply collapseFn to all series of a group at once, which removes the inner loop over columns. Some quick tests suggest this gives an appreciable speedup over the original loopy code. Here's the implementation -

n = size(groups,1);
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
out = zeros(n,size(array,2));
for iGr = 1:n
    out(iGr,:) = collapseFn(array(M(:,iGr),:),1);
end

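Please note that the 1 in collapseFn(array(M(:,iGr),:),1) denotes the dimension along which collapseFn is applied, so that 1 is essential there. Without it, a function such as sum would collapse a group that contains a single row across its columns instead of down them.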
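The effect of the dimension argument is easiest to see on a group with a single observation. The sketch below also shows one possible way to give other functions the same (x, dim) calling convention; the wrapper names maxFn and rangeFn are hypothetical and chosen only for this example.

```matlab
% A group that happens to contain just one observation of three series.
singleRow = [1 2 3];

noDim   = sum(singleRow);      % no dim: sums along the row, giving a scalar
withDim = sum(singleRow, 1);   % dim = 1: keeps one value per series

% Hypothetical wrappers so other functions accept (x, dim) like sum does.
maxFn   = @(x, dim) max(x, [], dim);
rangeFn = @(x, dim) max(x, [], dim) - min(x, [], dim);

block    = [1 5; 4 2];
colMax   = maxFn(block, 1);    % column-wise maximum
colRange = rangeFn(block, 1);  % column-wise max minus min
```

Either wrapper could then be passed as collapseFn in the loop above, since it is called with the data and the dimension.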

Bonus

Judging by its name, groupIndex seems to hold integer values. This could be exploited for a more efficient creation of M: treat each row of groupIndex as an indexing tuple, convert each row into a scalar, and thus get a 1D version of groupIndex. This should be more efficient, as the data size would be O(n) now. This M could be fed to all the approaches listed in this post. So, we would get M like so -

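dims = max(groupIndex,[],1);                          % largest value in each column
agg_dims = cumprod([1 dims(end:-1:2)]);               % place values, last column first
[~,~,idx] = unique(groupIndex*agg_dims(end:-1:1).');  % scalar key per row -> group label

m = size(groupIndex,1);
M = false(m,max(idx));
M((idx-1)*m + (1:m)') = true;                         % mark row k in column idx(k)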
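To see the row-to-scalar conversion at work, here is the same code traced on a tiny [year week] index. The input values are made up, and the intermediate variable keys is introduced only to expose the scalar that each row is mapped to; the sketch assumes positive integer entries.

```matlab
% Toy index (assumed values): four observations, [year week] labels.
groupIndex = [1980 1; 1980 1; 1980 2; 1981 1];

dims     = max(groupIndex, [], 1);              % [1981 2]
agg_dims = cumprod([1 dims(end:-1:2)]);         % [1 2]
keys     = groupIndex * agg_dims(end:-1:1).';   % year*2 + week, one scalar per row
[~, ~, idx] = unique(keys);                     % 1D group label per row

m = size(groupIndex, 1);
M = false(m, max(idx));
M((idx-1)*m + (1:m)') = true;                   % linear index of (k, idx(k))
```

The first two rows share the key 3961 and therefore the same label, so M has three columns and its first column marks rows 1 and 2.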


answered Oct 13 '22 01:10

Divakar

Mex Function 1

HAMMER TIME: Mex function to crush it: The base case test with original code from the question took 1.334139 seconds on my machine. IMHO, the 2nd fastest answer from @Divakar is:

groups2 = unique(groupIndex);
aggArray2 = squeeze(all(bsxfun(@eq,groupIndex,permute(groups2,[3 2 1])),2)).'*array;

Elapsed time is 0.589330 seconds.

Then my MEX function:

[groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));

Elapsed time is 0.079725 seconds.

Testing that we get the same answer: norm(groups2-groups3) returns 0 and norm(aggArray2 - aggArray3) returns 2.3959e-15. Results also match original code.

Code to generate the test conditions:

array = rand(20006,10);
groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);

For pure speed, go mex. If the thought of compiling C++ code and the added complexity is too much of a pain, go with Divakar's answer. Another disclaimer: I haven't subjected my function to robust testing.

Mex Approach 2

Somewhat surprisingly to me, this code appears even faster than the full Mex version in some cases (e.g., in this test it took about 0.05 seconds). It uses a mex function mg_getRowsWithKey to figure out the indices of groups. I think it may be because my array copying in the full mex function isn't as fast as it could be and/or because of overhead from calling 'feval'. It has basically the same algorithmic complexity as the other version.

[unique_groups, map] = mg_getRowsWithKey(groupIndex);

results = zeros(length(unique_groups), size(array,2));

for iGr = 1:length(unique_groups)
   array_subset             = array(map{iGr},:);

   %// do your collapse function on array_subset. eg.
   results(iGr,:)           = sum(array_subset, 1);
end

When you do array(groups(1)==groupIndex,:) to pull out array entries associated with the full group, you're searching through the ENTIRE length of groupIndex. If you have millions of row entries, this will totally suck. array(map{1},:) is far more efficient.
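For readers without a compiler, the idea of a per-group list of row indices can be imitated in plain MATLAB. The sketch below is a hypothetical stand-in built from unique and accumarray; it is not the mg_getRowsWithKey mex function, it makes no claim about matching its speed, and the toy data are made up.

```matlab
% Toy input (assumed values): six observations, two series, three groups.
groupIndex = [1; 1; 2; 2; 2; 3];
array      = [1 10; 2 20; 3 30; 4 40; 5 50; 6 60];

% Label every row, then collect the row numbers of each group in a cell.
[unique_groups, ~, idx] = unique(groupIndex, 'rows');
m   = size(groupIndex, 1);
map = accumarray(idx, (1:m).', [], @(r) {sort(r)});

% Same loop shape as above: index with the stored row numbers, not a mask.
results = zeros(size(unique_groups, 1), size(array, 2));
for iGr = 1:size(unique_groups, 1)
    results(iGr, :) = sum(array(map{iGr}, :), 1);
end
```

Each map{iGr} holds only the rows of one group, so pulling out a group no longer requires comparing against the whole of groupIndex.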

There's still unnecessary copying of memory and other overhead associated with calling 'feval' on the collapse function. If you implement the aggregator function efficiently in C++ in such a way as to avoid copying memory, another 2x speedup can probably be achieved.

answered Oct 13 '22 02:10

Matthew Gunn
