I commonly need to summarize a time series with irregular timing with a given aggregation function (i.e., sum, average, etc.). However, the current solution that I have seems inefficient and slow.
Take the aggregation function:
function aggArray = aggregate(array, groupIndex, collapseFn)
groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));
for iGr = 1:size(groups,1)
grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
for iSer = 1:size(array, 2)
aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
end
end
end
Note that both array
and groupIndex
can be 2D. Every column in array
is an independent series to be aggregated, but the columns of groupIndex
should be taken together (as a row) to specify a period.
Then when we bring an irregular time series to it (note the second period is one base period longer), the timing results are poor:
a = rand(20006,10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
tic; aggregate(a, b, @sum); toc
Elapsed time is 1.370001 seconds.
Using the profiler, we can find out that the grpIdx
line takes about 1/4 of the execution time (.28 s) and the iSer
loop takes about 3/4 (1.17 s) of the total (1.48 s).
Compare this with the period-indifferent case:
tic; cumsum(a); toc
Elapsed time is 0.000930 seconds.
Is there a more efficient way to aggregate this data?
Taking each response and putting it in a separate function, here are the timing results I get with timeit
with Matlab 2015b on Windows 7 with an Intel i7:
original | 1.32451
felix1 | 0.35446
felix2 | 0.16432
divakar1 | 0.41905
divakar2 | 0.30509
divakar3 | 0.16738
matthewGunn1 | 0.02678
matthewGunn2 | 0.01977
groupIndex
An example of a 2D groupIndex
would be where both the year number and week number are specified for a set of daily data covering 1980-2015:
a2 = rand(36*52*5, 10);
b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];
Thus a "year-week" period are uniquely identified by a row of groupIndex
. This is effectively handled through calling unique(groupIndex, 'rows')
and taking the third output, so feel free to disregard this portion of the question.
Time aggregation is the aggregation of all data points for a single resource over a specified period (the granularity). Data aggregations in Resource Time Series reports are of the time aggregation type.
If a dimension does not explicitly have a hierarchy, then level 0 is no aggregation, and level 1 is “all.” The scales so defined along each dimension define a coordinate system for uniquely identifying each view in a product graph.
Why is Data Aggregation Important? Data collected by enterprises is vital to help them make better decisions, understand consumer behavior, improve process efficiency and finally, understand performance, be it of the company or its products. To this end, simply collecting high quality and reliable data is not enough.
Aggregation platforms take care of the collection, processing, and sometimes even the presentation of data. It's an essential part of data integration. Data aggregation helps summarize data from different, disparate and multiple sources. It increases the value of information.
Method #1
You can create the mask corresponding to grIdx
across all
groups
in one go with bsxfun(@eq,..)
. Now, for collapseFn
as @sum
, you can bring in matrix-multiplication
and thus have a completely vectorized approach, like so -
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = M.'*array
For collapseFn
as @mean
, you need to do a bit more work, as shown here -
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = bsxfun(@rdivide,M,sum(M,1)).'*array
Method #2
In case you are working with a generic collapseFn
, you can use the 2D mask M
created with the previous method to index into the rows of array
, thus changing the complexity from O(n^2)
to O(n)
. Some quick tests suggest this to give appreciable speedup over the original loopy code. Here's the implementation -
n = size(groups,1);
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
out = zeros(n,size(array,2));
for iGr = 1:n
out(iGr,:) = collapseFn(array(M(:,iGr),:),1);
end
Please note that the 1
in collapseFn(array(M(:,iGr),:),1)
denotes the dimension along which collapseFn
would be applied, so that 1
is essential there.
Bonus
By its name groupIndex
seems like would hold integer values, which could be abused to have a more efficient M
creation by considering each row of groupIndex
as an indexing tuple and thus converting each row of groupIndex
into a scalar and finally get a 1D array version of groupIndex
. This must be more efficient as the datasize would be 0(n)
now. This M
could be fed to all the approaches listed in this post. So, we would have M
like so -
dims = max(groupIndex,[],1);
agg_dims = cumprod([1 dims(end:-1:2)]);
[~,~,idx] = unique(groupIndex*agg_dims(end:-1:1).'); %//'
m = size(groupIndex,1);
M = false(m,max(idx));
M((idx-1)*m + [1:m]') = 1;
HAMMER TIME: Mex function to crush it: The base case test with original code from the question took 1.334139 seconds on my machine. IMHO, the 2nd fastest answer from @Divakar is:
groups2 = unique(groupIndex);
aggArray2 = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2)).'*array;
Elapsed time is 0.589330 seconds.
Then my MEX function:
[groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));
Elapsed time is 0.079725 seconds.
Testing that we get the same answer: norm(groups2-groups3)
returns 0
and norm(aggArray2 - aggArray3)
returns 2.3959e-15
. Results also match original code.
Code to generate the test conditions:
array = rand(20006,10);
groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
For pure speed, go mex. If the thought of compiling c++ code / complexity is too much of a pain, go with Divakar's answer. Another disclaimer: I haven't subject my function to robust testing.
Somewhat surprising to me, this code appears even faster than the full Mex version in some cases (eg. in this test took about .05 seconds). It uses a mex function mg_getRowsWithKey to figure out the indices of groups. I think it may be because my array copying in the full mex function isn't as fast as it could be and/or overhead from calling 'feval'. It's basically the same algorithmic complexity as the other version.
[unique_groups, map] = mg_getRowsWithKey(groupIndex);
results = zeros(length(unique_groups), size(array,2));
for iGr = 1:length(unique_groups)
array_subset = array(map{iGr},:);
%// do your collapse function on array_subset. eg.
results(iGr,:) = sum(array_subset, 1);
end
When you do array(groups(1)==groupIndex,:)
to pull out array entries associated with the full group, you're searching through the ENTIRE length of groupIndex. If you have millions of row entries, this will totally suck. array(map{1},:)
is far more efficient.
There's still unnecessary copying of memory and other overhead associated with calling 'feval' on the collapse function. If you implement the aggregator function efficiently in c++ in such a way to avoid copying of memory, probably another 2x speedup can be achieved.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With