Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I vectorize code that runs a function on subsets of a larger matrix?

Let's assume I have the following 9 x 5 matrix:

myArray = [
   54.7    8.1   81.7   55.0   22.5
   29.6   92.9   79.4   62.2   17.0
   74.4   77.5   64.4   58.7   22.7
   18.8   48.6   37.8   20.7   43.5
   68.6   43.5   81.1   30.1   31.1
   18.3   44.6   53.2   47.0   92.3
   36.8   30.6   35.0   23.0   43.0
   62.5   50.8   93.9   84.4   18.4
   78.0   51.0   87.5   19.4   90.4
];

I have 11 "subsets" of this matrix and I need to run a function (let's say max) on each of these subsets. The subsets can be identified with the following matirx of logicals (identified column-wise, not row-wise):

myLogicals = logical([
    0 1 0 1 1
    1 1 0 1 1
    1 1 0 0 0
    0 1 0 1 1
    1 0 1 1 1
    1 1 1 1 0
    0 1 1 0 1
    1 1 0 0 1
    1 1 0 0 1
]);

or via linear indexing:

starts = [2 5 8 10 15 23 28 31 37 40 43]; #%index start of each subset
ends =   [3 6 9 13 18 25 29 33 38 41 45]; #%index end of each subset

such that the first subset is 2:3, the second is 5:6, and so on.

I can find the max of each subset and store it in a vector as follows:

finalAnswers = NaN(11,1); 
for n=1:length(starts) #%i.e. 1 through the number of subsets
    finalAnswers(n) = max(myArray(starts(n):ends(n)));
end

After the loop runs, finalAnswers contains the maximum value of each of the data subsets:

74.4  68.6  78.0  92.9  51.0  81.1  62.2  47.0  22.5  43.5  90.4

Is it possible to obtain the same result without the use of a for loop? In other words, can this code be vectorized? Would such an approach be more efficient than the current one?


EDIT: I did some testing of the proposed solutions. The data I used was a 1,510 x 2,185 matrix with 10,103 subsets that varied in length from 2 to 916 with a standard deviation of subset length of 101.92.

I wrapped each solution in tic;for k=1:1000 [code here] end; toc; and here are the results:

  • for loop approach --- Elapsed time is 16.237400 seconds.
  • Shai's approach --- Elapsed time is 153.707076 seconds.
  • Dan's approach --- Elapsed time is 44.774121 seconds.
  • Divakar's approach #2 --- Elapsed time is 127.621515 seconds.

Notes:

  • I also tried benchmarking Dan's approach by wrapping the k=1:1000 for loop around just the accumarray line (since the rest could be theoretically run just once). In this case the time was 28.29 seconds.
  • Benchmarking Shai's approach, while leaving the lb = ... line out of the k loop, the time was 113.48 seconds.
  • When I ran Divakar's code, I got Non-singleton dimensions of the two input arrays must match each other. errors for the bsxfun lines. I "fixed" this by using conjugate transposition (the apostrophe operator ') on trade_starts(1:starts_extent) and intv(1:starts_extent) in the lines of code calling bsxfun. I'm not sure why this error was occuring...

I'm not sure if my benchmarking setup is correct, but it appears that the for loop actually runs the fastest in this case.

like image 792
Alarik Avatar asked Sep 29 '14 13:09

Alarik


2 Answers

One approach is to use accumarray. Unfortunately in order to do that we first need to "label" your logical matrix. Here is a convoluted way of doing that if you don't have the image processing toolbox:

sz=size(myLogicals);
s_ind(sz(1),sz(2))=0;
%// OR: s_ind = zeros(size(myLogicals))

s_ind(starts) = 1;
labelled = cumsum(s_ind(:)).*myLogicals(:);        

So that just does what Shai's bwlabeln implementation does (but this will be 1-by-numel(myLogicals) in shape as opposed to size(myLogicals) in shape)

Now you can use accumarray:

accumarray(labelled(myLogicals), myArray(myLogicals), [], @max)

or else it may be faster to try

result = accumarray(labelled+1, myArray(:), [], @max);
result = result(2:end)

This is fully vectorized, but is it worth it? You'll have to do speed tests against your loop solution to know.

like image 113
Dan Avatar answered Oct 29 '22 17:10

Dan


Use bwlabeln with a vertical connectivity:

lb = bwlabeln( myLogicals, [0 1 0; 0 1 0; 0 1 0] );

Now you have a label 1..11 for each region.

To get max value you can use regionprops

props = regionprops( lb, myArray, 'MaxIntensity' );
finalAnswers = [props.MaxIntensity];

You can use regionprops to get some other properties of each subset, but it is not too general.
If you wish to apply a more general function to each region, e.g., median, you can use accumarray:

finalAnswer = accumarray( lb( myLogicals ), myArray( myLogicals ), [], @median );
like image 35
Shai Avatar answered Oct 29 '22 16:10

Shai