I am building my first large-scale MATLAB program, and I've managed to write original vectorized code for everything so for until I came to trying to create an image representing vector density in stereographic projection. After a couple failed attempts I went to the Mathworks file exchange site and found an open source program which fits my needs courtesy of Malcolm Mclean. With a test matrix his function produces something like this: <img src="https://i.stack.imgur.com/abLHc.jpg" alt="Vector density in Stereographic Projection"> And while this is almost exactly what I wanted, his code relies on a triply nested for-loop. On my workstation a test data matrix of size 25000x2 took 65 seconds in this section of code. This is unacceptable since I will be scaling up to a data matrices of size 500000x2 in my project. So far I've been able to vectorize the innermost loop (which was the longest/worst loop), but I would like to continue and be rid of the loops entirely if possible. Here is Malcolm's original code that I need to vectorize: <pre class="prettyprint"><code>dmap = zeros(height, width); % height, width: scalar with default value = 32 for ii = 0: height - 1 % 32 iterations of this loop yi = limits(3) + ii * deltay + deltay/2; % limits(3) & deltay: scalars for jj = 0 : width - 1 % 32 iterations of this loop xi = limits(1) + jj * deltax + deltax/2; % limits(1) & deltax: scalars dd = 0; for kk = 1: length(x) % up to 500,000 iterations in this loop dist2 = (x(kk) - xi)^2 + (y(kk) - yi)^2; dd = dd + 1 / ( dist2 + fudge); % fudge is a scalar end dmap(ii+1,jj+1) = dd; end end </code></pre> And here it is with the changes I've already made to the innermost loop (which was the biggest drain on efficiency). This cuts the time from 65 seconds down to 12 seconds on my machine for the same test matrix, which is better but still far slower than I would like. <pre class="prettyprint"><code> dmap = zeros(height, width); for ii = 0: height - 1 yi = limits(3) + ii * deltay + deltay/2; for jj = 0 : width - 1 xi = limits(1) + jj * deltax + deltax/2; dist2 = (x - xi) .^ 2 + (y - yi) .^ 2; dmap(ii + 1, jj + 1) = sum(1 ./ (dist2 + fudge)); end end</code></pre> So my main question, are there any further changes I can make to optimize this code? Or even an alternative method to approach the problem? I've considered using C++ or F# instead of MATLAB for this section of the program, and I may do so if I cannot get to a reasonable efficiency level with the MATLAB code. Please also note that at this point I don't have ANY additional toolboxes, if I did then I know this would be trivial (using hist3 from the statistics toolbox for example).

Mem consuming solution <pre class="prettyprint"><code>yi = limits(3) + deltay * ( 1:height ) - .5 * deltay; xi = limits(1) + deltax * ( 1:width ) - .5 * deltax; dx = bsxfun( @minus, x(:), xi ) .^ 2; dy = bsxfun( @minus, y(:), yi ) .^ 2; dist2 = bsxfun( @plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) ); dmap = sum( 1./(dist2 + fudge ) , 3 ); </code></pre> <h3>EDIT</h3> handling extremely large <code>x</code> and <code>y</code> by breaking the operation into blocks: <pre class="prettyprint"><code>blockSize = 50000; % process up to XX elements at once dmap = 0; yi = limits(3) + deltay * ( 1:height ) - .5 * deltay; xi = limits(1) + deltax * ( 1:width ) - .5 * deltax; bi = 1; while bi <= numel(x) % take a block of x and y bx = x( bi:min(end, bi + blockSize - 1) ); by = y( bi:min(end, bi + blockSize - 1) ); dx = bsxfun( @minus, bx(:), xi ) .^ 2; dy = bsxfun( @minus, by(:), yi ) .^ 2; dist2 = bsxfun( @plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) ); dmap = dmap + sum( 1./(dist2 + fudge ) , 3 ); bi = bi + blockSize; end </code></pre>

vectorize/optimize this code in MATLAB?

Tags:

performance

vectorization

image-processing

matlab

I am building my first large-scale MATLAB program, and I've managed to write original vectorized code for everything so for until I came to trying to create an image representing vector density in stereographic projection. After a couple failed attempts I went to the Mathworks file exchange site and found an open source program which fits my needs courtesy of Malcolm Mclean. With a test matrix his function produces something like this:

Vector density in Stereographic Projection

And while this is almost exactly what I wanted, his code relies on a triply nested for-loop. On my workstation a test data matrix of size 25000x2 took 65 seconds in this section of code. This is unacceptable since I will be scaling up to a data matrices of size 500000x2 in my project.

So far I've been able to vectorize the innermost loop (which was the longest/worst loop), but I would like to continue and be rid of the loops entirely if possible. Here is Malcolm's original code that I need to vectorize:

dmap = zeros(height, width); % height, width: scalar with default value = 32
for ii = 0: height - 1          % 32 iterations of this loop
    yi = limits(3) + ii * deltay + deltay/2; % limits(3) & deltay: scalars
    for jj = 0 : width - 1      % 32 iterations of this loop
        xi = limits(1) + jj * deltax + deltax/2; % limits(1) & deltax: scalars
        dd = 0;
        for kk = 1: length(x)   % up to 500,000 iterations in this loop
            dist2 = (x(kk) - xi)^2 + (y(kk) - yi)^2;
            dd = dd + 1 / ( dist2 + fudge); % fudge is a scalar
        end
        dmap(ii+1,jj+1) = dd;
    end
end

And here it is with the changes I've already made to the innermost loop (which was the biggest drain on efficiency). This cuts the time from 65 seconds down to 12 seconds on my machine for the same test matrix, which is better but still far slower than I would like.

     dmap = zeros(height, width);
    for ii = 0: height - 1
        yi = limits(3) + ii * deltay + deltay/2;
        for jj = 0 : width - 1
            xi = limits(1) + jj * deltax + deltax/2;
            dist2 = (x - xi) .^ 2 + (y - yi) .^ 2;
            dmap(ii + 1, jj + 1) = sum(1 ./ (dist2 + fudge));
        end
    end

So my main question, are there any further changes I can make to optimize this code? Or even an alternative method to approach the problem? I've considered using C++ or F# instead of MATLAB for this section of the program, and I may do so if I cannot get to a reasonable efficiency level with the MATLAB code.

Please also note that at this point I don't have ANY additional toolboxes, if I did then I know this would be trivial (using hist3 from the statistics toolbox for example).

284

asked May 07 '13 08:05

user2307487

2 Answers

Mem consuming solution

yi = limits(3) + deltay * ( 1:height ) - .5 * deltay;
xi = limits(1) + deltax * ( 1:width  ) - .5 * deltax;
dx = bsxfun( @minus, x(:), xi ) .^ 2;
dy = bsxfun( @minus, y(:), yi ) .^ 2;
dist2 = bsxfun( @plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) );
dmap = sum( 1./(dist2 + fudge ) , 3 );

EDIT

handling extremely large x and y by breaking the operation into blocks:

blockSize = 50000; % process up to XX elements at once
dmap = 0;
yi = limits(3) + deltay * ( 1:height ) - .5 * deltay;
xi = limits(1) + deltax * ( 1:width  ) - .5 * deltax;
bi = 1;
while bi <= numel(x)
    % take a block of x and y
    bx = x( bi:min(end, bi + blockSize - 1) );
    by = y( bi:min(end, bi + blockSize - 1) );
    dx = bsxfun( @minus, bx(:), xi ) .^ 2;
    dy = bsxfun( @minus, by(:), yi ) .^ 2;
    dist2 = bsxfun( @plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) );
    dmap = dmap + sum( 1./(dist2 + fudge ) , 3 );
    bi = bi + blockSize;
end

123

answered Oct 04 '22 12:10

Shai

This is a good example of why starting a loop from 1 matters. The only reason that ii and jj are initiated at 0 is to kill the ii * deltay and jj * deltax terms which however introduces sequentiality in the dmap indexing, preventing parallelization.

Now, by rewriting the loops you could use parfor() after opening a matlabpool:

dmap = zeros(height, width);
yi   = limits(3) + deltay*(1:height) - .5*deltay;
matlabpool 8
parfor ii = 1: height
    for jj = 1: width
        xi    = limits(1) + (jj-1) * deltax + deltax/2;
        dist2 = (x - xi) .^ 2 + (y - yi(ii)) .^ 2;
        dmap(ii, jj) = sum(1 ./ (dist2 + fudge));
    end
end
matlabpool close

Keep in mind that opening and closing the pool has significant overhead (10 seconds on my Intel Core Duo T9300, vista 32 Matlab 2013a).

PS. I am not sure whether the inner loop instead of the outer one can be meaningfully parallelized. You can try to switch the parfor to the inner one and compare speeds (I would recommend going for the big matrix immediately since you are already running in 12 seconds and the overhead is almost as big).

answered Oct 04 '22 13:10

Oleg

Related questions
                            
                                Rounded UIScrollView Performance
                            
                                Is there any difference regarding performance of private, protected, public and internal methods in C# classes?
                            
                                Why does MySQL's performance decrease when queries are executed in parallel?
                            
                                Android: Understanding OnDrawFrame, FPS and VSync (OpenGL ES 2.0)
                            
                                How does HasCallStack influence the performance of a normal branch in Haskell?
                            
                                Is there any way of throttling CPU/Memory of a process?
                            
                                ASP.NET websites under IIS 7.5 (Windows 7) running extremely slow
                            
                                AJAX Polling Frequency - To long poll or not to long poll?
                            
                                Improving ASP.NET MVC startup performance
                            
                                How to update Android ListView with dynamic data in real time?
                            
                                WPF slow performance - many DataItem=null binding warnings
                            
                                Java Sound dramatically slower after JVM 8 update
                            
                                Visual Studio 2015 slow
                            
                                Need to expand an inventory journal (log) pandas dataframe to include all dates per product id
                            
                                In PHP, how many DB calls per page is okay?
                            
                                lockCanvas() really slow
                            
                                Designing high-performance State Machine in Java
                            
                                rake assets:precompile taking extremely long to complete
                            
                                When does a slow MySQL query on a given connection affect other connections?
                            
                                Faster alternative to Python's zipfile module?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With