I need to calculate <code>f(x)=exp(A*x)</code> repeatedly for a tiny, variable column vector <code>x</code> and a huge, constant matrix <code>A</code> (many rows, few columns). In other words, the <code>x</code> are few, but the <code>A*x</code> are many. My problem dimensions are such that <code>A*x</code> takes about as much runtime as the exp() part. Apart from Taylor expansion and pre-calculating a range of values <code>exp(y)</code> (assuming known the range <code>y</code> of values of <code>A*x</code>), which I haven't managed to speed up considerably (while maintaining accuracy) with respect to what MATLAB is doing on its own, I am thinking about analytically restating the problem in order to be able to precalculate some values. For example, I find that <code>exp(A*x)_i = exp(\sum_j A_ij x_j) = \prod_j exp(A_ij x_j) = \prod_j exp(A_ij)^x_j</code> This would allow me to precalculate <code>exp(A)</code> once, but the required exponentiation in the loop is as costly as the original <code>exp()</code> function call, and the multiplications (\prod) have to be carried out in addition. Is there any other idea that I could follow, or solutions within MATLAB that I may have missed? Edit: some more details <code>A</code> is 26873856 by 81 in size (yes, it's that huge), so <code>x</code> is 81 by 1. <code>nnz(A) / numel(A)</code> is <code>0.0012</code>, <code>nnz(A*x) / numel(A*x)</code> is <code>0.0075</code>. I already use a sparse matrix to represent <code>A</code>, however, exp() of a sparse matrix is not sparse any longer. So in fact, I store <code>x</code> non-sparse and I calculate <code>exp(full(A*x))</code> which turned out to be as fast/slow as <code>full(exp(A*x))</code> (I think <code>A*x</code> is non-sparse anyway, since x is non-sparse.) <code>exp(full(A*sparse(x)))</code> is a way to have a sparse <code>A*x</code>, but is slower. Even slower variants are <code>exp(A*sparse(x))</code> (with doubled memory impact for a non-sparse matrix of type sparse) and <code>full(exp(A*sparse(x))</code> (which again yields a non-sparse result). <pre class="prettyprint"><code>sx = sparse(x); tic, for i = 1 : 10, exp(full(A*x)); end, toc tic, for i = 1 : 10, full(exp(A*x)); end, toc tic, for i = 1 : 10, exp(full(A*sx)); end, toc tic, for i = 1 : 10, exp(A*sx); end, toc tic, for i = 1 : 10, full(exp(A*sx)); end, toc Elapsed time is 1.485935 seconds. Elapsed time is 1.511304 seconds. Elapsed time is 2.060104 seconds. Elapsed time is 3.194711 seconds. Elapsed time is 4.534749 seconds. </code></pre> Yes, I do calculate element-wise exp, I update the above equation to reflect that. One more edit: I tried to be smart, with little success: <pre class="prettyprint"><code>tic, for i = 1 : 10, B = exp(A*x); end, toc tic, for i = 1 : 10, C = 1 + full(spfun(@(x) exp(x) - 1, A * sx)); end, toc tic, for i = 1 : 10, D = 1 + full(spfun(@(x) exp(x) - 1, A * x)); end, toc tic, for i = 1 : 10, E = 1 + full(spfun(@(x) exp(x) - 1, sparse(A * x))); end, toc tic, for i = 1 : 10, F = 1 + spfun(@(x) exp(x) - 1, A * sx); end, toc tic, for i = 1 : 10, G = 1 + spfun(@(x) exp(x) - 1, A * x); end, toc tic, for i = 1 : 10, H = 1 + spfun(@(x) exp(x) - 1, sparse(A * x)); end, toc Elapsed time is 1.490776 seconds. Elapsed time is 2.031305 seconds. Elapsed time is 2.743365 seconds. Elapsed time is 2.818630 seconds. Elapsed time is 2.176082 seconds. Elapsed time is 2.779800 seconds. Elapsed time is 2.900107 seconds. </code></pre>

Computers don't really do exponents. You would think they do, but what they do is high-accuracy polynomial approximations. References: <ul> <li>http://www.math.vanderbilt.edu/~esaff/texts/13.pdf</li> <li>http://deepblue.lib.umich.edu/bitstream/handle/2027.42/33109/0000495.pdf</li> <li>http://www.cs.yale.edu/homes/sachdeva/pubs/fast-algos-via-approx-theory.pdf</li> </ul> The last reference looked quite nice. Perhaps it should have been first. Since you are working on images, you likely have discrete number of intensity levels (255 typically). This can allow reduced sampling, or lookups, depending on the nature of "A". One way to check this is to do something like the following for a sufficiently representative group of values of "x": <pre class="prettyprint"><code>y=Ax cdfplot(y(:)) </code></pre> If you were able to pre-segment your images into "more interesting" and "not as interesting" - like if you were looking at an x-ray being able to trim out all the "outside the human body" locations and clamp them to zero to pre-sparsify your data, that could reduce your number of unique values. You might consider the previous for each unique "mode" inside the data. My approaches would include: <ul> <li>look at alternate formulations of exp(x) that are lower accuracy but higher speed</li> <li>consider table lookups if you have few enough levels of "x"</li> <li>consider a combination of interpolation and table lookups if you have "slightly too many" levels to do a table lookup</li> <li>consider a single lookup (or alternate formulation) based on segmented mode. If you know it is a bone and are looking for a vein, then maybe it should get less high-cost data processing applied.</li> </ul> Now I have to ask myself why would you be living in so many iterations of exp(A*x)*x and I think you might be switching back and forth between frequency/wavenumber domain and time/space domain. You also might be dealing with probabilities using exp(x) as a basis, and doing some Bayesian fun. I don't know that exp(x) is a good conjugate prior, so I'm going to go with the fourier material. Other options: - consider use of fft, fft2, or fftn given your matrices - they are fast and might do part of what you are looking for. I am sure there is a forier domain variation on the following: <ul> <li>https://mathoverflow.net/questions/34173/fast-matrix-multiplication</li> <li>http://www-cc.cs.uni-saarland.de/media/oldmaterial/bc.pdf</li> <li>http://arxiv.org/PS_cache/math/pdf/0511/0511460v1.pdf</li> </ul> You might be able to mix the lookup with a compute using the woodbury matrix. I would have to think about that some to be sure though. (link) At one point I knew that everything that mattered (CFD, FEA, FFT) were all about the matrix inversion, but I have since forgotten the particular details. Now, if you are living in MatLab then you might consider using "coder" which converts MatLab code to c-code. No matter how much fun an interpreter may be, a good c-compiler can be a lot faster. The mnemonic (hopefully not too ambitious) that I use is shown here: link starting around 13:49. It is really simple, but it shows the difference between a canonical interpreted language (python) and compiled version of the same (cython/c). I'm sure that if I had some more specifics, and was requested to, then I could engage more aggressively in a more specifically relevant answer. You might not have a good way to do it on conventional hardware, buy you might consider something like a GPGPU. CUDA and its peers have massively parallel operations that allow substantial speedup for the cost of a few video cards. You can have thousands of "cores" (overglorified pipelines) doing the work of a few ALU's and if the job is properly parallelizable (as this looks like) then it can get done a LOT faster. EDIT: I was thinking about Eureqa. One option that I would consider if I had some "big iron" for development but not production would be to use their Eureqa product to come up with a fast enough, accurate enough approximation. If you performed a 'quick' singular value decomposition of your "A" matrix, you would find that the dominant performance is governed by 81 eigenvectors. I would look at the eigenvalues and see if there were only a few of those 81 eigenvectors providing the majority of the information. If that was the case, then you can clamp the others to zero, and construct a simple transformation. Now, if it were me, I would want to get "A" out of the exponent. I'm wondering if you can look at the 81x81 eigenvector matrix and "x" and think a little about linear algebra, and what space you are projecting your vectors into. Is there any way that you can make a function that looks like the following: f(x) = B2 * exp( B1 * x ) such that the B1 * x is much smaller rank than your current Ax.

Analytical way of speeding up exp(A*x) in MATLAB

Tags:

performance

matlab

exp

simplify

I need to calculate f(x)=exp(A*x) repeatedly for a tiny, variable column vector x and a huge, constant matrix A (many rows, few columns). In other words, the x are few, but the A*x are many. My problem dimensions are such that A*x takes about as much runtime as the exp() part.

Apart from Taylor expansion and pre-calculating a range of values exp(y) (assuming known the range y of values of A*x), which I haven't managed to speed up considerably (while maintaining accuracy) with respect to what MATLAB is doing on its own, I am thinking about analytically restating the problem in order to be able to precalculate some values.

For example, I find that exp(A*x)_i = exp(\sum_j A_ij x_j) = \prod_j exp(A_ij x_j) = \prod_j exp(A_ij)^x_j

This would allow me to precalculate exp(A) once, but the required exponentiation in the loop is as costly as the original exp() function call, and the multiplications (\prod) have to be carried out in addition.

Is there any other idea that I could follow, or solutions within MATLAB that I may have missed?

Edit: some more details

A is 26873856 by 81 in size (yes, it's that huge), so x is 81 by 1. nnz(A) / numel(A) is 0.0012, nnz(A*x) / numel(A*x) is 0.0075. I already use a sparse matrix to represent A, however, exp() of a sparse matrix is not sparse any longer. So in fact, I store x non-sparse and I calculate exp(full(A*x)) which turned out to be as fast/slow as full(exp(A*x)) (I think A*x is non-sparse anyway, since x is non-sparse.) exp(full(A*sparse(x))) is a way to have a sparse A*x, but is slower. Even slower variants are exp(A*sparse(x)) (with doubled memory impact for a non-sparse matrix of type sparse) and full(exp(A*sparse(x)) (which again yields a non-sparse result).

sx = sparse(x);
tic, for i = 1 : 10, exp(full(A*x)); end, toc
tic, for i = 1 : 10, full(exp(A*x)); end, toc
tic, for i = 1 : 10, exp(full(A*sx)); end, toc
tic, for i = 1 : 10, exp(A*sx); end, toc
tic, for i = 1 : 10, full(exp(A*sx)); end, toc

Elapsed time is 1.485935 seconds.
Elapsed time is 1.511304 seconds.
Elapsed time is 2.060104 seconds.
Elapsed time is 3.194711 seconds.
Elapsed time is 4.534749 seconds.

Yes, I do calculate element-wise exp, I update the above equation to reflect that.

One more edit: I tried to be smart, with little success:

tic, for i = 1 : 10, B = exp(A*x); end, toc
tic, for i = 1 : 10, C = 1 + full(spfun(@(x) exp(x) - 1, A * sx)); end, toc
tic, for i = 1 : 10, D = 1 + full(spfun(@(x) exp(x) - 1, A * x)); end, toc
tic, for i = 1 : 10, E = 1 + full(spfun(@(x) exp(x) - 1, sparse(A * x))); end, toc
tic, for i = 1 : 10, F = 1 + spfun(@(x) exp(x) - 1, A * sx); end, toc
tic, for i = 1 : 10, G = 1 + spfun(@(x) exp(x) - 1, A * x); end, toc
tic, for i = 1 : 10, H = 1 + spfun(@(x) exp(x) - 1, sparse(A * x)); end, toc

Elapsed time is 1.490776 seconds.
Elapsed time is 2.031305 seconds.
Elapsed time is 2.743365 seconds.
Elapsed time is 2.818630 seconds.
Elapsed time is 2.176082 seconds.
Elapsed time is 2.779800 seconds.
Elapsed time is 2.900107 seconds.

769

asked Apr 24 '14 09:04

bers

1 Answers

Computers don't really do exponents. You would think they do, but what they do is high-accuracy polynomial approximations.

References:

http://www.math.vanderbilt.edu/~esaff/texts/13.pdf
http://deepblue.lib.umich.edu/bitstream/handle/2027.42/33109/0000495.pdf
http://www.cs.yale.edu/homes/sachdeva/pubs/fast-algos-via-approx-theory.pdf

The last reference looked quite nice. Perhaps it should have been first.

Since you are working on images, you likely have discrete number of intensity levels (255 typically). This can allow reduced sampling, or lookups, depending on the nature of "A". One way to check this is to do something like the following for a sufficiently representative group of values of "x":

y=Ax
cdfplot(y(:))

If you were able to pre-segment your images into "more interesting" and "not as interesting" - like if you were looking at an x-ray being able to trim out all the "outside the human body" locations and clamp them to zero to pre-sparsify your data, that could reduce your number of unique values. You might consider the previous for each unique "mode" inside the data.

My approaches would include:

look at alternate formulations of exp(x) that are lower accuracy but higher speed
consider table lookups if you have few enough levels of "x"
consider a combination of interpolation and table lookups if you have "slightly too many" levels to do a table lookup
consider a single lookup (or alternate formulation) based on segmented mode. If you know it is a bone and are looking for a vein, then maybe it should get less high-cost data processing applied.

Now I have to ask myself why would you be living in so many iterations of exp(A*x)*x and I think you might be switching back and forth between frequency/wavenumber domain and time/space domain. You also might be dealing with probabilities using exp(x) as a basis, and doing some Bayesian fun. I don't know that exp(x) is a good conjugate prior, so I'm going to go with the fourier material.

Other options: - consider use of fft, fft2, or fftn given your matrices - they are fast and might do part of what you are looking for.

I am sure there is a forier domain variation on the following:

https://mathoverflow.net/questions/34173/fast-matrix-multiplication
http://www-cc.cs.uni-saarland.de/media/oldmaterial/bc.pdf
http://arxiv.org/PS_cache/math/pdf/0511/0511460v1.pdf

You might be able to mix the lookup with a compute using the woodbury matrix. I would have to think about that some to be sure though. (link) At one point I knew that everything that mattered (CFD, FEA, FFT) were all about the matrix inversion, but I have since forgotten the particular details.

Now, if you are living in MatLab then you might consider using "coder" which converts MatLab code to c-code. No matter how much fun an interpreter may be, a good c-compiler can be a lot faster. The mnemonic (hopefully not too ambitious) that I use is shown here: link starting around 13:49. It is really simple, but it shows the difference between a canonical interpreted language (python) and compiled version of the same (cython/c).

I'm sure that if I had some more specifics, and was requested to, then I could engage more aggressively in a more specifically relevant answer.

You might not have a good way to do it on conventional hardware, buy you might consider something like a GPGPU. CUDA and its peers have massively parallel operations that allow substantial speedup for the cost of a few video cards. You can have thousands of "cores" (overglorified pipelines) doing the work of a few ALU's and if the job is properly parallelizable (as this looks like) then it can get done a LOT faster.

EDIT:

I was thinking about Eureqa. One option that I would consider if I had some "big iron" for development but not production would be to use their Eureqa product to come up with a fast enough, accurate enough approximation.

If you performed a 'quick' singular value decomposition of your "A" matrix, you would find that the dominant performance is governed by 81 eigenvectors. I would look at the eigenvalues and see if there were only a few of those 81 eigenvectors providing the majority of the information. If that was the case, then you can clamp the others to zero, and construct a simple transformation.

Now, if it were me, I would want to get "A" out of the exponent. I'm wondering if you can look at the 81x81 eigenvector matrix and "x" and think a little about linear algebra, and what space you are projecting your vectors into. Is there any way that you can make a function that looks like the following:

f(x) = B2 * exp( B1 * x )

such that the

B1 * x

is much smaller rank than your current

Ax.

155

answered Oct 09 '22 22:10

EngrStudent

Related questions
                            
                                Java performance in numerical algorithms
                            
                                testing performance of a website
                            
                                Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?
                            
                                Detect how much stress the mysql database is currently under with PHP
                            
                                Detecting W3WP CPU issues using jetBrains dotTrace
                            
                                JS tween how to improve?
                            
                                Why does a hudson "mvn clean install" build take 3-6x longer than the same on the command line?
                            
                                Load Time on asp.net page cannot be accounted for
                            
                                Simple method call is really slow?
                            
                                Calling SQL Functions much slower when using SqlCommand Parameters
                            
                                Closest Pair Algorithm in JavaScript
                            
                                Speeding up array lookup after traversing?
                            
                                How can I speed $.offset() up?
                            
                                Liferay startup takes way too long
                            
                                Getting an advance warning before full GC
                            
                                In what situations can do-while be more efficient than while?
                            
                                Iterating speed and data type
                            
                                Is the as operator with a nullable value type unnecessarily slow?
                            
                                Algorithm to find the union of multiple strings grouped by character index
                            
                                Why is my bottom-up merge sort so slow in Java?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With