 

How to optimize this product of three matrices in C++ for x86?

I have a key algorithm in which most of its runtime is spent on calculating a dense matrix product:

A*A'*Y, where: A is an m-by-n matrix, 
               A' is its conjugate transpose,
               Y is an m-by-k matrix

Typical characteristics:
    - k is much smaller than both m and n (k is typically < 10)
    - m in the range [500, 2000]
    - n in the range [100, 1000]

Based on these dimensions, the lessons of the matrix chain multiplication problem make it clear that it's optimal, in a number-of-operations sense, to structure the computation as A*(A'*Y). My current implementation does that, and the performance boost from simply forcing that association onto the expression is noticeable.

My application is written in C++ for the x86_64 platform. I'm using the Eigen linear algebra library, with Intel's Math Kernel Library as a backend. Eigen is able to use MKL's BLAS interface to perform the multiplication, and the boost from moving from Eigen's native SSE2 implementation to Intel's optimized, AVX-based implementation on my Sandy Bridge machine is also significant.
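For concreteness, the hot expression in my current implementation looks roughly like this (a simplified sketch with an illustrative function name; double precision assumed):

    #include <Eigen/Dense>

    // Sketch of the current approach: the parentheses force the cheap
    // association A * (A' * Y). Eigen dispatches each product to MKL's
    // xGEMM and materializes the n-by-k partial product as a temporary.
    Eigen::MatrixXd product_AAtY(const Eigen::MatrixXd& A,   // m-by-n
                                 const Eigen::MatrixXd& Y) { // m-by-k
        return A * (A.adjoint() * Y);                        // m-by-k result
    }

If allocating the temporary matters, preallocating an n-by-k scratch matrix T and writing T.noalias() = A.adjoint() * Y; R.noalias() = A * T; avoids the per-call allocation, but the two xGEMM calls remain.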

However, the expression A * (A.adjoint() * Y) (written in Eigen parlance) gets decomposed into two general matrix-matrix products (calls to the xGEMM BLAS routine), with a temporary matrix created in between. I'm wondering whether a specialized implementation that evaluates the entire expression at once could be faster than the generic one I have now. A couple of observations that lead me to believe this:

  • Using the typical dimensions described above, the input matrix A usually won't fit in cache. Therefore, the specific memory access pattern used to calculate the three-matrix product would be key. Obviously, avoiding the creation of a temporary matrix for the partial product would also be advantageous.

  • A and its conjugate transpose obviously share the same underlying data, a relationship that could possibly be exploited to improve the memory access pattern for the overall expression.

Are there any standard techniques for implementing this sort of expression in a cache-friendly way? Most optimization techniques that I've found for matrix multiplication are for the standard A*B case, not larger expressions. I'm comfortable with the micro-optimization aspects of the problem, such as translating into the appropriate SIMD instruction sets, but I'm looking for any references out there for breaking this structure down in the most memory-friendly manner possible.

Edit: Based on the responses that have come in thus far, I think I was a bit unclear above. The fact that I'm using C++/Eigen is really just an implementation detail from my perspective on this problem. Eigen does a great job of implementing expression templates, but evaluating this type of problem as a simple expression just isn't supported (only products of 2 general dense matrices are).

At a higher level than how the expressions would be evaluated by a compiler, I'm looking for a more efficient mathematical breakdown of the composite multiplication operation, with an eye toward avoiding redundant memory accesses due to the common structure of A and its conjugate transpose. The result would likely be difficult to implement efficiently in pure Eigen, so I would probably implement it in a specialized routine with SIMD intrinsics.
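For illustration, here is a rough sketch of the kind of fused, panel-blocked evaluation I have in mind. It relies on the identity A*(A'*Y) = sum over column panels A_j of A_j*(A_j'*Y), so each panel of A serves both partial products while it is hot in cache. The panel width of 64 is an arbitrary placeholder that would need tuning:

    #include <Eigen/Dense>
    #include <algorithm>

    // Each m-by-b column panel A_j is used for both A_j' * Y and the
    // accumulation A_j * T_j while it is cache-resident, instead of
    // streaming all of A through memory twice.
    Eigen::MatrixXd fused_AAtY(const Eigen::MatrixXd& A,   // m-by-n
                               const Eigen::MatrixXd& Y,   // m-by-k
                               Eigen::Index panel = 64) {
        const Eigen::Index n = A.cols(), k = Y.cols();
        Eigen::MatrixXd R = Eigen::MatrixXd::Zero(A.rows(), k);
        Eigen::MatrixXd T(panel, k);  // reused scratch for A_j' * Y
        for (Eigen::Index j = 0; j < n; j += panel) {
            const Eigen::Index b = std::min(panel, n - j);
            auto Aj = A.middleCols(j, b);                // m-by-b panel
            T.topRows(b).noalias() = Aj.adjoint() * Y;   // b-by-k partial
            R.noalias() += Aj * T.topRows(b);            // accumulate
        }
        return R;
    }

Whether this can actually beat two well-tuned xGEMM calls is exactly what I'm unsure about; the point is only that each panel of A is loaded from main memory once rather than twice.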

Asked Feb 27 '14 by Jason R

1 Answer

This is not a full answer (yet - and I'm not sure it will become one).

Let's think about the math a little first. Since matrix multiplication is associative, we can do either (A*A')*Y or A*(A'*Y).

Floating point operations for (A*A')*Y

2*m*n*m + 2*m*m*k //the twos come from addition and multiplication

Floating point operations for A*(A'*Y)

2*m*n*k + 2*m*n*k = 4*m*n*k

Since k is much smaller than m and n, it's clear why the second case is much faster.

But A*A' is Hermitian, so by symmetry we could in principle cut the number of calculations for A*A' in half (though this might not be easy to do with SIMD), which would reduce the floating point operations of (A*A')*Y to

m*n*m + 2*m*m*k.

We know that both m and n are larger than k. To find the break-even point, set m = n = z and solve for where the two cases have equal cost:

z*z*z + 2*z*z*k = 4*z*z*k  //now divide both sides by z*z
z + 2*k = 4*k
z = 2*k.

So as long as m and n are both more than twice k, the second case will have fewer floating point operations. In your case m and n are both at least 100 and k is less than 10, so case two uses far fewer floating point operations.
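To put numbers on it, take m = 1000, n = 500, k = 8 from the ranges in the question: case one (without the symmetry trick) costs 2*1000*500*1000 + 2*1000*1000*8 ≈ 1.02e9 floating point operations, while case two costs 4*1000*500*8 = 1.6e7, roughly a 60x reduction.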

In terms of efficient code: if the code is optimized for efficient use of the cache (as MKL and Eigen are), then large dense matrix multiplication is compute bound, not memory bound, so you don't have to worry about the cache. MKL is faster than Eigen because MKL uses AVX (and maybe FMA3 now?).

I don't think you will be able to do this more efficiently than what you're already doing with the second case and MKL (through Eigen). Enable OpenMP to get maximum FLOPS.
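For example, the thread counts can be pinned explicitly (a sketch; mkl_set_num_threads is MKL's C API, and Eigen::setNbThreads controls Eigen's own OpenMP-parallelized products when MKL is not handling the call):

    #include <Eigen/Core>
    #include <mkl.h>

    void configure_threads(int num_cores) {
        mkl_set_num_threads(num_cores);  // threads used by MKL's xGEMM
        Eigen::setNbThreads(num_cores);  // threads used by Eigen's native GEMM
    }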

You should calculate your efficiency by comparing the FLOPS you measure to the peak FLOPS of your processor. Assuming you have a Sandy Bridge/Ivy Bridge processor, the peak SP FLOPS is

frequency * number of physical cores * 8 (8-wide AVX SP) * 2 (addition + multiplication)

For double precision, divide by two. If you have Haswell and MKL uses FMA then double the peak FLOPS. To get the frequency right, you have to use the turbo boost value for all cores (it's lower than the single-core turbo). You can look this up if you have not overclocked your system, or use CPU-Z on Windows or Powertop on Linux if you have.
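For example (illustrative numbers, not measurements from your machine): a quad-core Sandy Bridge with a 3.4 GHz all-core turbo peaks at 3.4e9 * 4 * 8 * 2 = 217.6 SP GFLOPS, or 108.8 DP GFLOPS.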

Answered Sep 19 '22 by Z boson