How to square two complex doubles with 256-bit AVX vectors?

Matt Scarpino gives a good explanation (although he admits he's not sure it's the optimal algorithm; I offer him my gratitude) of how to multiply two complex doubles with Intel's AVX intrinsics. Here's his method, which I've verified:

__m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0);
__m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0);
__m256d neg  = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);

/* Step 1: Multiply vec1 and vec2 */
__m256d vec3 = _mm256_mul_pd(vec1, vec2);

/* Step 2: Switch the real and imaginary elements of vec2 */
vec2 = _mm256_permute_pd(vec2, 0x5);

/* Step 3: Negate the imaginary elements of vec2 */
vec2 = _mm256_mul_pd(vec2, neg);  

/* Step 4: Multiply vec1 and the modified vec2 */
__m256d vec4 = _mm256_mul_pd(vec1, vec2);

/* Horizontally subtract the elements in vec3 and vec4 */
vec1 = _mm256_hsub_pd(vec3, vec4);

/* Display the elements of the result vector */
double* res = (double*)&vec1;
printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);

My problem is that I want to square two complex doubles. I tried to use Matt's technique like so:

struct cmplx a;
struct cmplx b;

a.r = 2.5341;
a.i = 1.843;

b.r = 1.3941;
b.i = 0.93;

__m256d zzs = squareZ(a, b);

double* res = (double*) &zzs;

printf("\nA: %f + %f,  B: %f + %f\n", res[0], res[1], res[2], res[3]);

Using Haskell's complex arithmetic, I have verified the results are correct except, as you can see, the real part of B:

A: 3.025014 + 9.340693,  B: 0.000000 + 2.593026

So I have two questions really: is there a better (simpler and/or faster) way to square two complex doubles with AVX intrinsics? If not, how can I modify Matt's code to do it?

asked Sep 15 '16 by sacheie


1 Answer

This answer covers the general case of multiplying two arrays of complex numbers.

Ideally, store your data in separate real and imaginary arrays, so you can just load contiguous vectors of real and imaginary parts. That makes it free to do the cross-multiplying (just use different registers / variables) instead of having to shuffle things around within a vector.
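
For example, with split storage the inner loop needs no shuffles at all. Here's a minimal sketch, assuming FMA is available and using placeholder names (Dr/Di, Ar/Ai, Br/Bi for the split destination/input arrays, n a multiple of 4):

#include <immintrin.h>
#include <stddef.h>

// Sketch: multiply n complex doubles kept as separate real/imag arrays (no shuffles).
// The array names and FMA usage are assumptions, not from the question.
void cmul_split(double *restrict Dr, double *restrict Di,
                const double *restrict Ar, const double *restrict Ai,
                const double *restrict Br, const double *restrict Bi, size_t n)
{
  for (size_t k = 0; k < n; k += 4) {
    __m256d ra = _mm256_loadu_pd(Ar + k);
    __m256d ia = _mm256_loadu_pd(Ai + k);
    __m256d rb = _mm256_loadu_pd(Br + k);
    __m256d ib = _mm256_loadu_pd(Bi + k);
    // real = ra*rb - ia*ib,   imag = ra*ib + rb*ia
    __m256d re = _mm256_fmsub_pd(ra, rb, _mm256_mul_pd(ia, ib));
    __m256d im = _mm256_fmadd_pd(ra, ib, _mm256_mul_pd(rb, ia));
    _mm256_storeu_pd(Dr + k, re);
    _mm256_storeu_pd(Di + k, im);
  }
}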

You can convert between interleaved double complex style and SIMD-friendly separate-vectors style on the fly fairly cheaply, subject to the vagaries of AVX in-lane shuffles. e.g. very cheaply with unpacklo / unpackhi shuffles to de-interleave or to re-interleave within a lane, if you don't care about the actual order of the data within the temporary vector.

It's actually so cheap to do this shuffle that doing it on the fly for a single complex multiply comes out somewhat ahead of (even a tweaked version of) Matt's code, especially on CPUs that support FMA. This requires producing results in groups of 4 complex doubles (2 result vectors).

If you need to produce only one result vector at a time, I also came up with an alternative to Matt's algorithm that can use FMA (actually FMADDSUB) and avoid the separate sign-change insn.


gcc auto-vectorizes a simple complex-multiply scalar loop to pretty good code, as long as you use -ffast-math. It deinterleaves like I suggested.

#include <complex.h>

// even with -ffast-math -ffp-contract=fast, clang doesn't manage to use vfmaddsubpd, instead using vmulpd and vaddsubpd :(
// gcc does use FMA though.

// auto-vectorizes with a lot of extra shuffles
void cmul(double complex *restrict dst,
          const double complex *restrict A, const double complex *restrict B)
{       // clang and gcc change strategy slightly for i<1 or i<2, vs. i<4
  for (int i=0; i<4 ; i++) {
    dst[i] = A[i] * B[i];
  }
}

See the asm on the Godbolt compiler explorer. I'm not sure how good clang's asm is; it uses a lot of 64b->128b VMOVDDUP broadcast-loads. This form is handled purely in the load ports on Intel CPUs (see Agner Fog's insn tables), but it's still a lot of operations. As mentioned earlier, gcc uses 4 VPERMPD shuffles to reorder within lanes before multiplying / FMA, then another 4 VPERMPD to reorder the results before combining them with VSHUFPD. This is 8 extra shuffles for 4 complex multiplies.


Converting gcc's version back to intrinsics and removing the redundant shuffles gives optimal code. (gcc apparently wants its temporaries to be in A B C D order instead of the A C B D order resulting from the in-lane behaviour of VUNPCKLPD (_mm256_unpacklo_pd)).

I put the code on Godbolt, along with a tweaked version of Matt's code. So you can play around with different compiler options, and also different compiler versions.

// multiplies 4 complex doubles each from A and B, storing the result in dst[0..3]
void cmul_manualvec(double complex *restrict dst,
          const double complex *restrict A, const double complex *restrict B)
{
                                         // low element first, little-endian style
  __m256d A0 = _mm256_loadu_pd((double*)A);    // [A0r A0i  A1r A1i ] // [a b c d ]
  __m256d A2 = _mm256_loadu_pd((double*)(A+2));                       // [e f g h ]
  __m256d realA = _mm256_unpacklo_pd(A0, A2);  // [A0r A2r  A1r A3r ] // [a e c g ]
  __m256d imagA = _mm256_unpackhi_pd(A0, A2);  // [A0i A2i  A1i A3i ] // [b f d h ]
  // the in-lane behaviour of this interleaving is matched by the same in-lane behaviour when we recombine.
  
  __m256d B0 = _mm256_loadu_pd((double*)B);                           // [m n o p]
  __m256d B2 = _mm256_loadu_pd((double*)(B+2));                       // [q r s t]
  __m256d realB = _mm256_unpacklo_pd(B0, B2);                         // [m q o s]
  __m256d imagB = _mm256_unpackhi_pd(B0, B2);                         // [n r p t]

  // desired:  real=rArB - iAiB,  imag=rAiB + rBiA
  __m256d realprod = _mm256_mul_pd(realA, realB);
  __m256d imagprod = _mm256_mul_pd(imagA, imagB);
  
  __m256d rAiB     = _mm256_mul_pd(realA, imagB);
  __m256d rBiA     = _mm256_mul_pd(realB, imagA);

  // gcc and clang will contract these into FMA.  (clang needs -ffp-contract=fast)
  // Doing it manually would remove the option to compile for non-FMA targets
  __m256d real     = _mm256_sub_pd(realprod, imagprod);  // [D0r D2r | D1r D3r]
  __m256d imag     = _mm256_add_pd(rAiB, rBiA);          // [D0i D2i | D1i D3i]

  // interleave the separate real and imaginary vectors back into packed format
  __m256d dst0 = _mm256_shuffle_pd(real, imag, 0b0000);  // [D0r D0i | D1r D1i]
  __m256d dst2 = _mm256_shuffle_pd(real, imag, 0b1111);  // [D2r D2i | D3r D3i]
  _mm256_storeu_pd((double*)dst, dst0);
  _mm256_storeu_pd((double*)(dst+2), dst2);   
}

  Godbolt asm output: gcc6.2 -O3 -ffast-math -ffp-contract=fast -march=haswell
    vmovupd         ymm0, YMMWORD PTR [rsi+32]
    vmovupd         ymm3, YMMWORD PTR [rsi]
    vmovupd         ymm1, YMMWORD PTR [rdx]
    vunpcklpd       ymm5, ymm3, ymm0
    vunpckhpd       ymm3, ymm3, ymm0
    vmovupd         ymm0, YMMWORD PTR [rdx+32]
    vunpcklpd       ymm4, ymm1, ymm0
    vunpckhpd       ymm1, ymm1, ymm0
    vmulpd          ymm2, ymm1, ymm3
    vmulpd          ymm0, ymm4, ymm3
    vfmsub231pd     ymm2, ymm4, ymm5     # separate mul/sub contracted into FMA
    vfmadd231pd     ymm0, ymm1, ymm5
    vunpcklpd       ymm1, ymm2, ymm0
    vunpckhpd       ymm0, ymm2, ymm0
    vmovupd         YMMWORD PTR [rdi], ymm1
    vmovupd         YMMWORD PTR [rdi+32], ymm0
    vzeroupper
    ret

For 4 complex multiplies (of 2 pairs of input vectors), my code uses:

  • 4 loads (32B each)
  • 2 stores (32B each)
  • 6 in-lane shuffles (one for each input vector, one for each output)
  • 2 VMULPD
  • 2 VFMA...something
  • (only 4 shuffles if we can use the results in separated real and imag vectors, or 0 shuffles if the inputs are already in this format, too)
  • latency on Intel Skylake (not counting loads/stores): 14 cycles = 4c for 4 shuffles until the second VMULPD can start + 4 cycles (second VMULPD) + 4c (second vfmadd231pd) + 1c (shuffle first result vector ready 1c earlier) + 1c (shuffle second result vector)

So for throughput, this completely bottlenecks on the shuffle port. (1 shuffle per clock throughput, vs. 2 total MUL/FMA/ADD per clock on Intel Haswell and later). This is why packed storage is horrible: shuffles have limited throughput, and spending more instructions shuffling than on doing math is not good.


Matt Scarpino's code with my minor tweaks (repeated to do 4 complex multiplies). (See below for my rewrite that produces one vector at a time more efficiently.)

  • the same 6 loads/stores
  • 6 in-lane shuffles (HSUBPD is 2 shuffles and a subtract on current Intel and AMD CPUs)
  • 4 multiplies
  • 2 subtracts (which can't combine with the muls into FMAs)
  • An extra instruction (+ a constant) to flip the sign of the imaginary elements. Matt chose to multiply by 1.0 or -1.0, but the efficient choice is to XOR the sign bit (i.e. XORPD with -0.0); see the sketch after this list.
  • latency on Intel Skylake for the first result vector: 11 cycles. 1c (vpermilpd and vxorpd in the same cycle) + 4c (second vmulpd) + 6c (vhsubpd). The first vmulpd overlaps with other ops, starting in the same cycle as the shuffle and vxorpd. Computation of a second result vector should interleave pretty nicely.
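
For reference, the XOR sign-flip as intrinsics: a sketch reusing the variable names from Matt's snippet above.

/* Step 3 (tweaked): flip the sign bit of the imaginary elements with XORPD
   instead of multiplying by +/-1.0 */
__m256d sign_mask = _mm256_setr_pd(0.0, -0.0, 0.0, -0.0);
vec2 = _mm256_xor_pd(vec2, sign_mask);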

The major advantage of Matt's code is that it works with just one vector-width of complex multiplies at once, instead of requiring you to have 4 input vectors of data. It has somewhat lower latency. But note that my version doesn't need the 2 pairs of input vectors to be from contiguous memory, or related to each other at all. They get mixed together while processing, but the result is 2 separate 32B vectors.

My tweaked version of Matt's code is nearly as good as the 4-at-a-time version on CPUs without FMA (just costing an extra VXORPD), but significantly worse on CPUs with FMA, since it stops us from taking advantage of it. Also, it never has the results available in non-packed form, so you can't use the separated form as input to another multiply and skip the shuffling.


One vector result at a time, with FMA:

Don't use this if you're squaring rather than multiplying two different complex numbers: like Matt's algorithm, it doesn't get any simpler through common-subexpression elimination when both inputs are the same (but see the squaring variant below).

I haven't typed in the C intrinsics for this, just worked out the data movement (an untested sketch follows the operation counts below). Since all the shuffles are in-lane, I'll only show the low lane. Use the 256b versions of the relevant instructions to do the same shuffle in both lanes. They stay separate.

// MULTIPLY: for each AVX lane: am-bn, an+bm  
 r i  r i
 a b  c d       // input vectors: a + b*i, etc.
 m n  o p

Algorithm:

  • create bm bn with movshdup(a b) + mulpd

  • create bn bm with shufpd on the previous result. (or create n m with a shuffle before the mul)

  • create a a with movsldup(a b)

  • use fmaddsubpd to produce the final result: [a|a]*[m|n] -/+ [bn|bm].

    Yes, SSE/AVX has ADDSUBPD to do alternating subtract/add in even/odd elements (in that order, presumably because of this use-case). FMA includes FMADDSUB132PD which subtracts and adds, (and the reverse, FMSUBADD which adds and subtracts).

Per 4 results: 6x shuffle, 2x mul, 2xfmaddsub. So unless I got something wrong, it's as efficient as the deinterleave method (when not squaring the same number). Skylake latency = 10c = 1+4+1 to create bn bm (overlapping with 1 cycle to create a a), + 4 (FMA). So it's one cycle lower latency than Matt's.
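
An untested sketch of what those intrinsics might look like (my own names; the in-lane duplicate/swap shuffles are written with _mm256_permute_pd and _mm256_movedup_pd, and _mm256_fmaddsub_pd needs FMA):

#include <immintrin.h>

// Sketch: multiply 2+2 interleaved complex doubles, one result vector at a time.
// A = [a b | c d], B = [m n | o p]   (real, imag interleaved; both lanes treated alike)
static inline __m256d cmul_fmaddsub(__m256d A, __m256d B)
{
  __m256d A_im  = _mm256_permute_pd(A, 0b1111);      // [b b | d d]   duplicate imag of A
  __m256d bm_bn = _mm256_mul_pd(A_im, B);            // [bm bn | do dp]
  __m256d bn_bm = _mm256_permute_pd(bm_bn, 0b0101);  // [bn bm | dp do]  swap within lanes
  __m256d A_re  = _mm256_movedup_pd(A);              // [a a | c c]   duplicate real of A
  // fmaddsub: subtract in even elements, add in odd -> [am-bn an+bm | co-dp cp+do]
  return _mm256_fmaddsub_pd(A_re, B, bn_bm);
}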

On Bulldozer-family, it would be a win to shuffle both inputs to the first mul, so the mul->fmaddsub critical path stays inside the FMA domain (1 cycle lower latency). Doing it the other way helps stop silly compilers from making resource conflicts by doing the movsldup(a b) too early, and delaying the mulpd. (In a loop, though, many iterations will be in flight and bottleneck on the shuffle port.)


This is still better than Matt's for squaring (still save the XOR, and can use FMA), but we don't save any shuffles:

// SQUARING: for each AVX lane: aa-bb, 2*ab
// ab bb  // movshdup + mul
// bb ab  // ^ -> shufpd

// a  a          // movsldup
// aa-bb  ab+ab  // fmaddsubpd : [a|a]*[a|b] -/+ [bb|ab]
// per 4 results: 6x shuffle, 2x mul, 2xfmaddsub
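
The squaring variant as an untested sketch, under the same assumptions (my own names):

// Sketch: square 2+2 interleaved complex doubles: (a + b*i)^2 = (aa-bb) + 2ab*i
static inline __m256d csquare_fmaddsub(__m256d Z)    // Z = [a b | c d]
{
  __m256d Z_im  = _mm256_permute_pd(Z, 0b1111);      // [b b | d d]
  __m256d ab_bb = _mm256_mul_pd(Z_im, Z);            // [ab bb | cd dd]
  __m256d bb_ab = _mm256_permute_pd(ab_bb, 0b0101);  // [bb ab | dd cd]
  __m256d Z_re  = _mm256_movedup_pd(Z);              // [a a | c c]
  // [a*a - bb, a*b + ab | c*c - dd, c*d + cd] = [aa-bb 2ab | cc-dd 2cd]
  return _mm256_fmaddsub_pd(Z_re, Z, bb_ab);
}

For the question's two complex doubles, you could pack them as _mm256_setr_pd(a.r, a.i, b.r, b.i) and call this once.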

I also played around with some possibilities like (a+b) * (a+b) = aa+2ab+bb, or (r-i)*(r+i) = rr - ii but didn't get anywhere. Rounding between steps means that FP math doesn't cancel perfectly, so doing something like this wouldn't even produce exactly identical results.

answered by Peter Cordes