Matt Scarpino gives a good explanation (although he admits he's not sure it's the optimal algorithm, I offer him my gratitude) for how to multiply two complex doubles with Intel's AVX intrinsics. Here's his method, which I've verified:
__m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0);
__m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0);
__m256d neg = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);
/* Step 1: Multiply vec1 and vec2 */
__m256d vec3 = _mm256_mul_pd(vec1, vec2);
/* Step 2: Switch the real and imaginary elements of vec2 */
vec2 = _mm256_permute_pd(vec2, 0x5);
/* Step 3: Negate the imaginary elements of vec2 */
vec2 = _mm256_mul_pd(vec2, neg);
/* Step 4: Multiply vec1 and the modified vec2 */
__m256d vec4 = _mm256_mul_pd(vec1, vec2);
/* Horizontally subtract the elements in vec3 and vec4 */
vec1 = _mm256_hsub_pd(vec3, vec4);
/* Display the elements of the result vector */
double* res = (double*)&vec1;
printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);
My problem is that I want to square two complex doubles. I tried to use Matt's technique like so:
struct cmplx a;
struct cmplx b;
a.r = 2.5341;
a.i = 1.843;
b.r = 1.3941;
b.i = 0.93;
__m256d zzs = squareZ(a, b);
double* res = (double*) &zzs;
printf("\nA: %f + %f, B: %f + %f\n", res[0], res[1], res[2], res[3]);
Using Haskell's complex arithmetic, I have verified the results are correct except, as you can see, the real part of B:
A: 3.025014 + 9.340693, B: 0.000000 + 2.593026
So I have two questions really: is there a better (simpler and/or faster) way to square two complex doubles with AVX intrinsics? If not, how can I modify Matt's code to do it?
The __m256 data type is used to represent the contents of the extended SSE register - the YMM register, used by the Intel® AVX intrinsics.
AVX provides intrinsic functions that combine one or more values into a 256-bit vector. Table 2 lists their names and provides a description of each. There are similar intrinsics that initialize 128-bit vectors, but those are provided by SSE, not AVX.
The immintrin. h header file defines a set of data types that represent different types of vectors. These are; __m256 : This is a vector of eight floating point numbers (8x32 = 256 bits)
Ideally, store your data in separate real and imaginary arrays, so you can just load contiguous vectors of real and imaginary parts. That makes it free to do the cross-multiplying (just use different registers / variables) instead of having to shuffle things around within a vector.
You can convert between interleaved double complex
style and SIMD-friendly separate-vectors style on the fly fairly cheaply, subject to the vagaries of AVX in-lane shuffles. e.g. very cheaply with unpacklo / unpackhi shuffles to de-interleave or to re-interleave within a lane, if you don't care about the actual order of the data within the temporary vector.
It's actually so cheap to do this shuffle that doing it on the fly for a single complex multiply comes out somewhat ahead of (even a tweaked version of) Matt's code, especially on CPUs that support FMA. This requires producing results in groups of 4 complex doubles (2 result vectors).
If you need to produce only one result vector at a time, I also came up with an alternative to Matt's algorithm that can use FMA (actually FMADDSUB) and avoid the separate sign-change insn.
gcc auto-vectorizes simple complex multiply scalar loop to pretty good code, as long as you use -ffast-math
. It deinterleaves like I suggested.
#include <complex.h>
// even with -ffast-math -ffp-contract=fast, clang doesn't manage to use vfmaddsubpd, instead using vmulpd and vaddsubpd :(
// gcc does use FMA though.
// auto-vectorizes with a lot of extra shuffles
void cmul(double complex *restrict dst,
const double complex *restrict A, const double complex *restrict B)
{ // clang and gcc change strategy slightly for i<1 or i<2, vs. i<4
for (int i=0; i<4 ; i++) {
dst[i] = A[i] * B[i];
}
}
See the asm on the Godbolt compiler explorer. I'm not sure how good clang's asm is; it uses a lot of 64b->128b VMODDDUP broadcast-loads. This form is handled purely in the load ports on Intel CPUs (see Agner Fog's insn tables), but it's still a lot of operations. As mentioned earlier, gcc uses 4 VPERMPD shuffles to reorder within lanes before multiplying / FMA, then another 4 VPERMPD to reorder the results before combining them with VSHUFPD. This is 8 extra shuffles for 4 complex multiplies.
Converting gcc's version back to intrinsics and removing the redundant shuffles gives optimal code. (gcc apparently wants its temporaries to be in A B C D order instead of the A C B D order resulting from the in-lane behaviour of VUNPCKLPD (_mm256_unpacklo_pd
)).
I put the code on Godbolt, along with a tweaked version of Matt's code. So you can play around with different compiler options, and also different compiler versions.
// multiplies 4 complex doubles each from A and B, storing the result in dst[0..3]
void cmul_manualvec(double complex *restrict dst,
const double complex *restrict A, const double complex *restrict B)
{
// low element first, little-endian style
__m256d A0 = _mm256_loadu_pd((double*)A); // [A0r A0i A1r A1i ] // [a b c d ]
__m256d A2 = _mm256_loadu_pd((double*)(A+2)); // [e f g h ]
__m256d realA = _mm256_unpacklo_pd(A0, A2); // [A0r A2r A1r A3r ] // [a e c g ]
__m256d imagA = _mm256_unpackhi_pd(A0, A2); // [A0i A2i A1i A3i ] // [b f d h ]
// the in-lane behaviour of this interleaving is matched by the same in-lane behaviour when we recombine.
__m256d B0 = _mm256_loadu_pd((double*)B); // [m n o p]
__m256d B2 = _mm256_loadu_pd((double*)(B+2)); // [q r s t]
__m256d realB = _mm256_unpacklo_pd(B0, B2); // [m q o s]
__m256d imagB = _mm256_unpackhi_pd(B0, B2); // [n r p t]
// desired: real=rArB - iAiB, imag=rAiB + rBiA
__m256d realprod = _mm256_mul_pd(realA, realB);
__m256d imagprod = _mm256_mul_pd(imagA, imagB);
__m256d rAiB = _mm256_mul_pd(realA, imagB);
__m256d rBiA = _mm256_mul_pd(realB, imagA);
// gcc and clang will contract these nto FMA. (clang needs -ffp-contract=fast)
// Doing it manually would remove the option to compile for non-FMA targets
__m256d real = _mm256_sub_pd(realprod, imagprod); // [D0r D2r | D1r D3r]
__m256d imag = _mm256_add_pd(rAiB, rBiA); // [D0i D2i | D1i D3i]
// interleave the separate real and imaginary vectors back into packed format
__m256d dst0 = _mm256_shuffle_pd(real, imag, 0b0000); // [D0r D0i | D1r D1i]
__m256d dst2 = _mm256_shuffle_pd(real, imag, 0b1111); // [D2r D2i | D3r D3i]
_mm256_storeu_pd((double*)dst, dst0);
_mm256_storeu_pd((double*)(dst+2), dst2);
}
Godbolt asm output: gcc6.2 -O3 -ffast-math -ffp-contract=fast -march=haswell
vmovupd ymm0, YMMWORD PTR [rsi+32]
vmovupd ymm3, YMMWORD PTR [rsi]
vmovupd ymm1, YMMWORD PTR [rdx]
vunpcklpd ymm5, ymm3, ymm0
vunpckhpd ymm3, ymm3, ymm0
vmovupd ymm0, YMMWORD PTR [rdx+32]
vunpcklpd ymm4, ymm1, ymm0
vunpckhpd ymm1, ymm1, ymm0
vmulpd ymm2, ymm1, ymm3
vmulpd ymm0, ymm4, ymm3
vfmsub231pd ymm2, ymm4, ymm5 # separate mul/sub contracted into FMA
vfmadd231pd ymm0, ymm1, ymm5
vunpcklpd ymm1, ymm2, ymm0
vunpckhpd ymm0, ymm2, ymm0
vmovupd YMMWORD PTR [rdi], ymm1
vmovupd YMMWORD PTR [rdi+32], ymm0
vzeroupper
ret
For 4 complex multiplies (of 2 pairs of input vectors), my code uses:
So for throughput, this completely bottlenecks on the shuffle port. (1 shuffle per clock throughput, vs. 2 total MUL/FMA/ADD per clock on Intel Haswell and later). This is why packed storage is horrible: shuffles have limited throughput, and spending more instructions shuffling than on doing math is not good.
Matt Scarpino's code with my minor tweaks (repeated to do 4 complex multiplies). (See below for my rewrite of producing one vector at a time more efficiently).
-0.0
).The major advantage of Matt's code is that it works with just one vector-width of complex multiplies at once, instead of requiring you to have 4 input vectors of data. It has somewhat lower latency. But note that my version doesn't need the 2 pairs of input vectors to be from contiguous memory, or related to each other at all. They get mixed together while processing, but the result is 2 separate 32B vectors.
My tweaked version of Matt's code is nearly as good (as the 4-at-a-time version) on CPUs without FMA (just costing an extra VXORPD), but significantly worse when it stops us from taking advantage of FMA. Also, it never has the results available in non-packed form, so you can't use the separated form as input to another multiply and skip the shuffling.
Don't use this if you're sometimes squaring, instead of multiplying two different complex numbers. This is like Matt's algorithm in that common subexpression elimination doesn't simplify.
I haven't typed in the C intrinsics for this, just worked out the data movement. Since all the shuffles are in-lane, I'll only show the low lane. Use the 256b versions of the relevant instructions to do the same shuffle in both lanes. They stay separate.
// MULTIPLY: for each AVX lane: am-bn, an+bm
r i r i
a b c d // input vectors: a + b*i, etc.
m n o p
Algorithm:
create bm bn
with movshdup(a b) + mulpd
create bn bm
with shufpd on the previous result. (or create n m
with a shuffle before the mul)
create a a
with movsldup(a b)
use fmaddsubpd to produce the final result: [a|a]*[m|n] -/+ [bn|bm]
.
Yes, SSE/AVX has ADDSUBPD to do alternating subtract/add in even/odd elements (in that order, presumably because of this use-case). FMA includes FMADDSUB132PD which subtracts and adds, (and the reverse, FMSUBADD which adds and subtracts).
Per 4 results: 6x shuffle, 2x mul, 2xfmaddsub. So unless I got something wrong, it's as efficient as the deinterleave method (when not squaring the same number). Skylake latency = 10c = 1+4+1 to create bn bm
(overlapping with 1 cycle to create a a
), + 4 (FMA). So it's one cycle lower latency than Matt's.
On Bulldozer-family, it would be a win to shuffle both inputs to the first mul, so the mul->fmaddsub critical path stays inside the FMA domain (1 cycle lower latency). Doing it the other way helps stop silly compilers from making resource conflicts by doing the movsldup(a b) too early, and delaying the mulpd. (In a loop, though, many iterations will be in flight and bottleneck on the shuffle port.)
This is still better than Matt's for squaring (still save the XOR, and can use FMA), but we don't save any shuffles:
// SQUARING: for each AVX lane: aa-bb, 2*ab
// ab bb // movshdup + mul
// bb ab // ^ -> shufpd
// a a // movsldup
// aa-bb ab+ab // fmaddsubpd : [a|a]*[a|b] -/+ [bb|ab]
// per 4 results: 6x shuffle, 2x mul, 2xfmaddsub
I also played around with some possibilities like (a+b) * (a+b) = aa+2ab+bb
, or (r-i)*(r+i) = rr - ii
but didn't get anywhere. Rounding between steps means that FP math doesn't cancel perfectly, so doing something like this wouldn't even produce exactly identical results.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With