Assembly code/AVX instructions for multiplication of complex numbers. (GCC inline assembly)

Question

We're running a scientific program and would like to use AVX. The whole program (written in Fortran and C) is going to be vectorized, and at the moment I'm trying to implement complex number multiplication in GCC inline assembly.

The assembly code takes 4 complex numbers and performs two complex multiplications at once:

v2complex cmult(v2complex *a, v2complex *b) {
    v2complex ret;
    asm (
        "vmovupd %2,%%ymm1;"
        "vmovupd %2, %%ymm2;"
        "vmovddup %%ymm2, %%ymm2;"
        "vshufpd $15,%%ymm1,%%ymm1,%%ymm1;"
        "vmulpd %1, %%ymm2, %%ymm2;"
        "vmulpd %1, %%ymm1, %%ymm1;"
        "vshufpd $5,%%ymm1,%%ymm1, %%ymm1;"
        "vaddsubpd %%ymm1, %%ymm2,%%ymm1;"
        "vmovupd %%ymm1, %0;"
        :
        "=m"(ret)
        :
        "m" (*a),
        "m" (*b)
        );
    return ret;
}

where a and b each hold two double-precision complex numbers (256 bits in total):

typedef union v2complex {
    __m256d v;
    complex c[2];
} v2complex;

The problem is that the code mostly produces the correct result, but sometimes it fails.

I am very new to assembly, but I tried to figure it out by myself. It seems that the C program (compiled with -O3) interacts with the ymm registers used in the assembly code. For instance, if I printf one of the values (e.g. a) before executing the multiplication, the program never gives wrong results.

My question is how to tell GCC not to interact with the ymm registers. I did not manage to add them to the clobber list.

Stephen Canon · Accepted Answer

As you surmise, the problem is that you haven’t told GCC which registers you are clobbering. I’m surprised if they don’t yet support putting YMM registers in the clobber list; what version of GCC are you using?

In any event, it will almost certainly suffice to put the corresponding XMM registers in the clobber list instead:

: "=m" (ret) : "m" (*a), "m" (*b) : "%xmm1", "%xmm2");
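For reference, here is a minimal sketch of the function from the question with that clobber list added and nothing else changed. To keep it compilable without -mavx, it assumes a plain struct for v2complex (the __m256d union member from the question is dropped), and it assumes an x86 CPU with AVX at run time.

```c
#include <assert.h>
#include <complex.h>

/* Assumption for this sketch: same layout as the question's union
   (re0, im0, re1, im1), but without the __m256d member. */
typedef struct v2complex {
    double complex c[2];
} v2complex;

static_assert(sizeof(v2complex) == 4 * sizeof(double),
              "v2complex must be two packed double complex values");

v2complex cmult(const v2complex *a, const v2complex *b) {
    v2complex ret;
    __asm__ (
        "vmovupd %2, %%ymm1;"                  /* ymm1 = b                      */
        "vmovupd %2, %%ymm2;"                  /* ymm2 = b                      */
        "vmovddup %%ymm2, %%ymm2;"             /* ymm2 = b0r, b0r, b1r, b1r     */
        "vshufpd $15, %%ymm1, %%ymm1, %%ymm1;" /* ymm1 = b0i, b0i, b1i, b1i     */
        "vmulpd %1, %%ymm2, %%ymm2;"           /* ymm2 = a * real parts of b    */
        "vmulpd %1, %%ymm1, %%ymm1;"           /* ymm1 = a * imag parts of b    */
        "vshufpd $5, %%ymm1, %%ymm1, %%ymm1;"  /* swap re/im within each pair   */
        "vaddsubpd %%ymm1, %%ymm2, %%ymm1;"    /* even: subtract, odd: add      */
        "vmovupd %%ymm1, %0;"
        : "=m" (ret)
        : "m" (*a), "m" (*b)
        : "xmm1", "xmm2");                     /* tell GCC these are overwritten */
    return ret;
}
```

With the clobbers in place, GCC will no longer keep live values in those two registers across the asm statement, which is the likely cause of the occasional wrong results at -O3.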

Some other notes:

You’re loading both inputs twice, which is inefficient. There’s no reason to do that.
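I would use "r" (a), "r" (b) as constraints and write my loads like vmovupd (%2), %%ymm1. Probably no difference in the generated code, but it seems more idiomatically correct. If you do this, also add "memory" to the clobber list, because with "r" constraints GCC no longer knows that the asm reads the data the pointers refer to.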
Don’t forget to put a vzeroupper following AVX code before any SSE code is executed to avoid (large) stalls.
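Putting these notes together, a variant might look like the sketch below. It is only one way to do it: the name cmult_once and the struct-based v2complex are assumptions made for illustration, and it assumes x86-64 with AVX. Because the pointers are passed with "r" constraints, it adds a "memory" clobber so GCC knows the asm reads through them. Because vzeroupper zeroes the upper half of every ymm register, it lists all sixteen vector registers as clobbered.

```c
#include <assert.h>
#include <complex.h>

/* Assumption for this sketch: packed layout re0, im0, re1, im1. */
typedef struct v2complex {
    double complex c[2];
} v2complex;

static_assert(sizeof(v2complex) == 4 * sizeof(double),
              "v2complex must be two packed double complex values");

/* Hypothetical name; same arithmetic as the question's cmult. */
v2complex cmult_once(const v2complex *a, const v2complex *b) {
    v2complex ret;
    __asm__ (
        "vmovupd (%1), %%ymm0\n\t"                 /* ymm0 = a, loaded once     */
        "vmovupd (%2), %%ymm1\n\t"                 /* ymm1 = b, loaded once     */
        "vmovddup %%ymm1, %%ymm2\n\t"              /* ymm2 = b0r, b0r, b1r, b1r */
        "vshufpd $15, %%ymm1, %%ymm1, %%ymm1\n\t"  /* ymm1 = b0i, b0i, b1i, b1i */
        "vmulpd %%ymm0, %%ymm2, %%ymm2\n\t"        /* a * real parts of b       */
        "vmulpd %%ymm0, %%ymm1, %%ymm1\n\t"        /* a * imag parts of b       */
        "vshufpd $5, %%ymm1, %%ymm1, %%ymm1\n\t"   /* swap re/im in each pair   */
        "vaddsubpd %%ymm1, %%ymm2, %%ymm1\n\t"     /* even: subtract, odd: add  */
        "vmovupd %%ymm1, %0\n\t"
        "vzeroupper"                               /* clear upper ymm halves    */
        : "=m" (ret)
        : "r" (a), "r" (b)
        : "memory",
          "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15");
    return ret;
}
```

If the surrounding code is itself compiled with AVX enabled, the full clobber list can be costly, so it may be better to leave the vzeroupper placement to the compiler in that case.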

Norbert P. · Answer

I'll add two comments that don't directly answer your question:

I strongly suggest using compiler intrinsics instead of direct assembly. This way the compiler takes care of register allocation and can do a better job of optimizing your code (inlining functions, reordering instructions, etc.).
Agner Fog has a C++ vector class library of optimized vectorized operations, including operations on complex numbers. Even if you might not be able to use his libraries directly in your C code, his optimized code might be a good starting point; see src/special/complexvec.h in the zipped source code.
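To illustrate the intrinsics suggestion, here is a rough sketch of the same two-at-once multiplication written with AVX intrinsics from immintrin.h. The function name cmult_intrin and the struct-based v2complex are assumptions for this example, and the target("avx") attribute is used only so the snippet builds without -mavx on GCC or Clang. It is not taken from Agner Fog's library.

```c
#include <assert.h>
#include <complex.h>
#include <immintrin.h>

/* Assumption for this sketch: packed layout re0, im0, re1, im1. */
typedef struct v2complex {
    double complex c[2];
} v2complex;

static_assert(sizeof(v2complex) == 4 * sizeof(double),
              "v2complex must be two packed double complex values");

/* Hypothetical name; mirrors the instruction sequence of the question's asm. */
__attribute__((target("avx")))
v2complex cmult_intrin(const v2complex *a, const v2complex *b) {
    __m256d va = _mm256_loadu_pd((const double *)a->c);
    __m256d vb = _mm256_loadu_pd((const double *)b->c);

    __m256d b_re = _mm256_movedup_pd(vb);          /* b0r, b0r, b1r, b1r */
    __m256d b_im = _mm256_shuffle_pd(vb, vb, 15);  /* b0i, b0i, b1i, b1i */

    __m256d t_re = _mm256_mul_pd(va, b_re);        /* a * real parts of b */
    __m256d t_im = _mm256_mul_pd(va, b_im);        /* a * imag parts of b */
    t_im = _mm256_shuffle_pd(t_im, t_im, 5);       /* swap re/im in each pair */

    /* Even lanes: t_re - t_im (real parts); odd lanes: t_re + t_im (imag parts). */
    __m256d prod = _mm256_addsub_pd(t_re, t_im);

    v2complex ret;
    _mm256_storeu_pd((double *)ret.c, prod);
    return ret;
}
```

There is no clobber list to get wrong here: the compiler chooses the registers itself and is free to inline the function and schedule it with the surrounding code.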

Tags: c, gcc, assembly, avx, complex-numbers

Jean Nicolas
