
Assembly code/AVX instructions for multiplication of complex numbers. (GCC inline assembly)

We're running a scientific program and we would like to implement AVX features. The whole program (written in Fortran+C) is going to be vectorized and at the moment I'm trying to implement complex number multiplication within GCC inline assembly.

The assembly code takes 4 complex numbers and performs two complex multiplications at once:

v2complex cmult(v2complex *a, v2complex *b) {
    v2complex ret;
    asm (
        "vmovupd %2,%%ymm1;"
        "vmovupd %2, %%ymm2;"
        "vmovddup %%ymm2, %%ymm2;"
        "vshufpd $15,%%ymm1,%%ymm1,%%ymm1;"
        "vmulpd %1, %%ymm2, %%ymm2;"
        "vmulpd %1, %%ymm1, %%ymm1;"
        "vshufpd $5,%%ymm1,%%ymm1, %%ymm1;"
        "vaddsubpd %%ymm1, %%ymm2,%%ymm1;"
        "vmovupd %%ymm1, %0;"
        :
        "=m"(ret)
        :
        "m" (*a),
        "m" (*b)
        );
    return ret;
}

where a and b are 256-bit double precision:

typedef union v2complex {
    __m256d v;
    complex c[2];
} v2complex;

The problem is that the code mostly produces the correct result, but sometimes it fails.

I am very new to assembly, but I tried to figure it out myself. It seems that the C program (compiled with -O3) uses the same ymm registers as the assembly code. For instance, if I printf one of the values (e.g. a) before executing the multiplication, the program never gives wrong results.

My question is how to tell GCC not to interfere with the ymm registers. I did not manage to add the ymm registers to the clobber list.

Jean Nicolas asked Apr 02 '13 14:04


2 Answers

As you surmise, the problem is that you haven’t told GCC which registers you are clobbering. I would be surprised if GCC did not yet support putting YMM registers in the clobber list; what version of GCC are you using?

In any event, it will almost certainly suffice to put the corresponding XMM registers in the clobber list instead:

: "=m" (ret) : "m" (*a), "m" (*b) : "%xmm1", "%xmm2");
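
For context, here is a minimal sketch of the question's function with that clobber list applied. It assumes x86-64 with AVX available at run time. The `v2complex` type below is a simplified stand-in for the question's union (the `__m256d` member is dropped so the sketch builds without `-mavx`), and the clobbers are written without the `%` prefix; both are choices made for this illustration, not part of the original code.

```c
#include <complex.h>

/* Simplified stand-in for the question's v2complex union (assumption:
   the __m256d member is dropped so this builds without -mavx). */
typedef struct v2complex {
    double complex c[2];
} v2complex;

v2complex cmult(v2complex *a, v2complex *b) {
    v2complex ret;
    __asm__ (
        "vmovupd %2, %%ymm1;"                  /* ymm1 = b */
        "vmovupd %2, %%ymm2;"                  /* ymm2 = b */
        "vmovddup %%ymm2, %%ymm2;"             /* ymm2 = real parts of b, duplicated */
        "vshufpd $15, %%ymm1, %%ymm1, %%ymm1;" /* ymm1 = imaginary parts of b, duplicated */
        "vmulpd %1, %%ymm2, %%ymm2;"           /* ymm2 = a * Re(b) */
        "vmulpd %1, %%ymm1, %%ymm1;"           /* ymm1 = a * Im(b) */
        "vshufpd $5, %%ymm1, %%ymm1, %%ymm1;"  /* swap the two halves of each pair */
        "vaddsubpd %%ymm1, %%ymm2, %%ymm1;"    /* subtract in real lanes, add in imaginary lanes */
        "vmovupd %%ymm1, %0;"                  /* store both products */
        : "=m" (ret)
        : "m" (*a), "m" (*b)
        : "xmm1", "xmm2"  /* scratch registers the compiler must not keep live values in */
    );
    return ret;
}
```

With the clobbers declared, the compiler knows not to keep its own values in those two registers across the asm statement, which is the interaction the question describes.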

Some other notes:

  • You’re loading both inputs twice, which is inefficient. There’s no reason to do that.
  • I would use "r" (a), "r" (b) as constraints and write my loads like vmovupd (%2), %%ymm1. Probably no difference in the generated code, but it seems more idiomatically correct.
  • Don’t forget to put a vzeroupper following AVX code before any SSE code is executed to avoid (large) stalls.
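
The three notes above could be combined roughly as follows. This is a sketch under several assumptions, not a definitive version: the function name `cmult_once` and the simplified `v2complex` struct are hypothetical, x86-64 is assumed, a `"memory"` clobber is added because with `"r"` constraints the compiler no longer sees which memory the asm reads, and all of `xmm0` to `xmm15` are listed as clobbers as a conservative choice because `vzeroupper` zeroes the upper halves of every YMM register.

```c
#include <complex.h>

/* Simplified stand-in for the question's v2complex union (assumption). */
typedef struct v2complex {
    double complex c[2];
} v2complex;

/* Hypothetical variant: each input is loaded once via a pointer register. */
v2complex cmult_once(const v2complex *a, const v2complex *b) {
    v2complex ret;
    __asm__ (
        "vmovupd (%1), %%ymm0;"                /* ymm0 = a, loaded once */
        "vmovupd (%2), %%ymm1;"                /* ymm1 = b, loaded once */
        "vmovddup %%ymm1, %%ymm2;"             /* ymm2 = real parts of b, duplicated */
        "vshufpd $15, %%ymm1, %%ymm1, %%ymm1;" /* ymm1 = imaginary parts of b, duplicated */
        "vmulpd %%ymm0, %%ymm2, %%ymm2;"       /* ymm2 = a * Re(b) */
        "vmulpd %%ymm0, %%ymm1, %%ymm1;"       /* ymm1 = a * Im(b) */
        "vshufpd $5, %%ymm1, %%ymm1, %%ymm1;"  /* swap the two halves of each pair */
        "vaddsubpd %%ymm1, %%ymm2, %%ymm1;"    /* subtract in real lanes, add in imaginary lanes */
        "vmovupd %%ymm1, %0;"                  /* store both products */
        "vzeroupper;"                          /* clear upper YMM halves before any SSE code */
        : "=m" (ret)
        : "r" (a), "r" (b)
        : "memory",  /* the asm dereferences a and b itself */
          "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15"
    );
    return ret;
}
```

Keeping `"m"` input operands instead of `"r"` would avoid the `"memory"` clobber, at the cost of the loads looking less explicit in the template.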
Stephen Canon answered Sep 23 '22 19:09


I add two comments, not directly answering your question:

  • I strongly suggest using compiler intrinsics instead of direct assembly. This way the compiler takes care of the register allocation and can do a better job of optimizing your code (inlining functions, reordering instructions, etc.)
  • Agner Fog has a C++ vector class library of optimized vectorized operations, including operations on complex numbers. Even if you might not be able to use his libraries directly in your C code, his optimized code might be a good starting point; see src/special/complexvec.h in the zipped source code.
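
To illustrate the first point, the same instruction sequence as in the question can be written with AVX intrinsics from `<immintrin.h>`. This is only a sketch: the function name `cmult_intrin`, the pointer-based signature, and the simplified `v2complex` struct are assumptions made for the example, and the `target("avx")` attribute is used so that it builds with GCC or Clang even without `-mavx` on the command line.

```c
#include <complex.h>
#include <immintrin.h>

/* Simplified stand-in for the question's v2complex union (assumption). */
typedef struct v2complex {
    double complex c[2];
} v2complex;

/* Hypothetical intrinsics version; the compiler picks the registers. */
__attribute__((target("avx")))
void cmult_intrin(const v2complex *a, const v2complex *b, v2complex *out) {
    /* Each vector holds [z0.re, z0.im, z1.re, z1.im]. */
    __m256d va = _mm256_loadu_pd((const double *)a->c);
    __m256d vb = _mm256_loadu_pd((const double *)b->c);

    __m256d b_re = _mm256_movedup_pd(vb);          /* real parts of b, duplicated */
    __m256d b_im = _mm256_shuffle_pd(vb, vb, 15);  /* imaginary parts of b, duplicated */

    __m256d t1 = _mm256_mul_pd(va, b_re);          /* a * Re(b) */
    __m256d t2 = _mm256_mul_pd(va, b_im);          /* a * Im(b) */
    t2 = _mm256_shuffle_pd(t2, t2, 5);             /* swap the two halves of each pair */

    /* Subtract in the real lanes, add in the imaginary lanes. */
    _mm256_storeu_pd((double *)out->c, _mm256_addsub_pd(t1, t2));
}
```

No clobber list is needed here, since the compiler allocates the vector registers itself and can inline the function into its callers.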
Norbert P. answered Sep 23 '22 19:09