Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient complex arithmetic in x86 assembly for a Mandelbrot loop

Consider the following program:

for i=1 to 10000000 do
  z <- z*z + c

where z and c are complex numbers.

What are efficient x86 assembler implementations of this program using x87 vs SSE and single vs double precision arithmetic?

EDIT I know I can write this in another language and trust the compiler to generate optimal machine code for me but I am doing this to learn how to write optimal x86 assembler myself. I have already looked at the code generated by gcc -O2 and my guess is that there is a lot of room for improvement but I am not adept enough to write optimal x86 assembler by hand myself so I am asking for help here.

like image 351
J D Avatar asked Apr 26 '12 08:04

J D


3 Answers

Look at the disassembly from your favorite compiler. If you're looking to perform this computation for several values of z and c (like calculating a mandelbrot image) I suggest you work on four values at once and put these in SSE registers. If you look at the code in Paul R's answer you could do all these calculations for four values at once:

__m128 z_im, z_re, c_im, c_re; //Four z and c values packed
__m128 re = _mm_sub_ps(_mm_mul_ps(z_re, z_re), _mm_mul_ps(z_im, z_im));
__m128 im = _mm_mul_ps(z_re, z_im);
im = _mm_add_ps(im, im); // Multiply by two
z_re = _mm_add_ps(re, c_re);
z_im = _mm_add_ps(im, c_im);
like image 116
Andreas Brinck Avatar answered Oct 21 '22 18:10

Andreas Brinck


Z = Z*Z + C

That is the mandelbrot fractal iteration.

I'm sure you'll find highly optimized code for this all over the net. I would start at the sourcecode of Xaos and Fractint.

  • Xaos: http://wmi.math.u-szeged.hu/xaos
  • fractint: http://www.fractint.org/
like image 36
Nils Pipenbrinck Avatar answered Oct 21 '22 19:10

Nils Pipenbrinck


You don't need to do this in assembler per se - you can use SSE via intrinsics for an efficient implementation, particularly if you can use single precision.

temp.re = z.re * z.re - z.im * z.im;
temp.im = 2.0 * z.re * z.im;
z.re = temp.re + c.re;
z.im = temp.im + c.im;

If you shuffle your input vectors appropriately then you can get all the multiplies in one instruction (_mm_mul_ps) and the adds in a second instruction (_mm_hadd_ps).

If you need double precision then the same general principle applies but you'll need two multiplies and two horizontal adds.

Note that most modern x86 CPUs have two scalar FPUs so the benefit for double precision in SSE may not be worthwhile - single precision however should definitely be a win.


Here's an initial working implementation using SSE - I think it is more or less debugged now - performance is not much better than scalar code compiled with gcc -O3 though, as gcc does a pretty good job of generating SSE code for this:

static Complex loop_simd(const Complex z0, const Complex c, const int n)
{
    __m128 vz = _mm_set_ps(z0.im, z0.re, z0.im, z0.re);
    const __m128 vc = _mm_set_ps(0.0f, 0.0f, c.im, c.re);
    const __m128 vs = _mm_set_ps(0.0f, 0.0f, -0.0f, 0.0f);
    Complex z[2];
    int i;

    for (i = 0; i < n; ++i)
    {
        __m128 vtemp;

        vtemp = _mm_shuffle_ps(vz, vz, 0x16); // temp = { z.re, z.im, z.im, z.re }
        vtemp = _mm_xor_ps(vtemp, vs);        // temp = { z.re, -z.im, z.im, z.re }
        vtemp = _mm_mul_ps(vtemp, vz);        // temp = { z.re * z.re, - z.im * z.im, z.re * z.im, z.im * z.re }
        vtemp = _mm_hadd_ps(vtemp, vtemp);    // temp = { z.re * z.re - z.im * z.im, 2 * z.re * z.im, ... }
        vz = _mm_add_ps(vtemp, vc);           // temp = { z.re * z.re - z.im * z.im + c.re, 2 * z.re * z.im + c.im, ... }
    }
    _mm_storeu_ps(&z[0].re, vz);
    return z[0];
}

Note that the inner loop is just 6 SSE instructions (it really ought to be 5) + a little housekeeping for the loop itself:

L4:
    movaps  %xmm0, %xmm1
    shufps  $22, %xmm0, %xmm1
    xorps   %xmm3, %xmm1
    mulps   %xmm1, %xmm0
    haddps  %xmm0, %xmm0
    addps   %xmm2, %xmm0
    incl    %eax
    cmpl    %edi, %eax
    jne L4
L2:
like image 7
Paul R Avatar answered Oct 21 '22 18:10

Paul R