Efficient complex arithmetic in x86 assembly for a Mandelbrot loop

Question

Consider the following program:

for i=1 to 10000000 do
  z <- z*z + c

where z and c are complex numbers.

What are efficient x86 assembler implementations of this program using x87 vs SSE and single vs double precision arithmetic?

EDIT: I know I can write this in another language and trust the compiler to generate optimal machine code for me, but I am doing this to learn how to write optimal x86 assembler myself. I have already looked at the code generated by gcc -O2, and my guess is that there is a lot of room for improvement, but I am not adept enough to write optimal x86 assembler by hand, so I am asking for help here.

Andreas Brinck · Accepted Answer

Look at the disassembly from your favorite compiler. If you want to perform this computation for several values of z and c (as when calculating a Mandelbrot image), I suggest you work on four values at once and put them in SSE registers. Taking the code in Paul R's answer, you could do all these calculations for four values at once:

__m128 z_im, z_re, c_im, c_re; //Four z and c values packed
__m128 re = _mm_sub_ps(_mm_mul_ps(z_re, z_re), _mm_mul_ps(z_im, z_im));
__m128 im = _mm_mul_ps(z_re, z_im);
im = _mm_add_ps(im, im); // Multiply by two
z_re = _mm_add_ps(re, c_re);
z_im = _mm_add_ps(im, c_im);
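A minimal sketch of how the snippet above might be wrapped into a complete routine. The function name mandel_step4 and the layout (four real parts and four imaginary parts in separate float arrays) are assumptions for illustration, not something from the answer. It only uses SSE1 intrinsics, so it should build on any x86-64 compiler without extra flags.

```c
#include <xmmintrin.h> /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, ... */

/* Hypothetical helper: advance four independent points by n iterations of
   z <- z*z + c. Each array holds one component for the four points. */
static void mandel_step4(float z_re[4], float z_im[4],
                         const float c_re[4], const float c_im[4], int n)
{
    __m128 zr = _mm_loadu_ps(z_re);
    __m128 zi = _mm_loadu_ps(z_im);
    const __m128 cr = _mm_loadu_ps(c_re);
    const __m128 ci = _mm_loadu_ps(c_im);

    for (int i = 0; i < n; ++i)
    {
        /* re = z_re^2 - z_im^2, im = 2 * z_re * z_im, for all four lanes */
        __m128 re = _mm_sub_ps(_mm_mul_ps(zr, zr), _mm_mul_ps(zi, zi));
        __m128 im = _mm_mul_ps(zr, zi);
        im = _mm_add_ps(im, im);
        zr = _mm_add_ps(re, cr);
        zi = _mm_add_ps(im, ci);
    }

    _mm_storeu_ps(z_re, zr);
    _mm_storeu_ps(z_im, zi);
}
```

Because every lane holds a different point, no shuffles or horizontal adds are needed. A real Mandelbrot renderer would also need a per-lane escape test, which this sketch leaves out.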

Nils Pipenbrinck · Answer

Z = Z*Z + C

That is the Mandelbrot fractal iteration.

I'm sure you'll find highly optimized code for this all over the net. I would start with the source code of Xaos and Fractint.

Xaos: http://wmi.math.u-szeged.hu/xaos
fractint: http://www.fractint.org/

Paul R · Answer

You don't need to do this in assembler per se - you can use SSE via intrinsics for an efficient implementation, particularly if you can use single precision.

temp.re = z.re * z.re - z.im * z.im;
temp.im = 2.0 * z.re * z.im;
z.re = temp.re + c.re;
z.im = temp.im + c.im;
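For reference, the four lines above could be put into a complete scalar routine roughly like this. The ComplexD type and the name loop_scalar are hypothetical, chosen only for this sketch. A scalar version like this is also handy as a baseline to check any SIMD variant against.

```c
/* Hypothetical type for this sketch: a complex number as two doubles. */
typedef struct { double re, im; } ComplexD;

/* Iterate z <- z*z + c n times and return the final z. */
static ComplexD loop_scalar(ComplexD z, const ComplexD c, const int n)
{
    for (int i = 0; i < n; ++i)
    {
        /* Compute both parts from the old z before overwriting either one. */
        const double re = z.re * z.re - z.im * z.im;
        const double im = 2.0 * z.re * z.im;
        z.re = re + c.re;
        z.im = im + c.im;
    }
    return z;
}
```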

If you shuffle your input vectors appropriately then you can get all the multiplies in one instruction (_mm_mul_ps) and the adds in a second instruction (_mm_hadd_ps).

If you need double precision then the same general principle applies but you'll need two multiplies and two horizontal adds.
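One possible arrangement for double precision, as a sketch only. The name loop_simd_double and the two-element array layout are assumptions. This version uses two multiplies but gets by with a single _mm_hadd_pd, because that instruction takes two source operands. The target attribute is a GCC/Clang extension that lets the function use SSE3 without a global -msse3; with other compilers, drop it and enable SSE3 in the build instead.

```c
#include <pmmintrin.h> /* SSE3: _mm_hadd_pd (also pulls in the SSE2 intrinsics) */

/* Hypothetical helper: z[0] = real part, z[1] = imaginary part.
   Iterates z <- z*z + c n times in place. */
__attribute__((target("sse3")))
static void loop_simd_double(double z[2], const double c[2], int n)
{
    __m128d vz = _mm_loadu_pd(z);                   /* { re, im } */
    const __m128d vc = _mm_loadu_pd(c);             /* { c.re, c.im } */
    const __m128d vs = _mm_set_pd(-0.0, 0.0);       /* sign-bit mask for the high lane */

    for (int i = 0; i < n; ++i)
    {
        __m128d vneg = _mm_xor_pd(vz, vs);          /* { re, -im } */
        __m128d vswp = _mm_shuffle_pd(vz, vz, 1);   /* { im, re } */
        __m128d vsq  = _mm_mul_pd(vz, vneg);        /* { re*re, -im*im } */
        __m128d vcr  = _mm_mul_pd(vz, vswp);        /* { re*im, im*re } */
        /* hadd gives { re*re - im*im, 2*re*im }; then add c */
        vz = _mm_add_pd(_mm_hadd_pd(vsq, vcr), vc);
    }

    _mm_storeu_pd(z, vz);
}
```

Only one complex value fits in a register here, so this shows the instruction pattern rather than a guaranteed speedup over scalar code.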

Note that most modern x86 CPUs have two scalar FPUs, and an SSE register holds only two doubles, so the benefit of SSE for double precision may not be worthwhile. Single precision, with four floats per register, should definitely be a win.

Here's an initial working implementation using SSE. I think it is more or less debugged now. Performance is not much better than scalar code compiled with gcc -O3, though, as gcc does a pretty good job of generating SSE code for this:

#include <pmmintrin.h> // SSE3, needed for _mm_hadd_ps

typedef struct { float re, im; } Complex; // assumed layout; the declaration is not shown in the answer

static Complex loop_simd(const Complex z0, const Complex c, const int n)
{
    __m128 vz = _mm_set_ps(z0.im, z0.re, z0.im, z0.re);     // vz = { z.re, z.im, z.re, z.im }
    const __m128 vc = _mm_set_ps(c.im, c.re, c.im, c.re);   // c in both halves, so both copies of z stay equal
    const __m128 vs = _mm_set_ps(0.0f, 0.0f, -0.0f, 0.0f);  // sign-bit mask for element 1
    Complex z[2];
    int i;

    for (i = 0; i < n; ++i)
    {
        __m128 vtemp;

        vtemp = _mm_shuffle_ps(vz, vz, 0x16); // temp = { z.re, z.im, z.im, z.re }
        vtemp = _mm_xor_ps(vtemp, vs);        // temp = { z.re, -z.im, z.im, z.re }
        vtemp = _mm_mul_ps(vtemp, vz);        // temp = { z.re * z.re, - z.im * z.im, z.im * z.re, z.re * z.im }
        vtemp = _mm_hadd_ps(vtemp, vtemp);    // temp = { z.re * z.re - z.im * z.im, 2 * z.re * z.im, ... }
        vz = _mm_add_ps(vtemp, vc);           // vz = { z.re * z.re - z.im * z.im + c.re, 2 * z.re * z.im + c.im, ... }
    }
    _mm_storeu_ps(&z[0].re, vz);              // writes all four floats; z[0] holds the result
    return z[0];
}

Note that the inner loop is just 6 SSE instructions (it really ought to be 5) + a little housekeeping for the loop itself:

L4:
    movaps  %xmm0, %xmm1
    shufps  $22, %xmm0, %xmm1
    xorps   %xmm3, %xmm1
    mulps   %xmm1, %xmm0
    haddps  %xmm0, %xmm0
    addps   %xmm2, %xmm0
    incl    %eax
    cmpl    %edi, %eax
    jne L4
L2:

Tags: c, x86, assembly, sse, complex-numbers

J D
