Enabling arch:SSE2 makes program slower

Question

On Visual Studio 2010, when I enable the enhanced instruction set option (/arch:SSE2) for the following code, the execution time actually increases.

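#include <cstdlib>   // std::rand, system
#include <ctime>     // clock, clock_t, difftime, CLOCKS_PER_SEC

void add(float * input1, float * input2, float * output, int size)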
{
    for(int iter = 0; iter < size; iter++)
    {
        output[iter] = input1[iter] * input2[iter];
    }
}

int main()
{

    const int SIZE = 10000000;
    float *in1 = new float[SIZE];
    float *in2 = new float[SIZE];
    float *out = new float[SIZE];
    for(int iter = 0; iter < SIZE; iter++)
    {
        in1[iter] = std::rand();
        in2[iter] = std::rand();
        out[iter] = std::rand();
    }
    clock_t start = clock();
    for(int iter = 0; iter < 100; iter++)
    {
        add(in1, in2, out, SIZE);
    }
    clock_t end = clock();
    double time = difftime(end,start)/(double)CLOCKS_PER_SEC;

    system("PAUSE");
    return 0;
}

I am consistently getting about 2.0 seconds for the time variable with SSE2 enabled, but about 1.7 seconds when it is "Not Set". I am building on Windows 7 64-bit with VS 2010 Professional, Release configuration, optimized for speed.

Is there any explanation for why enabling SSE causes longer execution time?

ComicSansMS · Accepted Answer

SSE code has an overhead for moving values into and out of the SSE registers. This overhead may outweigh the performance benefits of SSE if you are only doing very few, simple calculations, as is the case in your example.

Also note that this overhead becomes significantly larger if your data is not 16-byte aligned.
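To find out whether alignment plays a role here, you can check the addresses that new[] returns and, if necessary, allocate the buffers on a 16-byte boundary instead. The following is a minimal sketch, assuming an x86 compiler that exposes _mm_malloc and _mm_free through <xmmintrin.h>. The helper names is_aligned_16 and alloc_aligned_floats are made up for illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>      // std::size_t, NULL
#include <stdint.h>     // uintptr_t
#include <xmmintrin.h>  // _mm_malloc, _mm_free

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p)
{
    return (reinterpret_cast<uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: allocate `count` floats on a 16-byte boundary.
// Returns NULL on failure. Release the memory with _mm_free, not delete[].
inline float* alloc_aligned_floats(std::size_t count)
{
    float* p = static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
    assert(p == NULL || is_aligned_16(p));
    return p;
}
```

In the question's main, this would replace the three new float[SIZE] allocations, for example float *in1 = alloc_aligned_floats(SIZE);, with a matching _mm_free(in1); at the end.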

Walter · Answer

IMO, it is often not a good idea to rely on the compiler to perform these optimisations. Your code should be able to run faster with SSE (unless the compiler already vectorises it for you, which does not seem to be the case). I suggest that you:

1. Make sure your arrays are 16-byte aligned.

2. Use SSE intrinsics in your inlined add function:

#include <xmmintrin.h>
inline void add(const float * input1, const float * input2, float * output, int size)
{
   // assuming here that 
   // - all 3 arrays are 16-byte aligned
   // - size is a multiple of 4
   for(int iter = 0; iter < size; iter += 4)
     _mm_store_ps( output+iter, _mm_mul_ps( _mm_load_ps(input1+iter),
                                            _mm_load_ps(input2+iter) ) );
}

If this does not produce faster code, then the loading and storing do indeed create too much overhead for a single multiplication.
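The version above relies on two assumptions: 16-byte alignment and a size that is a multiple of 4. If either cannot be guaranteed, one possible variant uses the unaligned load and store intrinsics and finishes the last few elements with a scalar loop. This is only a sketch, and the name mul_any is hypothetical. Unaligned accesses can be slower than aligned ones on some processors, so it is worth timing both.

```cpp
#include <cassert>
#include <xmmintrin.h>  // _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical variant: no alignment or size requirements on the arrays.
inline void mul_any(const float * input1, const float * input2, float * output, int size)
{
    assert(size >= 0);
    int iter = 0;
    // Vector part: four floats per step, using unaligned loads and stores.
    for(; iter + 4 <= size; iter += 4)
        _mm_storeu_ps( output+iter, _mm_mul_ps( _mm_loadu_ps(input1+iter),
                                                _mm_loadu_ps(input2+iter) ) );
    // Scalar tail: the remaining 0 to 3 elements.
    for(; iter < size; ++iter)
        output[iter] = input1[iter] * input2[iter];
}
```

Comparing its output against the plain loop from the question on a small array is a quick way to confirm that both paths compute the same products.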

Tags: c++, sse

contrapsych
