How much effort do you have to put in to get gains from using SSE?

Question

Case One

Say you have a little class:

class Point3D
{
private:
  float x,y,z;
public:
  Point3D &operator+=(const Point3D &other);

  // ...etc
};

Point3D &Point3D::operator+=(const Point3D &other)
{
  this->x += other.x;
  this->y += other.y;
  this->z += other.z;
  return *this;
}

A naive use of SSE would simply replace these function bodies with a few intrinsics. But would we expect this to make much difference? IIRC, MMX used to involve costly state changes; does SSE have the same problem, or are SSE instructions just like any other instructions? And even if there's no direct "use SSE" overhead, would moving the values into SSE registers and back out again really make it any faster?
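To make "a few intrinsics" concrete, here is a minimal sketch of what the naive version might look like. It is not the original class: `Point3DSse`, the array member `v`, and the fourth padding float are assumptions, added so that one object fills a 128-bit register and can use aligned loads. Whether this is any faster than the scalar code is exactly the open question.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Hypothetical padded variant of Point3D. v[0..2] hold x, y, z;
// v[3] is padding (an assumption, not part of the original class).
struct Point3DSse
{
  alignas(16) float v[4];

  Point3DSse &operator+=(const Point3DSse &other)
  {
    __m128 a = _mm_load_ps(v);        // aligned load of this point
    __m128 b = _mm_load_ps(other.v);  // aligned load of the other point
    _mm_store_ps(v, _mm_add_ps(a, b)); // one packed add, then store back
    return *this;
  }
};
```

Each call still pays for two loads and one store around a single packed add, which is the register traffic the question is worried about.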

Case Two

Instead, you're working with a less OO-based code base. Rather than an array/vector of Point3D objects, you simply have a big array of floats:

float coordinateData[NUM_POINTS*3];

void add(int i,int j) //yes it's unsafe, no overlap check... example only
{
  for (int x=0;x<3;++x)
  {
    coordinateData[i*3+x] += coordinateData[j*3+x];
  }
}

What about use of SSE here? Any better?

In conclusion

Is trying to optimise single vector operations using SSE actually worthwhile, or is it really only valuable when doing bulk operations?

Paul R · Accepted Answer

In general you will need to take additional steps to get the best out of SSE (or any other SIMD architecture):

data needs to be 16 byte aligned (ideally)
data needs to be contiguous
you need enough data to make the SIMD operation worthwhile
you need to coalesce as many operations as you can to mitigate the costs of loads/stores
you need to be aware of the cache/memory hierarchy and its performance impact (e.g. use strip-mining/tiling)
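As a rough illustration of the first four points (alignment, contiguity, enough data, coalesced operations), here is a minimal sketch of a bulk operation. The function name `scale_and_add` and its signature are hypothetical, and it assumes the caller provides 16-byte-aligned, non-overlapping arrays. It does not attempt any strip-mining or tiling.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Computes dst[i] = dst[i] * scale + src[i] for i in [0, n).
// Assumption: dst and src are 16-byte aligned and do not overlap.
void scale_and_add(float *dst, const float *src, float scale, std::size_t n)
{
  const __m128 s = _mm_set1_ps(scale);  // broadcast scale to all four lanes
  std::size_t i = 0;

  // Main loop: four floats per iteration. The multiply and the add share
  // one load/store round trip instead of making two passes over memory.
  for (; i + 4 <= n; i += 4)
  {
    __m128 d = _mm_load_ps(dst + i);  // aligned loads
    __m128 v = _mm_load_ps(src + i);
    _mm_store_ps(dst + i, _mm_add_ps(_mm_mul_ps(d, s), v));
  }

  // Scalar tail for the remaining n % 4 elements.
  for (; i < n; ++i)
  {
    dst[i] = dst[i] * scale + src[i];
  }
}
```

Arrays declared with `alignas(16)` satisfy the alignment assumption; for data of unknown alignment, `_mm_loadu_ps` and `_mm_storeu_ps` would be the safer choice.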

Andrey · Answer

It is valuable if your case is that you do a lot of the same calculation over a range of data. For example, say you need the square roots of many values: you can load 4 values into an SSE register and issue the operation once for all of them. In the ideal case this gives up to a 4x speedup for that operation.
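A minimal sketch of the four-at-a-time square root described above. The helper name `sqrt4` is hypothetical, and it assumes both pointers reference at least four floats.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Writes the square roots of in[0..3] to out[0..3].
void sqrt4(const float *in, float *out)
{
  __m128 v = _mm_loadu_ps(in);         // unaligned load of four floats
  _mm_storeu_ps(out, _mm_sqrt_ps(v));  // four square roots in one instruction
}
```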

There are also libraries that already have SSE optimizations built in. Don't reinvent the wheel.

nsanders · Answer

This Gamasutra article shows what it takes to make fast SSE-based code. It covers your "Case 1" in detail.

The source code is available from the author's homepage.

Also, the slides + text for SIMD at Insomniac Games (GDC 2015) discuss why using a SIMD vector to hold a single x,y,z,(padding) geometry vector is not efficient: you'll need horizontal shuffles and a scalar sqrt to do things like the length of a vector, sqrt(sum of squares), compared to doing 4 lengths in parallel from 3 vectors of x0,x1,x2,x3, y0-3, z0-3.
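The "4 lengths in parallel" layout can be sketched as follows. This is only an illustration of the idea, not code from the slides: the function name `length4` and the separate `xs`/`ys`/`zs` arrays are assumptions, and each pointer must reference at least four floats.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Lengths of four 3D vectors stored as structure-of-arrays:
// xs = x0..x3, ys = y0..y3, zs = z0..z3.
void length4(const float *xs, const float *ys, const float *zs, float *out)
{
  __m128 x = _mm_loadu_ps(xs);
  __m128 y = _mm_loadu_ps(ys);
  __m128 z = _mm_loadu_ps(zs);

  // Sum of squares per lane; no horizontal shuffles are needed because
  // each lane belongs to a different vector.
  __m128 sum = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                          _mm_mul_ps(z, z));

  _mm_storeu_ps(out, _mm_sqrt_ps(sum));  // four lengths at once
}
```

With a single x,y,z,(padding) vector per register, the same computation would have to add lanes within one register, which is where the horizontal shuffles come from.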

See also other links in the SSE tag wiki.


Tags: c++, sse


Mr. Boy
