Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How much speed-up from converting 3D maths to SSE or other SIMD?

I am using 3D maths in my application extensively. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or a similar SIMD code?

like image 236
Suma Avatar asked Sep 22 '08 14:09

Suma


People also ask

Is SIMD faster?

Using SIMD, we could put both a and b into 128 bit registers, add them together in a single instruction, and then copy the resulting 128 bits into c . That'd be much faster! Here, we have two versions of the function: one which uses AVX2, a specific kind of SIMD feature that lets you do 256-bit operations.

What is SIMD c++?

One approach to leverage vector hardware are SIMD intrinsics, available in all modern C or C++ compilers. SIMD stands for “single Instruction, multiple data”. SIMD instructions are available on many platforms, there's a high chance your smartphone has it too, through the architecture extension ARM NEON.

What is SIMD optimization?

Vectorization or SIMD is a class of parallel computers in Flynn's taxonomy, which refers to computers with multiple processing elements that perform the same operation on multiple data points simultaneously.

What is SIMD vectorization?

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).


5 Answers

In my experience I typically see about a 3x improvement in taking an algorithm from x87 to SSE, and a better than 5x improvement in going to VMX/Altivec (because of complicated issues having to do with pipeline depth, scheduling, etc). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not for those where I'm doing one vector at a time ad hoc.

like image 180
Crashworks Avatar answered Oct 24 '22 06:10

Crashworks


That's not the whole story, but it's possible to get further optimizations using SIMD, have a look at Miguel's presentation about when he implemented SIMD instructions with MONO which he held at PDC 2008,

SIMD beats doubles' ass in this particular configuration.
(source: tirania.org)

Picture from Miguel's blog entry.

like image 38
Henrik Avatar answered Oct 24 '22 07:10

Henrik


For some very rough numbers: I've heard some people on ompf.org claim 10x speed ups for some hand-optimized ray tracing routines. I've also had some good speed ups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.

However, I should add that your algorithms should be designed for data-parallel execution. This means that if you have a generic math library as you've mentioned then it should take packed vectors rather than individual vectors or you'll just be wasting your time.

E.g. Something like

namespace SIMD {
class PackedVec4d
{
  __m128 x;
  __m128 y;
  __m128 z;
  __m128 w;

  //...
};
}

Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.

like image 45
Rehno Lindeque Avatar answered Oct 24 '22 06:10

Rehno Lindeque


For 3D operations beware of un-initialized data in your W component. I've seen cases where SSE ops (_mm_add_ps) would take 10x normal time because of bad data in W.

like image 44
Brian Hayes Avatar answered Oct 24 '22 06:10

Brian Hayes


The answer highly depends on what the library is doing and how it is used.

The gains can go from a few percent points, to "several times faster", the areas most susceptible of seeing gains are those where you're not dealing with isolated vectors or values, but multiple vectors or values that have to be processed in the same way.

Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors being processed.

The domains where gains can be the most drastic are probably those of image and signal processing, computational simulations, as well general 3D maths operation on meshes (rather than isolated vectors).

like image 38
Eric Grange Avatar answered Oct 24 '22 05:10

Eric Grange