I am using 3D maths extensively in my application. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or similar SIMD code?
Suppose you are adding two sets of four floats, a and b, and storing the result in c. Using SIMD, we could put both a and b into 128-bit registers, add them together in a single instruction, and then copy the resulting 128 bits into c. That'd be much faster! AVX2, a more recent SIMD extension, even lets you do 256-bit operations, so the same idea handles twice as much data per instruction.
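For illustration, here is a minimal sketch of what such a function could look like (the names add_scalar, add_sse and add_avx are just for this example, not from the original answer): a plain scalar version, a 128-bit SSE version that does the add in one instruction, and a 256-bit AVX version that handles eight floats at a time.

#include <xmmintrin.h>   // SSE intrinsics
#include <immintrin.h>   // AVX intrinsics

// Scalar version: four separate additions.
void add_scalar(const float a[4], const float b[4], float c[4]) {
    for (int i = 0; i < 4; ++i)
        c[i] = a[i] + b[i];
}

// SSE version: load a and b into 128-bit registers, add all four lanes
// with a single instruction, and store the 128-bit result into c.
void add_sse(const float a[4], const float b[4], float c[4]) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));
}

// AVX version: the same idea with 256-bit registers, eight floats at once.
void add_avx(const float a[8], const float b[8], float c[8]) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(c, _mm256_add_ps(va, vb));
}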
One approach to leveraging vector hardware is SIMD intrinsics, available in all modern C and C++ compilers. SIMD stands for "Single Instruction, Multiple Data". SIMD instructions are available on many platforms; there's a high chance your smartphone has them too, through the ARM NEON architecture extension.
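As a rough sketch of how portable this is, the same four-float add can be written with intrinsics on either platform (the intrinsic names are real SSE/NEON calls; the preprocessor structure and the add4 name are just illustrative assumptions):

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
void add4(const float* a, const float* b, float* c) {
    _mm_storeu_ps(c, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));   // x86 SSE
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
void add4(const float* a, const float* b, float* c) {
    vst1q_f32(c, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));              // ARM NEON
}
#endif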
SIMD, or vectorization, is one of the classes of parallel computation in Flynn's taxonomy; it refers to computers with multiple processing elements that perform the same operation on multiple data points simultaneously.
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).
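As a sketch of what that conversion looks like in practice (scale_scalar and scale_sse are hypothetical names for this example), here is a loop moved from one float per iteration to four per iteration with SSE:

#include <xmmintrin.h>
#include <cstddef>

// One value per iteration.
void scale_scalar(float* data, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= s;
}

// Four values per iteration, plus a scalar tail for the leftover
// elements when n is not a multiple of 4.
void scale_sse(float* data, std::size_t n, float s) {
    __m128 vs = _mm_set1_ps(s);              // broadcast s into all four lanes
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_mul_ps(v, vs));
    }
    for (; i < n; ++i)                       // remaining 0-3 elements
        data[i] *= s;
}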
In my experience I typically see about a 3x improvement in taking an algorithm from x87 to SSE, and a better than 5x improvement in going to VMX/Altivec (because of complicated issues having to do with pipeline depth, scheduling, etc). But I usually only do this in cases where I have hundreds or thousands of numbers to operate on, not for those where I'm doing one vector at a time ad hoc.
That's not the whole story, but it's possible to get further optimizations using SIMD. Have a look at Miguel's PDC 2008 presentation about implementing SIMD instructions in Mono:
(Image from Miguel's blog entry; source: tirania.org)
For some very rough numbers: I've heard some people on ompf.org claim 10x speed ups for some hand-optimized ray tracing routines. I've also had some good speed ups. I estimate I got somewhere between 2x and 6x on my routines depending on the problem, and many of these had a couple of unnecessary stores and loads. If you have a huge amount of branching in your code, forget about it, but for problems that are naturally data-parallel you can do quite well.
However, I should add that your algorithms should be designed for data-parallel execution. This means that if you have a generic math library as you've mentioned, then it should take packed vectors rather than individual vectors, or you'll just be wasting your time.
E.g. Something like
namespace SIMD {
    // Structure-of-arrays layout: each __m128 holds the same component of
    // four different vectors, so one SSE instruction operates on four vectors.
    class PackedVec4d {
        __m128 x;
        __m128 y;
        __m128 z;
        __m128 w;
        //...
    };
}
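To sketch how such a packed type pays off, here is a self-contained variant (using a struct so the members are public; the PackedVec4f name and the operator+ are assumptions for this example, not part of the class above). Each SSE instruction adds the same component of four vectors at once.

#include <xmmintrin.h>

struct PackedVec4f {
    __m128 x, y, z, w;   // x holds the x components of four vectors, and so on
};

// Adds four (x, y, z, w) vectors with just four SSE instructions.
inline PackedVec4f operator+(const PackedVec4f& a, const PackedVec4f& b) {
    return { _mm_add_ps(a.x, b.x),
             _mm_add_ps(a.y, b.y),
             _mm_add_ps(a.z, b.z),
             _mm_add_ps(a.w, b.w) };
}

The point of the structure-of-arrays layout is that the data arrives already packed: if you only ever add one vector to one other vector, most of the SIMD lanes sit idle and the shuffling in and out of registers eats the gain.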
Most problems where performance matters can be parallelized since you'll most likely be working with a large dataset. Your problem sounds like a case of premature optimization to me.
For 3D operations, beware of uninitialized data in your W component. I've seen cases where SSE ops (_mm_add_ps) would take 10x the normal time because of bad data in W.
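A minimal sketch of the precaution (load_point3 is a hypothetical helper): when packing a 3D point into a __m128, write a known value into W rather than leaving whatever happens to be in memory, since a denormal or NaN in the unused lane can make operations like _mm_add_ps dramatically slower on some CPUs.

#include <xmmintrin.h>

// Pack a 3-component point into a 128-bit register with W forced to 0.0f.
// _mm_set_ps takes its arguments from the highest lane down, i.e. (w, z, y, x).
__m128 load_point3(const float* p) {
    return _mm_set_ps(0.0f, p[2], p[1], p[0]);
}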
The answer highly depends on what the library is doing and how it is used.
The gains can range from a few percentage points to "several times faster". The areas most likely to see gains are those where you're not dealing with isolated vectors or values, but with multiple vectors or values that have to be processed in the same way.
Another area is when you're hitting cache or memory limits, which, again, requires a lot of values/vectors to be processed.
The domains where gains can be the most drastic are probably those of image and signal processing, computational simulations, as well as general 3D maths operations on meshes (rather than isolated vectors).