
C++: how to write code the compiler can easily optimize for SIMD?

I'm working in Visual Studio 2008, and in the project settings I see the option "Enable Enhanced Instruction Set", which I can set to None, SSE, or SSE2.

So the compiler will try to batch instructions together to make use of SIMD instructions?

Are there any rules one can follow when writing code so that the compiler can generate efficient assembly using these extensions?

For example, I'm currently working on a raytracer. A shader takes some input and computes an output color from it, like this:

PixelData data = RayTracer::gatherPixelData(pixel.x, pixel.y);
Color col = shadePixel(data);

Would it, for example, be beneficial to write the shader code so that it shades 4 different pixels within one call? Something like this:

PixelData data1 = RayTracer::gatherPixelData(pixel1.x, pixel1.y);
...
shadePixels(data1, data2, data3, data4, &col1out, &col2out, &col3out, &col4out);

to process multiple data units at once. Would this be beneficial for making the compiler use SSE instructions?

Thanks!

asked by Mat, Oct 26 '10 18:10


2 Answers

I'm working in Visual Studio 2008, and in the project settings I see the option "Enable Enhanced Instruction Set", which I can set to None, SSE, or SSE2.

So the compiler will try to batch instructions together to make use of SIMD instructions?

No, the compiler will not use vector instructions on its own; it will merely use scalar SSE instructions instead of x87 ones.

What you describe is called "automatic vectorization". The Microsoft compiler does not do this; the Intel compiler does.

With the Microsoft compiler, you can use intrinsics to perform manual SSE optimizations.

answered by Suma, Oct 29 '22 02:10


Three observations.

  1. The biggest speedups come not from micro-optimizations but from good algorithms. So make sure you get that part right first. Often this means just using the right libraries for your specific domain.

  2. Once you get your algorithms right, it is time to measure. Often there is an 80/20 rule at work: 20% of your code takes 80% of the execution time. To locate that part you need a good profiler. Intel VTune can give you a sampling profile of every function, along with reports that pinpoint the performance killers. A free alternative is AMD CodeAnalyst, if you have an AMD CPU.

  3. The compiler's autovectorization capability is not a silver bullet. Although it will try really hard (especially the Intel C++ compiler), you will often need to help it by rewriting your algorithms in vector form. You can often get much better results by handcrafting small portions of the bottleneck code to use SIMD instructions. You can do that in C code using intrinsics (see VJo's link above) or use inline assembly.

Of course, steps 2 and 3 form an iterative process. If you are really serious about this, there are some good books on the subject by Intel folks, such as The Software Optimization Cookbook, and the processor reference manuals.

answered by renick, Oct 29 '22 02:10