As I understand it, most modern compilers automatically use SIMD instructions for loops where appropriate, if I set the corresponding compiler flag. Since the compiler can only vectorize when it can prove that doing so will not change the semantics of the program, it will skip vectorization in cases where I actually know it's safe, but the compiler, for various reasons, thinks it's not.
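Pointer aliasing is a typical example: unless the compiler can prove that dest does not overlap src1 or src2, it must either emit runtime overlap checks or keep the loop scalar. A minimal sketch (the __restrict qualifier is non-standard but widely supported by GCC, Clang, and MSVC) of how that particular doubt can be removed:

void add(double* __restrict dest,
         const double* __restrict src1,
         const double* __restrict src2,
         int n)
{
    // __restrict promises the compiler that the three arrays do not overlap,
    // removing one common obstacle to auto-vectorization.
    for (int i = 0; i < n; i++)
        dest[i] = src1[i] + src2[i];
}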
Are there explicit vectorization instructions that I can use in plain C++ without libraries, which let me process vectorized data myself instead of relying on the compiler? I imagine it will look something like this:
double* dest;
const double *src1, *src2;  // note: the '*' must be repeated for each pointer
// ...
for (uint32_t i = 0; i < n; i += vectorization_size / sizeof(double))
{
    // vectorized_add is imagined: add one SIMD register's worth of doubles
    vectorized_add(&dest[i], &src1[i], &src2[i]);
}
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).
"Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.
Loop vectorization transforms a procedural loop so that a single vector instruction operates on multiple pairs of operands at once. Programs spend most of their time in such loops, so vectorizing them can yield significant speedups, especially over large data sets.
Vectorization is the use of vector instructions to speed up program execution. It can be done explicitly by the programmer, by writing vector instructions directly, or by the compiler; the latter case is called auto-vectorization.
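To make these definitions concrete, here is a sketch of the same loop in scalar form and in a hand-vectorized form, assuming an x86 target with SSE2 (the intrinsics in <emmintrin.h> are compiler extensions, not plain C++):

#include <emmintrin.h>  // SSE2 intrinsics (supported by GCC, Clang, and MSVC on x86)

// Scalar form: one addition per iteration.
void add_scalar(double* dest, const double* src1, const double* src2, int n)
{
    for (int i = 0; i < n; i++)
        dest[i] = src1[i] + src2[i];
}

// Vectorized form: two doubles per iteration in a 128-bit register.
// Assumes n is even and the pointers are 16-byte aligned (required by _mm_load_pd).
void add_sse2(double* dest, const double* src1, const double* src2, int n)
{
    for (int i = 0; i < n; i += 2)
    {
        __m128d sum = _mm_add_pd(_mm_load_pd(&src1[i]), _mm_load_pd(&src2[i]));
        _mm_store_pd(&dest[i], sum);
    }
}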
Plain C++? No. std::valarray can lead your compiler to the SIMD water, but it can't make it drink.
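For illustration only: a std::valarray version of elementwise addition expresses the intent in standard C++, but whether the compiler actually emits SIMD instructions for it is entirely up to the optimizer:

#include <valarray>

std::valarray<double> add(const std::valarray<double>& a,
                          const std::valarray<double>& b)
{
    // Elementwise addition; the standard permits, but does not require,
    // a SIMD implementation.
    return a + b;
}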
OpenMP is the least "library" library out there: it's more of a language extension than a library, and all major C++ compilers support it. While primarily and historically used for multicore parallelism, OpenMP 4.0 introduced SIMD-specific constructs which can at least urge your compiler to vectorize certain clearly-vectorizable procedures, even ones with apparently scalar subroutines. It can also help you identify aspects of your code which are preventing the compiler from vectorizing. (And besides... don't you want multicore parallelism too?)
double* dest;
const double *src1, *src2;
int n;
// ...
#pragma omp simd
for (int i = 0; i < n; i++)
{
    dest[i] = src1[i] + src2[i];
}
Going the last mile with reduced-precision operations, multilane aggregation, branch-free masking, etc., really requires an explicit connection to the underlying instruction set, and isn't possible with anything close to "plain C++". OpenMP can get you pretty far, though.
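As one example of how far: OpenMP 4.0's reduction clause lets the compiler vectorize a loop-carried sum that strict floating-point semantics would otherwise keep scalar (a sketch; build with -fopenmp, or GCC/Clang's -fopenmp-simd to enable the pragmas without the OpenMP runtime):

#include <cstddef>

double dot(const double* a, const double* b, std::size_t n)
{
    double sum = 0.0;
    // The reduction clause licenses the compiler to spread the additions
    // across SIMD lanes and combine the partial sums at the end.
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}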
TL;DR: No guarantees, but keep it simple (KISS) and you are likely to get highly optimized code. Measure and inspect the generated code before tinkering with it.
You can play with this on online compilers; e.g., gcc.godbolt.org shows that GCC 5.2 with -O3 vectorizes the following straightforward call to std::transform:
#include <algorithm>

const int sz = 1024;

void f(double* src1, double* src2, double* dest)
{
    std::transform(src1, src1 + sz, src2, dest,
                   [](double lhs, double rhs) {
                       return lhs + rhs;
                   });
}
There was a similar Q&A earlier this week. The general theme seems to be that on modern processors and compilers, the more straightforward your code (plain algorithm calls), the more likely you'll get highly optimized (vectorized, unrolled) code.