OpenMP 4.0 introduces a new construct called "omp simd". What is the benefit of using this construct over the old "parallel for"? When would each be a better choice over the other? EDIT: Here is an interesting paper related to the SIMD directive.


Parallel for vs omp simd: when to use each?

2 Answers

A simple answer:

Before 4.0, OpenMP was only used to exploit multiple threads on multiple cores. The new simd extension lets you explicitly use SIMD instructions on modern CPUs, such as Intel's AVX/SSE and ARM's NEON.

(Note that a SIMD instruction is executed in a single thread on a single core, by design. The meaning of SIMD can be stretched quite a bit for GPGPU, but I don't think you need to consider GPGPU for OpenMP 4.0.)

So, once you know SIMD instructions, you can use this new construct.

A modern CPU offers roughly three types of parallelism: (1) instruction-level parallelism (ILP), (2) thread-level parallelism (TLP), and (3) SIMD instructions (we could call this vector-level parallelism).

ILP is exploited automatically by out-of-order CPUs or by compilers. You can exploit TLP using OpenMP's parallel for and other threading libraries. So, what about SIMD? Intrinsics were one way to use SIMD instructions (alongside compilers' automatic vectorization). OpenMP's simd is a new way to use SIMD.

Take a very simple example:

for (int i = 0; i < N; ++i)
    A[i] = B[i] + C[i];

The above code computes a sum of two N-dimensional vectors. As you can easily see, there is no (loop-carried) data dependency on the array A[]. This loop is embarrassingly parallel.

There are multiple ways to parallelize this loop. Before OpenMP 4.0, it could be parallelized only with the parallel for construct. Each thread then performs roughly N/#threads iterations, spread across multiple cores.
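As a minimal sketch of this thread-level version, assuming a compiler with OpenMP enabled (for example `-fopenmp` on GCC or Clang). The function name `add_parallel` is made up for illustration:

```c
#include <assert.h>

/* Hypothetical helper: element-wise a = b + c using multiple threads.
 * If OpenMP is not enabled, the pragma is ignored and the loop runs serially. */
void add_parallel(float *a, const float *b, const float *c, int n)
{
    assert(n >= 0);
    /* The n iterations are divided among the threads of the team. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

Each iteration writes a distinct `a[i]`, so no synchronization is needed between threads.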

However, you might think that using multiple threads for such a simple addition is overkill. That is why there is vectorization, which is mostly implemented with SIMD instructions.

Using SIMD would look like this:

for (int i = 0; i < N; i += 8)
    VECTOR_ADD(A + i, B + i, C + i);

This code assumes that (1) the SIMD instruction (VECTOR_ADD) is 256-bit or 8-way (8 * 32 bits); and (2) N is a multiple of 8.
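The second assumption can be dropped by adding a scalar remainder loop. The following is only a sketch: `vector_add8` is a hypothetical plain-C stand-in for an 8-wide SIMD instruction (it is not a real intrinsic), and `add_chunked` is an illustrative name:

```c
#include <assert.h>

/* Hypothetical stand-in for one 8-wide SIMD add. On real hardware this
 * would be a single vector instruction rather than a scalar loop. */
static void vector_add8(float *a, const float *b, const float *c)
{
    for (int lane = 0; lane < 8; ++lane)
        a[lane] = b[lane] + c[lane];
}

void add_chunked(float *a, const float *b, const float *c, int n)
{
    assert(n >= 0);
    int i = 0;
    /* Main loop: full groups of 8 elements, one "vector" operation each. */
    for (; i + 8 <= n; i += 8)
        vector_add8(a + i, b + i, c + i);
    /* Remainder loop: the last n % 8 elements, one at a time. */
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

Compilers that vectorize a loop automatically typically generate a similar main-loop-plus-remainder structure.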

An 8-way SIMD instruction means that 8 items in a vector are processed by a single machine instruction. Note that Intel's AVX provides such 8-way (32-bit * 8 = 256 bits) vector instructions.

With SIMD, you still use a single core (again, this applies only to conventional CPUs, not GPUs). But you can exploit parallelism hidden in the hardware: modern CPUs dedicate hardware resources to SIMD instructions, so the SIMD lanes can execute in parallel.
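With OpenMP 4.0, the same single-thread vectorization can be requested without writing vector instructions by hand. A minimal sketch, assuming a compiler that honors `#pragma omp simd` (GCC and Clang accept `-fopenmp` or `-fopenmp-simd` for this); `add_simd` is an illustrative name:

```c
#include <assert.h>

/* Hypothetical helper: element-wise a = b + c in a single thread.
 * The pragma tells the compiler the iterations may be executed in SIMD
 * lanes. It is the caller's responsibility that a does not overlap b or c
 * in a way that creates a loop-carried dependency. */
void add_simd(float *a, const float *b, const float *c, int n)
{
    assert(n >= 0);
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

Whether the compiler actually emits vector instructions, and how wide they are, depends on the compiler and the target flags.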

You can use thread-level parallelism at the same time. The above example can be further parallelized by parallel for.
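OpenMP 4.0 also defines a combined `parallel for simd` construct for this case. A minimal sketch, with `add_parallel_simd` as an illustrative name and OpenMP assumed to be enabled at compile time:

```c
#include <assert.h>

/* Hypothetical helper: element-wise a = b + c.
 * The iterations are first divided among threads. Each thread's chunk of
 * iterations may then be executed with SIMD instructions. */
void add_parallel_simd(float *a, const float *b, const float *c, int n)
{
    assert(n >= 0);
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

For a loop this small, thread startup cost can outweigh the gain, so the combined form tends to pay off only for large `n`.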

(However, I doubt how many loops can really be transformed into SIMDized loops. The OpenMP 4.0 specification seems a bit unclear on this, so real performance and practical restrictions will depend on actual compiler implementations.)

To summarize, the simd construct lets you use SIMD instructions, so more parallelism can be exploited on top of thread-level parallelism. However, I think actual implementations will matter.


answered Oct 02 '22 14:10

minjang

The linked-to standard is relatively clear (p 13, lines 19+20)

When any thread encounters a simd construct, the iterations of the loop associated with the construct can be executed by the SIMD lanes that are available to the thread.

SIMD is a sub-thread thing. To make it more concrete, on a CPU you could imagine using simd directives to specifically request vectorization of chunks of loop iterations that individually belong to the same thread. It exposes the multiple levels of parallelism that exist within a single multicore processor, in a platform-independent way. See for instance the discussion (along with the accelerator stuff) on this Intel blog post.

So basically, you'll want to use omp parallel to distribute work onto different threads, which can then migrate to multiple cores; and you'll want to use omp simd to make use of vector pipelines (say) within each core. Normally omp parallel would go on the "outside" to deal with coarser-grained parallel distribution of work and omp simd would go around tight loops inside of that to exploit fine-grained parallelism.
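A small sketch of that nesting, assuming a row-major matrix and OpenMP enabled at compile time. The function `row_sums` and its parameters are made up for illustration:

```c
#include <assert.h>

/* Hypothetical example: out[r] = sum of row r of a rows x cols matrix
 * stored row-major in m. */
void row_sums(const float *m, float *out, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    /* Outer loop: coarse-grained, rows are distributed across threads. */
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        /* Inner loop: fine-grained, vectorized within each thread.
         * The reduction clause lets each SIMD lane keep a partial sum. */
        #pragma omp simd reduction(+:sum)
        for (int c = 0; c < cols; ++c)
            sum += m[r * cols + c];
        out[r] = sum;
    }
}
```

Note that a SIMD reduction may add the elements in a different order than the serial loop, so floating-point results can differ slightly in the last bits.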

answered Oct 02 '22 14:10

Jonathan Dursi
