The difference between "simd" construct and "for simd" construct in OpenMP 4.0

Tags:

simd

openmp

OpenMP 4.0 introduced the SIMD construct to make use of the SIMD instructions of the CPU. According to the specification (http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf), there are two constructs for vectorizing a loop: "#pragma omp simd" and "#pragma omp for simd". Both are described as vectorizing a for-loop, and when I tested them I couldn't find a difference. Does anyone know whether there is a difference between these two constructs?

andy90 asked Jul 24 '15 18:07

People also ask

Does OpenMP use SIMD?

OpenMP SIMD, introduced in the OpenMP 4.0 standard, targets making loops vector-friendly. By placing the simd directive before a loop, the programmer lets the compiler ignore assumed vector dependencies, make the loop as vector-friendly as possible, and respect the user's intention to have multiple loop iterations executed simultaneously.
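As a rough illustration of "ignoring vector dependencies", OpenMP 4.0's safelen clause lets the programmer assert a minimum dependence distance. The function name shift_update and the parameter m below are made up for illustration, and the sketch assumes the caller guarantees m >= 4:

#include <stddef.h>

/* Hypothetical sketch: safelen(4) promises the compiler that no loop-carried
   dependence spans fewer than 4 iterations, so vectors of up to 4 lanes are
   safe even though a[i] reads a[i - m]. The caller must ensure m >= 4. */
void shift_update(float *a, size_t n, size_t m)
{
   #pragma omp simd safelen(4)
   for (size_t i = m; i < n; i++)
      a[i] = a[i - m] - 1.0f;
}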

What is SIMD loop?

A SIMD loop has logical iterations numbered 0,1,...,N-1 where N is the number of loop iterations, and the logical numbering denotes the sequence in which the iterations would be executed if the associated loop(s) were executed with no SIMD instructions.

How does Pragma OMP parallel for work?

#pragma omp parallel spawns a group of threads, while #pragma omp for divides loop iterations between the spawned threads. You can do both things at once with the fused #pragma omp parallel for directive.
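For instance, the following sketch shows the two spellings side by side; the function names scale_explicit and scale_fused are made up for illustration:

/* Explicit form: parallel spawns the team, for shares the loop iterations. */
void scale_explicit(double *x, int n, double s)
{
   #pragma omp parallel
   {
      #pragma omp for
      for (int i = 0; i < n; i++)
         x[i] *= s;
   }
}

/* Fused form: one combined directive creates the team and divides the loop. */
void scale_fused(double *x, int n, double s)
{
   #pragma omp parallel for
   for (int i = 0; i < n; i++)
      x[i] *= s;
}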


1 Answer

#pragma omp simd (the SIMD construct) instructs the OpenMP compiler to vectorise the loop that follows without worksharing, that is without distributing the loop iterations among multiple threads (if any).
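A minimal sketch of the standalone construct; the dot function below is illustrative, not from the question, and assumes OpenMP 4.0's reduction clause on simd:

/* The encountering thread vectorises this loop itself; no worksharing
   happens. reduction(+:sum) combines the per-lane partial sums. */
float dot(const float *a, const float *b, int n)
{
   float sum = 0.0f;
   #pragma omp simd reduction(+:sum)
   for (int i = 0; i < n; i++)
      sum += a[i] * b[i];
   return sum;
}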

#pragma omp for (the loop construct) instructs the compiler to execute the loop that follows while distributing the iterations among the threads of the current team. Therefore, the loop construct is only useful when placed within the lexical or the dynamic scope of a parallel region, e.g.

#pragma omp parallel
{
   ...
   #pragma omp for
   for (i = 0; i < 100; i++) { ... }
   ...
}

#pragma omp for simd (also called loop SIMD construct) combines the two constructs above, i.e. it both distributes the iteration space among the threads in the team and further vectorises the partial loop that each thread performs. If not used within the scope of a parallel region, the for simd construct is equivalent to the simd construct.
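A sketch of that combination (the saxpy name is made up for illustration): inside the parallel region the iteration space is first split among the threads, and each thread then vectorises its own chunk:

void saxpy(float *y, const float *x, float a, int n)
{
   #pragma omp parallel
   {
      /* Iterations are shared among the threads, then each
         thread vectorises the chunk it received. */
      #pragma omp for simd
      for (int i = 0; i < n; i++)
         y[i] = a * x[i] + y[i];
   }
}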

It is possible to combine the loop SIMD construct with the parallel construct:

#pragma omp parallel for simd
for (i = 0; i < 100; i++) { ... }

This combined construct creates a parallel region, distributes the iterations of the loop among the threads, and vectorises the partial loops.

Note that sometimes vectorisation and multithreading are not orthogonal with respect to performance. For example, if the loop is memory-bound, then both vectorisation and multithreading alone could lead to exhaustion of the available memory bandwidth and combining them won't bring any further speedup.
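As a hypothetical example of such a loop, consider a STREAM-style triad: each iteration performs a single multiply-add but moves three array elements to or from memory, so either vectorisation or threading alone can be enough to saturate the memory bus:

/* Memory-bound sketch: the arithmetic is trivial and memory traffic
   dominates, so combining parallel, for and simd may add little speedup
   over either multithreading or vectorisation alone. */
void triad(double *a, const double *b, const double *c, double s, int n)
{
   #pragma omp parallel for simd
   for (int i = 0; i < n; i++)
      a[i] = b[i] + s * c[i];
}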

Also, when comparing the speedup with #pragma omp simd and with #pragma omp [parallel] for simd, keep in mind that multithreading alone usually delivers better speedup than vectorisation for the same amount of "multiplicity", i.e. a four-way SIMD-ised loop might (and most likely would) execute slower than when the same loop is computed with scalar instructions but split among four threads.

Hristo Iliev answered Oct 08 '22 19:10