I have code with the following structure:
#pragma omp parallel
{
#pragma omp for nowait
{
// first for loop
}
#pragma omp for nowait
{
// second for loop
}
#pragma omp barrier
<-- #pragma omp single/critical/atomic --> not sure
dgemm_(....)
#pragma omp for
{
// yet another for loop
}
}
For dgemm_, I link with multithreaded mkl. I want mkl to use all available 8 threads. What is the best way to do so?
This is a case of nested parallelism. MKL supports it, but only when your program and MKL use the same OpenMP runtime. MKL uses Intel's OpenMP runtime by default, so the simplest route is to build your executable with the Intel C/C++ compiler. The reason for that restriction is that different OpenMP runtimes do not play well with each other inside one process.
Once that is sorted out, you should enable nested parallelism by setting OMP_NESTED to TRUE and disable MKL's detection of nested parallelism by setting MKL_DYNAMIC to FALSE.
If the data to be processed with dgemm_ is shared, then you have to invoke the latter from within a single construct. If each thread processes its own private data, then you don't need any synchronisation constructs, but using multithreaded MKL won't give you any benefit either. Therefore I would assume that your case is the former.
To summarise:
#pragma omp single
dgemm_(...);
and run with:
$ MKL_DYNAMIC=FALSE MKL_NUM_THREADS=8 OMP_NUM_THREADS=8 OMP_NESTED=TRUE ./exe
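To check that the nesting settings took effect, one option is a small diagnostic along these lines. It is only a sketch: inner_team_size is a hypothetical helper name, not part of OpenMP or MKL, and it reports how many threads the OpenMP runtime grants to a nested region opened from inside a single construct, not what MKL itself does internally. The omp.h include is guarded so the file still builds without OpenMP, in which case it simply reports 1.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns the number of threads in a nested
   parallel region that is opened by one thread of an outer region.
   Returns 1 if nesting is disabled or OpenMP is not enabled. */
int inner_team_size(void)
{
    int inner = 1;              /* shared by both parallel regions */
#pragma omp parallel
    {
#pragma omp single
        {
            /* Same position as the dgemm_ call in the answer. */
#pragma omp parallel
            {
#pragma omp master
                {
#ifdef _OPENMP
                    inner = omp_get_num_threads();
#endif
                }
            }
        }
    }
    assert(inner >= 1);
    return inner;
}
```

Calling inner_team_size() from main and printing the result should give a value greater than 1 when nested parallelism is active, and 1 when the runtime serialises the inner region.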
You could also set the parameters with the appropriate calls:
mkl_set_dynamic(0);
mkl_set_num_threads(8);
omp_set_nested(1);
#pragma omp parallel num_threads(8) ...
{
...
}
though I would prefer to use environment variables instead.
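Putting the pieces together, the structure from the question could look roughly like the sketch below. Several things here are assumptions for illustration: gemm_standin is a hypothetical stand-in for dgemm_ (so the example builds without MKL) that opens its own parallel region the way a threaded BLAS would, run_pipeline is a made-up name, and the loop bodies are placeholders. It calls omp_set_max_active_levels(2) instead of omp_set_nested(1), since recent OpenMP versions deprecate the latter; with MKL you would keep the mkl_set_dynamic and mkl_set_num_threads calls shown above.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for dgemm_: c = a * b for n x n row-major
   matrices. It starts its own (nested) parallel region. */
static void gemm_standin(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);
#pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Fills a and b, multiplies them once, halves c, and returns the sum of c. */
double run_pipeline(int n, double *a, double *b, double *c)
{
    double total = 0.0;
#ifdef _OPENMP
    omp_set_max_active_levels(2);   /* allow one level of nesting */
#endif
#pragma omp parallel
    {
#pragma omp for nowait
        for (int i = 0; i < n * n; ++i)   /* first for loop */
            a[i] = 1.0;
#pragma omp for nowait
        for (int i = 0; i < n * n; ++i)   /* second for loop */
            b[i] = 2.0;
#pragma omp barrier                       /* a and b are complete here */
#pragma omp single
        gemm_standin(n, a, b, c);         /* one thread calls, the library threads */
        /* implicit barrier at the end of single: c is ready for all threads */
#pragma omp for reduction(+:total)
        for (int i = 0; i < n * n; ++i) { /* yet another for loop */
            c[i] *= 0.5;
            total += c[i];
        }
    }
    return total;
}
```

The single construct ensures the shared matrices are multiplied exactly once, and its implicit barrier keeps the remaining threads from entering the last loop before the product is available.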