I am using distributed, a framework for parallel computation. My primary use case is with NumPy. When I run NumPy code that relies on np.linalg, I get an error mentioning OMP_NUM_THREADS, which is related to the OpenMP library.
A minimal example:
from distributed import Executor
import numpy as np

e = Executor('144.92.142.192:8786')

def f(x, m=200, n=1000):
    A = np.random.randn(m, n)
    x = np.random.randn(n)  # note: shadows the argument x
    # return np.fft.fft(x)       # tested; no errors
    # return np.random.randn(n)  # tested; no errors
    return A.dot(x).sum()        # tested; throws error below

s = [e.submit(f, x) for x in [1, 2, 3, 4]]
s = e.gather(s)
When I test the linear algebra case (the A.dot(x).sum() line), e.gather fails because each job throws the following error:
OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
What should I set OMP_NUM_THREADS to?
export OMP_NUM_THREADS=1
or
dask-worker --nthreads 1
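If you can't easily edit the worker's shell environment, here is a minimal sketch of the same idea in Python. It assumes the variable is set before NumPy (and hence its BLAS backend) is imported, since most BLAS libraries read it once at load time:

import os

# Must happen before numpy is imported: BLAS reads OMP_NUM_THREADS at load time.
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # BLAS now runs single-threaded inside each task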
The OMP_NUM_THREADS environment variable controls the number of threads that many libraries, including the BLAS library powering numpy.dot, use in their computations, like matrix multiply.
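If you want to keep multi-threaded BLAS for the rest of your program and only throttle it inside the task, one option (an assumption on my part, not part of the answer itself) is the third-party threadpoolctl package, which caps the BLAS thread pool for a single block of code:

import numpy as np
from threadpoolctl import threadpool_limits  # third-party: pip install threadpoolctl

def f(x, m=200, n=1000):
    A = np.random.randn(m, n)
    x = np.random.randn(n)
    # Limit the BLAS/OpenMP pool to one thread just for this dot product.
    with threadpool_limits(limits=1, user_api="blas"):
        return A.dot(x).sum()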
The conflict here is that you have two parallel libraries nested inside each other: BLAS and dask.distributed. Each library is designed to use as many threads as there are logical cores available in the system.
For example, if you had eight cores then dask.distributed might run your function f eight times at once on different threads. The numpy.dot function call within f would use eight threads per call, resulting in 64 threads running at once.
This is actually fine in principle; everything can run correctly, but it will be slower than if you used just eight threads at a time, either by limiting dask.distributed or by limiting BLAS.
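If you'd rather limit the dask side from Python instead of the dask-worker command line, here is a sketch assuming a local cluster rather than the remote scheduler in the question (in newer versions of distributed, Executor was renamed Client):

from distributed import Client

# One task at a time per worker process, so BLAS inside each task can use
# its own threads without stacking on top of dask's thread pool.
client = Client(n_workers=8, threads_per_worker=1)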
Your system probably has OMP_THREAD_LIMIT set to some reasonable number like 16 to warn you when this kind of oversubscription happens.
If you're using MKL BLAS you might also get some improvement using the TBB threading layer. I haven't actually had occasion to try it out, so YMMV.
http://conference.scipy.org/proceedings/scipy2018/anton_malakhov.html
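As a further hedged sketch (the MKL_THREADING_LAYER variable is Intel MKL's own switch, not something the answer above mentions), you can ask MKL to thread through TBB before NumPy is imported:

import os

# Switch MKL's internal threading from OpenMP to Intel TBB, which composes
# better when many tasks share one machine. Set before numpy loads MKL.
os.environ["MKL_THREADING_LAYER"] = "TBB"

import numpy as np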