
TensorFlow OMP: Error #15 when training

I am training a neural network with TensorFlow on a CentOS HPC cluster, but I get this error at the start of training:

OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

The code is for instance segmentation and has worked fine for many people, but it fails in my case.

Why does this error occur, and how can I solve it?

Kunyu Shi asked Dec 08 '22 15:12


2 Answers

I had a similar issue on macOS with the same error message (see this question) and tracked it down to the following cause:

Problem:

I had a conda environment in which NumPy, SciPy and TensorFlow were installed.

Conda uses Intel(R) MKL Optimizations. From the docs:

Anaconda has packaged MKL-powered binary versions of some of the most popular numerical/scientific Python libraries into MKL Optimizations for improved performance.

The Intel MKL functions (e.g. FFT, LAPACK, BLAS) are threaded with the OpenMP technology.

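But on macOS you do not need MKL, because the Accelerate framework already provides its own optimized BLAS and LAPACK routines. With the MKL builds installed, a second copy of the OpenMP runtime can end up loaded into the same process, and that is what the error message reports: OMP Error #15: ...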
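To check whether an environment really ships more than one OpenMP runtime, you can scan the environment directory for the usual library file names. This is only a rough sketch: the helper name find_openmp_runtimes and the set of names it looks for (libiomp5 for Intel, libomp for LLVM, libgomp for GNU, with .so or .dylib suffixes) are my own assumptions, not something taken from the conda docs.

```python
import re
from pathlib import Path

# Assumed file-name pattern for common OpenMP runtimes:
# libiomp5 (Intel), libomp (LLVM), libgomp (GNU), e.g. libgomp.so.1 or libomp.dylib.
OPENMP_NAME = re.compile(r"lib(iomp5|omp|gomp)(\.\d+)*\.(so|dylib)(\.\d+)*")


def find_openmp_runtimes(root):
    """Return sorted, de-duplicated paths under `root` that look like OpenMP runtimes."""
    hits = set()
    for path in Path(root).rglob("lib*"):
        if path.is_file() and OPENMP_NAME.fullmatch(path.name):
            # Resolve symlinks so that an alias and its target count only once.
            hits.add(path.resolve())
    return sorted(hits)
```

Calling find_openmp_runtimes(sys.prefix) from inside the activated environment lists the candidates. Finding more than one runtime is a hint, not proof, that two copies get loaded into the same process.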

Workaround:

Install all packages without MKL support. First install the nomkl package:

conda install nomkl

then install the packages you need:

conda install numpy scipy pandas tensorflow

and finally remove MKL itself:

conda remove mkl mkl-service

For more information see conda MKL Optimizations.
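After running the commands above, it can be worth confirming that no MKL packages are left in the environment. The sketch below parses the text printed by conda list. The helper name mkl_packages is hypothetical, and it only looks at package names starting with "mkl", which is an assumption that may miss MKL-linked builds of other packages.

```python
def mkl_packages(conda_list_output):
    """Return names of packages in `conda list` output that start with 'mkl'."""
    names = []
    for line in conda_list_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#' header lines
        name = line.split()[0]  # first column is the package name
        if name.startswith("mkl"):
            names.append(name)
    return names


# Illustrative input only: the versions and build strings are placeholders.
sample = """# packages in environment at /path/to/env:
#
# Name                    Version                   Build  Channel
mkl                       0.0.0                   build_0
mkl-service               0.0.0                   build_0
numpy                     0.0.0                   build_0
"""
print(mkl_packages(sample))
```

In a real environment you would pass in the captured output of conda list instead of the sample string. An empty result suggests that the mkl and mkl-service packages are gone.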

J.E.K answered Dec 10 '22 19:12


I solved this problem by asking an HPC server expert. This may be useful for Compute Canada users.

Why does it occur?

The error is caused by a conflict between a prebuilt TensorFlow Python wheel (built specifically for the Compute Canada system) and the conda environment. To quote the expert: "conda is always a bit problematic because it downloads precompiled binaries, mileage may vary..."

How to solve it?

As @abccd pointed out, "The best thing to do is to ensure that only a single OpenMP runtime is linked into the process". However, I haven't figured out how to ensure that.
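On Linux, one way to at least see which OpenMP runtimes a running process has loaded is to read /proc/self/maps after importing the libraries. This is a minimal sketch, not a fix: the helper names and the library name pattern (libiomp5, libomp, libgomp) are my assumptions, and it only works where /proc is available.

```python
import re
from pathlib import Path

# Assumed shared-object names of common OpenMP runtimes on Linux.
OPENMP_SO = re.compile(r"lib(iomp5|omp|gomp)(\.\d+)*\.so(\.\d+)*")


def loaded_openmp_libs(maps_text):
    """Return sorted unique OpenMP runtime paths named in /proc/<pid>/maps text."""
    libs = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, then an optional path.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if OPENMP_SO.fullmatch(Path(path).name):
                libs.add(path)
    return sorted(libs)


def openmp_libs_in_this_process():
    """Linux only. Import TensorFlow, NumPy, etc. first so their libraries are mapped."""
    return loaded_openmp_libs(Path("/proc/self/maps").read_text())
```

If openmp_libs_in_this_process() returns two different paths, for example one inside the conda environment and one next to the TensorFlow wheel, that matches the situation the error message describes.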

So I uninstalled conda and installed everything inside the module system using pip install. After that, the network worked fine.

Kunyu Shi answered Dec 10 '22 17:12