This is the message received from running a script to check if Tensorflow is working:
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I noticed that it has mentioned SSE4.2 and AVX,
I just ran into this same problem, it seems like Yaroslav Bulatov's suggestion doesn't cover SSE4.2 support, adding --copt=-msse4.2
would suffice. In the end, I successfully built with
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k //tensorflow/tools/pip_package:build_pip_package
without getting any warning or errors.
Probably the best choice for any system is:
bazel build -c opt --copt=-march=native --copt=-mfpmath=both --config=cuda -k //tensorflow/tools/pip_package:build_pip_package
(Update: the build scripts may be eating -march=native
, possibly because it contains an =
.)
-mfpmath=both
only works with gcc, not clang. -mfpmath=sse
is probably just as good, if not better, and is the default for x86-64. 32-bit builds default to -mfpmath=387
, so changing that will help for 32-bit. (But if you want high-performance for number crunching, you should build 64-bit binaries.)
I'm not sure what TensorFlow's default for -O2
or -O3
is. gcc -O3
enables full optimization including auto-vectorization, but that sometimes can make code slower.
What this does: --copt
for bazel build
passes an option directly to gcc for compiling C and C++ files (but not linking, so you need a different option for cross-file link-time-optimization)
x86-64 gcc defaults to using only SSE2 or older SIMD instructions, so you can run the binaries on any x86-64 system. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). That's not what you want. You want to make a binary that takes advantage of all the instructions your CPU can run, because you're only running this binary on the system where you built it.
-march=native
enables all the options your CPU supports, so it makes -mavx512f -mavx2 -mavx -mfma -msse4.2
redundant. (Also, -mavx2
already enables -mavx
and -msse4.2
, so Yaroslav's command should have been fine). Also if you're using a CPU that doesn't support one of these options (like FMA), using -mfma
would make a binary that faults with illegal instructions.
TensorFlow's ./configure
defaults to enabling -march=native
, so using that should avoid needing to specify compiler options manually.
-march=native
enables -mtune=native
, so it optimizes for your CPU for things like which sequence of AVX instructions is best for unaligned loads.
This all applies to gcc, clang, or ICC. (For ICC, you can use -xHOST
instead of -march=native
.)
Let's start with the explanation of why do you see these warnings in the first place.
Most probably you have not installed TF from source and instead of it used something like pip install tensorflow
. That means that you installed pre-built (by someone else) binaries which were not optimized for your architecture. And these warnings tell you exactly this: something is available on your architecture, but it will not be used because the binary was not compiled with it. Here is the part from documentation.
TensorFlow checks on startup whether it has been compiled with the optimizations available on the CPU. If the optimizations are not included, TensorFlow will emit warnings, e.g. AVX, AVX2, and FMA instructions not included.
Good thing is that most probably you just want to learn/experiment with TF so everything will work properly and you should not worry about it
What are SSE4.2 and AVX?
Wikipedia has a good explanation about SSE4.2 and AVX. This knowledge is not required to be good at machine-learning. You may think about them as a set of some additional instructions for a computer to use multiple data points against a single instruction to perform operations which may be naturally parallelized (for example adding two arrays).
Both SSE and AVX are implementation of an abstract idea of SIMD (Single instruction, multiple data), which is
a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Thus, such machines exploit data level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment
This is enough to answer your next question.
How do these SSE4.2 and AVX improve CPU computations for TF tasks
They allow a more efficient computation of various vector (matrix/tensor) operations. You can read more in these slides
How to make Tensorflow compile using the two libraries?
You need to have a binary which was compiled to take advantage of these instructions. The easiest way is to compile it yourself. As Mike and Yaroslav suggested, you can use the following bazel command
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k //tensorflow/tools/pip_package:build_pip_package
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With