I know that NumPy can use different backends such as OpenBLAS or MKL. I have also read that MKL is heavily optimized for Intel, so people usually suggest using OpenBLAS on AMD, right?
I use the following test code:
import numpy as np

def testfunc(x):
    np.random.seed(x)
    X = np.random.randn(2000, 4000)
    np.linalg.eigh(X @ X.T)

%timeit testfunc(0)
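%timeit only works inside IPython or Jupyter. To run the same benchmark as a plain script, a minimal sketch using the standard timeit module could look like this. The helper name best_time and the repeat count are illustrative choices, not part of the original test.

```python
import timeit


def best_time(func, repeat=5):
    """Return the fastest of `repeat` single runs of func, in seconds."""
    # number=1 calls func once per measurement; taking the minimum
    # reduces noise from other processes running on the machine.
    return min(timeit.repeat(func, number=1, repeat=repeat))
```

With testfunc defined as above, print(best_time(lambda: testfunc(0))) would report the best run time in seconds.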
I have tested this code using different CPUs:
I am using the same Conda environment on all three systems. According to np.show_config(), the Intel system uses the MKL backend for NumPy (libraries = ['mkl_rt', 'pthread']), whereas the AMD systems use OpenBLAS (libraries = ['openblas', 'openblas']). The CPU core usage was determined by observing top in a Linux shell:
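For reference, the backend check via np.show_config() can be scripted. The sketch below is only a heuristic: it captures the printed configuration and looks for the substrings "mkl" or "openblas". The function names guess_blas_backend and numpy_config_text are hypothetical helpers, and the exact layout of the printed text differs between NumPy versions.

```python
import contextlib
import io


def guess_blas_backend(config_text):
    """Guess the BLAS backend from the text printed by np.show_config()."""
    text = config_text.lower()
    if "mkl" in text:
        return "MKL"
    if "openblas" in text:
        return "OpenBLAS"
    return "unknown"


def numpy_config_text():
    """Capture what np.show_config() prints to stdout as a string."""
    import numpy as np  # imported here so the helper above works without NumPy

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()
```

Calling guess_blas_backend(numpy_config_text()) in each environment should then name the backend that NumPy was built against.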
The above observations give rise to the following questions:
Update 1: The OpenBLAS version is 0.3.6. I read somewhere that upgrading to a newer version might help. However, with OpenBLAS updated to 0.3.10, the run time of testfunc is still 1.55s on the AMD Ryzen Threadripper 3970X.
Update 2: Using the MKL backend for NumPy in conjunction with the environment variable MKL_DEBUG_CPU_TYPE=5 (as described here) reduces the run time of testfunc on the AMD Ryzen Threadripper 3970X to only 0.52s, which is more or less satisfying. FTR, setting this variable via ~/.profile did not work for me on Ubuntu 20.04. Setting the variable from within Jupyter did not work either. So instead I put it into ~/.bashrc, which works now. Still, with performance only 35% better than an old Intel Xeon, is this all we get, or can we get more out of it?
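One possible explanation for the Jupyter observation is that the variable has to be in the process environment before MKL is loaded, so setting it after NumPy has already been imported presumably comes too late. A minimal sketch that sidesteps shell startup files by launching a fresh interpreter with the variable already set is shown below. The helper names env_with_mkl_debug and run_python are hypothetical.

```python
import os
import subprocess
import sys


def env_with_mkl_debug(base_env=None):
    """Return a copy of the environment with MKL_DEBUG_CPU_TYPE set to 5."""
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_DEBUG_CPU_TYPE"] = "5"
    return env


def run_python(code, env):
    """Run a Python snippet in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

For example, run_python("import os; print(os.environ['MKL_DEBUG_CPU_TYPE'])", env_with_mkl_debug()) confirms that the child process sees the variable. The snippet can then be replaced by the benchmark code.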
Update 3: I played around with the number of threads used by MKL/OpenBLAS:
The run times are reported in seconds. The best result of each column is underlined. I used OpenBLAS 0.3.6 for this test. The conclusions from this test:
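The post does not say how the thread count was varied. One common approach, sketched here as an assumption, is to set the backend's thread-limit environment variables before the process starts. MKL_NUM_THREADS, OPENBLAS_NUM_THREADS and OMP_NUM_THREADS are the usual variables; the helper name env_for_threads is hypothetical.

```python
import os

# Environment variables that MKL, OpenBLAS and OpenMP read to size their thread pools.
THREAD_VARS = ("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS")


def env_for_threads(n_threads, base_env=None):
    """Return a copy of the environment that limits BLAS to n_threads threads."""
    env = dict(os.environ if base_env is None else base_env)
    for name in THREAD_VARS:
        env[name] = str(n_threads)
    return env
```

Each benchmark run would then be started in a new process, for example with subprocess.run(..., env=env_for_threads(n)), because the limits are normally read when the library is loaded.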
Update 4: Just for clarification: no, I do not think that (a) this or (b) that answers this question. (a) suggests that "OpenBLAS does nearly as well as MKL", which strongly contradicts the numbers I observed. According to my numbers, OpenBLAS performs far worse than MKL. The question is why. (a) and (b) both suggest using MKL_DEBUG_CPU_TYPE=5 in conjunction with MKL to achieve maximum performance. This might be right, but it explains neither why OpenBLAS is so slow nor why, even with MKL and MKL_DEBUG_CPU_TYPE=5, the 32-core Threadripper is only 36% faster than the six-year-old 6-core Xeon.
As of 2021, Intel has unfortunately removed MKL_DEBUG_CPU_TYPE to prevent people on AMD from using the workaround presented in the accepted answer. This means that the workaround no longer works with newer MKL versions, and AMD users have to either switch to OpenBLAS or keep using MKL.
To use the workaround, follow this method:
1. Create a conda environment with conda's and NumPy's MKL=2019.
2. Activate the environment.
3. Set MKL_DEBUG_CPU_TYPE = 5.
The commands for the above steps:
conda create -n my_env -c anaconda python numpy mkl=2019.* blas=*=*mkl
conda activate my_env
conda env config vars set MKL_DEBUG_CPU_TYPE=5
And that's it!
I think this should help:
"The best result in the chart is for the TR 3960x using MKL with the environment var MKL_DEBUG_CPU_TYPE=5. AND it is significantly better than the low optimization code path from MKL alone. AND,OpenBLAS does nearly as well as MKL with MKL_DEBUG_CPU_TYPE=5 set." https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AMD-Ryzen-and-Threadripper-CPU-s-Effectively-for-Python-Numpy-And-Other-Applications-1637/
How to set up: 'Make the setting permanent by entering MKL_DEBUG_CPU_TYPE=5 into the System Environment Variables. This has several advantages, one of them being that it applies to all instances of Matlab and not just the one opened using the .bat file' https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/?sort=new