I know that Numpy can use different backends like OpenBLAS or MKL. I have also read that MKL is heavily optimized for Intel, so usually people suggest to use OpenBLAS on AMD, right? I use the following test code: <pre class="prettyprint lang-py prettyprint-override"><code>import numpy as np def testfunc(x): np.random.seed(x) X = np.random.randn(2000, 4000) np.linalg.eigh(X @ X.T) %timeit testfunc(0) </code></pre> I have tested this code using different CPUs: <ul> <li>On Intel Xeon E5-1650 v3, this code performs in 0.7s using 6 out of 12 cores.</li> <li>On AMD Ryzen 5 2600, this code performs in 1.45s using all 12 cores.</li> <li>On AMD Ryzen Threadripper 3970X, this code performs in 1.55s using all 64 cores.</li> </ul> I am using the same Conda environment on all three systems. According to <code>np.show_config()</code>, the Intel system uses the MKL backend for Numpy (<code>libraries = ['mkl_rt', 'pthread']</code>), whereas the AMD systems use OpenBLAS (<code>libraries = ['openblas', 'openblas']</code>). The CPU core usage was determined by observing <code>top</code> in a Linux shell: <ul> <li>For the Intel Xeon E5-1650 v3 CPU (6 physical cores), it shows 12 cores (6 idling).</li> <li>For the AMD Ryzen 5 2600 CPU (6 physical cores), it shows 12 cores (none idling).</li> <li>For the AMD Ryzen Threadripper 3970X CPU (32 physical cores), it shows 64 cores (none idling).</li> </ul> The above observations give rise to the following questions: <ol> <li>Is that normal, that linear algebra on up-to-date AMD CPUs using OpenBLAS is that much slower than on a six-year-old Intel Xeon? (also addressed in Update 3) </li> <li>Judging by the observations of the CPU load, it looks like Numpy utilizes the multi-core environment in all three cases. How can it be that the Threadripper is even slower than the Ryzen 5, even though it has almost six times as many physical cores? (also see Update 3) </li> <li>Is there anything that can be done to speed up the computations on the Threadripper? (partially answered in Update 2) </li> </ol> <hr> Update 1: The OpenBLAS version is 0.3.6. I read somewhere, that upgrading to a newer version might help, however, with OpenBLAS updated to 0.3.10, the performance for <code>testfunc</code> is still 1.55s on AMD Ryzen Threadripper 3970X. <hr> Update 2: Using the MKL backend for Numpy in conjunction with the environment variable <code>MKL_DEBUG_CPU_TYPE=5</code> (as described here) reduces the run time for <code>testfunc</code> on AMD Ryzen Threadripper 3970X to only 0.52s, which is actually more or less satisfying. FTR, setting this variable via <code>~/.profile</code> did not work for me on Ubuntu 20.04. Also, setting the variable from within Jupyter did not work. So instead I put it into <code>~/.bashrc</code> which works now. Anyways, performing 35% faster than an old Intel Xeon, is this all we get, or can we get more out of it? <hr> Update 3: I play around with the number of threads used by MKL/OpenBLAS: <img src="https://i.stack.imgur.com/uVJV8.png" alt="run time by the number of threads and library"> The run times are reported in seconds. The best result of each column is underlined. I used OpenBLAS 0.3.6 for this test. The conclusions from this test: <ul> <li> The single-core performance of the Threadripper using OpenBLAS is a bit better than the single-core performance of the Xeon (11% faster), however, its single-core performance is even better when using MKL (34% faster).</li> <li> The multi-core performance of the Threadripper using OpenBLAS is ridiculously worse than the multi-core performance of the Xeon. What is going on here?</li> <li> The Threadripper performs overall better than the Xeon, when MKL is used (26% to 38% faster than Xeon). The overall best performance is achieved by the Threadripper using 16 threads and MKL (36% faster than Xeon).</li> </ul> <hr> Update 4: Just for clarification. No, I do not think that (a) this or (b) that answers this question. (a) suggests that "OpenBLAS does nearly as well as MKL", which is a strong contradiction to the numbers I observed. According to my numbers, OpenBLAS performs ridiculously worse than MKL. The question is why. (a) and (b) both suggest using <code>MKL_DEBUG_CPU_TYPE=5</code> in conjunction with MKL to achieve maximum performance. This might be right, but it does neither explain why OpenBLAS is that dead slow. Neither it explains, why even with MKL and <code>MKL_DEBUG_CPU_TYPE=5</code> the 32-core Threadripper is only 36% faster than the six-year-old 6-core Xeon.

As of 2021, Intel unfortunately removed the <code>MKL_DEBUG_CPU_TYPE</code> to prevent people on AMD use the workaround presented in the accepted answer. This means that the workaround no longer works, and AMD users have to either switch to OpenBLAS or keep using MKL. To use the workaround, follow this method: <ol> <li>Create a <code>conda</code> environment with <code>conda</code>'s and NumPy's MKL=2019.</li> <li>Activate the environment</li> <li>Set <code>MKL_DEBUG_CPU_TYPE</code> = 5</li> </ol> The commands for the above steps: <ol> <li><code>conda create -n my_env -c anaconda python numpy mkl=2019.* blas=*=*mkl</code></li> <li><code>conda activate my_env</code></li> <li><code>conda env config vars set MKL_DEBUG_CPU_TYPE=5</code></li> </ol> And thats it!

Why is Numpy with Ryzen Threadripper so much slower than Xeon?

Tags:

performance

python

numpy

intel

amd-processor

I know that Numpy can use different backends like OpenBLAS or MKL. I have also read that MKL is heavily optimized for Intel, so usually people suggest to use OpenBLAS on AMD, right?

I use the following test code:

import numpy as np  def testfunc(x):     np.random.seed(x)     X = np.random.randn(2000, 4000)     np.linalg.eigh(X @ X.T)  %timeit testfunc(0)

I have tested this code using different CPUs:

On Intel Xeon E5-1650 v3, this code performs in 0.7s using 6 out of 12 cores.
On AMD Ryzen 5 2600, this code performs in 1.45s using all 12 cores.
On AMD Ryzen Threadripper 3970X, this code performs in 1.55s using all 64 cores.

I am using the same Conda environment on all three systems. According to np.show_config(), the Intel system uses the MKL backend for Numpy (libraries = ['mkl_rt', 'pthread']), whereas the AMD systems use OpenBLAS (libraries = ['openblas', 'openblas']). The CPU core usage was determined by observing top in a Linux shell:

For the Intel Xeon E5-1650 v3 CPU (6 physical cores), it shows 12 cores (6 idling).
For the AMD Ryzen 5 2600 CPU (6 physical cores), it shows 12 cores (none idling).
For the AMD Ryzen Threadripper 3970X CPU (32 physical cores), it shows 64 cores (none idling).

The above observations give rise to the following questions:

Is that normal, that linear algebra on up-to-date AMD CPUs using OpenBLAS is that much slower than on a six-year-old Intel Xeon? (also addressed in Update 3)
Judging by the observations of the CPU load, it looks like Numpy utilizes the multi-core environment in all three cases. How can it be that the Threadripper is even slower than the Ryzen 5, even though it has almost six times as many physical cores? (also see Update 3)
Is there anything that can be done to speed up the computations on the Threadripper? (partially answered in Update 2)

Update 1: The OpenBLAS version is 0.3.6. I read somewhere, that upgrading to a newer version might help, however, with OpenBLAS updated to 0.3.10, the performance for testfunc is still 1.55s on AMD Ryzen Threadripper 3970X.

Update 2: Using the MKL backend for Numpy in conjunction with the environment variable MKL_DEBUG_CPU_TYPE=5 (as described here) reduces the run time for testfunc on AMD Ryzen Threadripper 3970X to only 0.52s, which is actually more or less satisfying. FTR, setting this variable via ~/.profile did not work for me on Ubuntu 20.04. Also, setting the variable from within Jupyter did not work. So instead I put it into ~/.bashrc which works now. Anyways, performing 35% faster than an old Intel Xeon, is this all we get, or can we get more out of it?

Update 3: I play around with the number of threads used by MKL/OpenBLAS:

run time by the number of threads and library

The run times are reported in seconds. The best result of each column is underlined. I used OpenBLAS 0.3.6 for this test. The conclusions from this test:

The single-core performance of the Threadripper using OpenBLAS is a bit better than the single-core performance of the Xeon (11% faster), however, its single-core performance is even better when using MKL (34% faster).
The multi-core performance of the Threadripper using OpenBLAS is ridiculously worse than the multi-core performance of the Xeon. What is going on here?
The Threadripper performs overall better than the Xeon, when MKL is used (26% to 38% faster than Xeon). The overall best performance is achieved by the Threadripper using 16 threads and MKL (36% faster than Xeon).

Update 4: Just for clarification. No, I do not think that (a) this or (b) that answers this question. (a) suggests that "OpenBLAS does nearly as well as MKL", which is a strong contradiction to the numbers I observed. According to my numbers, OpenBLAS performs ridiculously worse than MKL. The question is why. (a) and (b) both suggest using MKL_DEBUG_CPU_TYPE=5 in conjunction with MKL to achieve maximum performance. This might be right, but it does neither explain why OpenBLAS is that dead slow. Neither it explains, why even with MKL and MKL_DEBUG_CPU_TYPE=5 the 32-core Threadripper is only 36% faster than the six-year-old 6-core Xeon.

878

asked Jul 07 '20 20:07

theV0ID

2 Answers

As of 2021, Intel unfortunately removed the MKL_DEBUG_CPU_TYPE to prevent people on AMD use the workaround presented in the accepted answer. This means that the workaround no longer works, and AMD users have to either switch to OpenBLAS or keep using MKL.

To use the workaround, follow this method:

Create a conda environment with conda's and NumPy's MKL=2019.
Activate the environment
Set MKL_DEBUG_CPU_TYPE = 5

The commands for the above steps:

conda create -n my_env -c anaconda python numpy mkl=2019.* blas=*=*mkl
conda activate my_env
conda env config vars set MKL_DEBUG_CPU_TYPE=5

And thats it!

answered Sep 21 '22 04:09

AstroTeen

I think this should help:

"The best result in the chart is for the TR 3960x using MKL with the environment var MKL_DEBUG_CPU_TYPE=5. AND it is significantly better than the low optimization code path from MKL alone. AND,OpenBLAS does nearly as well as MKL with MKL_DEBUG_CPU_TYPE=5 set." https://www.pugetsystems.com/labs/hpc/How-To-Use-MKL-with-AMD-Ryzen-and-Threadripper-CPU-s-Effectively-for-Python-Numpy-And-Other-Applications-1637/

How to set up: 'Make the setting permanent by entering MKL_DEBUG_CPU_TYPE=5 into the System Environment Variables. This has several advantages, one of them being that it applies to all instances of Matlab and not just the one opened using the .bat file' https://www.reddit.com/r/matlab/comments/dxn38s/howto_force_matlab_to_use_a_fast_codepath_on_amd/?sort=new

answered Sep 20 '22 04:09

poloniki

Related questions
                            
                                Take screenshot of full page with Selenium Python with chromedriver
                            
                                how to convert list of dict to dict
                            
                                ubuntu 14.04, pip cannot upgrade matplotllib
                            
                                Extracting a URL in Python
                            
                                Change main plot legend label text
                            
                                scrapy: Call a function when a spider quits
                            
                                How to hide Firefox window (Selenium WebDriver)?
                            
                                Formatting timedelta objects [duplicate]
                            
                                Add days to dates in dataframe
                            
                                Disable link to edit object in django's admin (display list only)?
                            
                                Better way to swap elements in a list?
                            
                                Error: Segmentation fault (core dumped)
                            
                                super(type, obj): obj must be an instance or subtype of type
                            
                                A general tree implementation? [closed]
                            
                                The equivalent of a GOTO in python [duplicate]
                            
                                parsing .properties file in Python
                            
                                Measure website load time with Python requests
                            
                                Removing control characters from a string in python
                            
                                Tensorflow Custom TFLite java.lang.NullPointerException: Cannot allocate memory for the interpreter
                            
                                Packaging legacy Fortran in Python. Is it OK to use setuptools and numpy.distutils?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With