Most of NumPy's functions enable multithreading by default.
For example, I work on an 8-core Intel CPU workstation. If I run this script:

```python
import numpy as np

x = np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)
```
the Linux `top` command shows 800% CPU usage while it runs, which means NumPy automatically detects that my workstation has 8 cores, and `np.sqrt` automatically uses all 8 cores to accelerate the computation.
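Besides watching `top`, one way to check which native thread pools NumPy's backend has created is the third-party `threadpoolctl` package (my suggestion, not part of the original question; install it with `pip install threadpoolctl`). A minimal sketch:

```python
# Inspect the native thread pools (BLAS / OpenMP) loaded into this process.
# On an MKL-based NumPy build you would expect to see entries whose
# num_threads matches your core count (e.g. 8 on this workstation).
import numpy as np  # load NumPy first so its backend libraries are in the process
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])
```

Note that `threadpool_info()` reports the BLAS and OpenMP pool sizes; it is a diagnostic, not a way to change them.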
However, I found some weird behavior. If I run this script:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10, 10)))
df + df
x = np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)
```

the CPU usage is 100%! That means that if you add two pandas DataFrames before running any NumPy function, NumPy's auto-multithreading feature is gone, without any warning. This is absolutely not reasonable: why would a pandas DataFrame calculation affect NumPy's threading setting? Is it a bug? How can I work around this?
PS: I dug further using the Linux `perf` tool. Running the two scripts gives different profiles (perf output omitted): both scripts involve `libmkl_vml_avx2.so`, while the first script additionally involves `libiomp5.so`, which seems to be related to OpenMP. And since "vml" means Intel Vector Math Library, according to the VML docs I guess at least its element-wise math functions (such as `sqrt`) are all automatically multithreaded.
Pandas uses `numexpr` under the hood to calculate some operations, and `numexpr` sets the maximal number of threads for VML to 1 when it is imported:

```python
# The default for VML is 1 thread (see #39)
set_vml_num_threads(1)
```

and it gets imported by pandas when `df+df` is evaluated in `expressions.py`:

```python
from pandas.core.computation.check import _NUMEXPR_INSTALLED

if _NUMEXPR_INSTALLED:
    import numexpr as ne
```
However, the Anaconda distribution also uses VML functionality for functions such as `sqrt`, `sin`, `cos` and so on, and once `numexpr` has set the maximal number of VML threads to 1, those numpy functions no longer use parallelization.
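A minimal way to see this import-time side effect (a sketch assuming an MKL-based NumPy build, e.g. from Anaconda, with `numexpr` installed; on non-MKL builds the timings will not differ) is to time `np.sqrt` before and after importing `numexpr`:

```python
import time
import numpy as np

x = np.random.random(1_000_000)

def bench(label):
    # Time a batch of np.sqrt calls; with MKL's VML these are
    # parallelized unless the VML thread count has been capped.
    t0 = time.perf_counter()
    for _ in range(200):
        np.sqrt(x)
    elapsed = time.perf_counter() - t0
    print(f"{label}: {elapsed:.3f}s")
    return elapsed

bench("before importing numexpr")  # on MKL builds: may use all cores
import numexpr                     # caps VML threads to 1 at import time
bench("after importing numexpr")   # on MKL builds: may now run single-threaded
```

Note that pip's default OpenBLAS wheels of NumPy do not route `np.sqrt` through VML, so the slowdown only shows up on MKL builds.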
The problem can easily be seen in gdb (using your slow script):

```
>>> gdb --args python slow.py
(gdb) b mkl_serv_domain_set_num_threads
function "mkl_serv_domain_set_num_threads" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mkl_serv_domain_set_num_threads) pending.
(gdb) run
Thread 1 "python" hit Breakpoint 1, 0x00007fffee65cd70 in mkl_serv_domain_set_num_threads ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0  0x00007fffee65cd70 in mkl_serv_domain_set_num_threads ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1  0x00007fffe978026c in _set_vml_num_threads(_object*, _object*) ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numexpr/interpreter.cpython-37m-x86_64-linux-gnu.so
#2  0x00005555556cd660 in _PyMethodDef_RawFastCallKeywords ()
   at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:694
...
(gdb) print $rdi
$1 = 1
```
i.e. we can see that `numexpr` sets the number of threads to 1, which is later used when the VML sqrt function is called:

```
(gdb) b mkl_serv_domain_get_max_threads
Breakpoint 2 at 0x7fffee65a900
(gdb) c
Continuing.
Thread 1 "python" hit Breakpoint 2, 0x00007fffee65a900 in mkl_serv_domain_get_max_threads ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0  0x00007fffee65a900 in mkl_serv_domain_get_max_threads ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1  0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#2  0x00007fffedf78563 in vdSqrt ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_lp64.so
#3  0x00007ffff5ac04ac in trivial_two_operand_loop ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so
```
So we can see that numpy uses VML's implementation of `vdSqrt`, which utilizes `mkl_vml_serv_threader_d_1i_1o` to decide whether the calculation should be done in parallel, and that it looks up the number of threads:

```
(gdb) fin
Run till exit from #0  0x00007fffee65a900 in mkl_serv_domain_get_max_threads ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) print $rax
$2 = 1
```

the register `%rax` has the maximal number of threads, and it is 1.
Now we can use `numexpr` to increase the number of VML threads, i.e.:

```python
import numpy as np
import numexpr as ne
import pandas as pd

df = pd.DataFrame(np.random.random((10, 10)))
df + df

# HERE: reset the number of vml-threads
ne.set_vml_num_threads(8)

x = np.random.random(1000000)
for i in range(10000):
    np.sqrt(x)  # now in parallel
```
Now multiple cores are utilized!
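To avoid hardcoding the core count (8 above), one variant of the workaround is a small helper around the same `ne.set_vml_num_threads` call; the helper name here is my own, not a numexpr API:

```python
import os
import numexpr as ne

def restore_vml_threads(n=None):
    """Undo numexpr's import-time cap on VML threads.

    numexpr limits MKL's VML domain to 1 thread when it is imported
    (e.g. indirectly via pandas); this resets it, defaulting to all
    cores the OS reports. On non-MKL builds the call is a no-op.
    """
    nthreads = n or os.cpu_count()
    ne.set_vml_num_threads(nthreads)
    return nthreads

restore_vml_threads()  # call once, after pandas/numexpr have been imported
```

Calling it right after your pandas imports keeps the rest of the script unchanged.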