 

Weird bug in Pandas and Numpy regarding multithreading

Most of NumPy's functions enable multithreading by default.

For example, I work on an 8-core Intel CPU workstation. If I run this script:

```python
import numpy as np

x = np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)
```

the Linux `top` shows 800% CPU usage while it runs. That means NumPy automatically detects that my workstation has 8 cores, and `np.sqrt` automatically uses all 8 cores to accelerate the computation.
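Besides watching `top`, a rough way to observe this from Python itself is to time the loop and sanity-check the result (a sketch only; the actual speedup depends on your BLAS/VML build and core count):

```python
import os
import time

import numpy as np

x = np.random.random(1_000_000)

# Warm up once so allocation/paging doesn't skew the timing.
np.sqrt(x)

start = time.perf_counter()
for _ in range(200):
    np.sqrt(x)
elapsed = time.perf_counter() - start

print(f"cores reported by OS: {os.cpu_count()}")
print(f"200 x np.sqrt on 1e6 elements: {elapsed:.3f} s")

# The numerical result is the same regardless of the threading backend.
assert np.allclose(np.sqrt(x) ** 2, x)
```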

However, I found a weird bug. If I run this script:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10, 10)))
df + df

x = np.random.random(1000000)
for i in range(100000):
    np.sqrt(x)
```

the CPU usage is only 100%! That means that if you add two pandas DataFrames before running any NumPy function, NumPy's automatic multithreading is gone, without any warning. This makes no sense: why would a pandas DataFrame calculation affect NumPy's threading settings? Is it a bug? How can I work around it?


PS:

I dug further using the Linux perf tool.

Running both scripts under perf shows that both involve `libmkl_vml_avx2.so`, while the first script additionally involves `libiomp5.so`, which seems to be related to OpenMP.

And since VML means Intel Vector Math Library, according to the VML docs I guess that at least its elementwise functions (sqrt, sin, cos and so on) are all automatically multithreaded.
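Whether these NumPy calls actually go through VML at all depends on how NumPy was built; a quick way to check which backend your install is linked against:

```python
import numpy as np

# Prints the build configuration, including the BLAS/LAPACK backend.
# On an Anaconda install this typically mentions MKL (so VML can be
# involved); a plain pip wheel usually shows OpenBLAS instead, and
# then the VML threading behavior described here does not apply.
np.show_config()
```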

asked Dec 22 '19 by user15964



1 Answer

Pandas uses numexpr under the hood for some operations, and numexpr sets the maximal number of VML threads to 1 when it is imported:

```python
# The default for VML is 1 thread (see #39)
set_vml_num_threads(1)
```

and it gets imported by pandas when `df+df` is evaluated, in expressions.py:

```python
from pandas.core.computation.check import _NUMEXPR_INSTALLED

if _NUMEXPR_INSTALLED:
    import numexpr as ne
```

However, the Anaconda distribution also uses VML functionality for functions such as sqrt, sin, cos and so on, and once numexpr has set the maximal number of VML threads to 1, NumPy functions no longer use parallelization.
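This import side effect can be reproduced without pandas at all. The sketch below assumes an MKL-backed NumPy with numexpr installed; on other builds (or without numexpr) the two timings should be about equal:

```python
import time

import numpy as np

x = np.random.random(1_000_000)

def bench(n=200):
    """Time n repetitions of np.sqrt over the test array."""
    start = time.perf_counter()
    for _ in range(n):
        np.sqrt(x)
    return time.perf_counter() - start

before = bench()

try:
    import numexpr  # noqa: F401 -- merely importing it runs set_vml_num_threads(1)
except ImportError:
    numexpr = None  # without numexpr, nothing changes

after = bench()

print(f"before importing numexpr: {before:.3f} s")
print(f"after  importing numexpr: {after:.3f} s")
# On an MKL build with several cores, `after` would be expected to be
# noticeably larger than `before`.
```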

The problem can be easily seen in gdb (using your slow script):

```
$ gdb --args python slow.py
(gdb) b mkl_serv_domain_set_num_threads
function "mkl_serv_domain_set_num_threads" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mkl_serv_domain_set_num_threads) pending.
(gdb) run
Thread 1 "python" hit Breakpoint 1, 0x00007fffee65cd70 in mkl_serv_domain_set_num_threads ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0  0x00007fffee65cd70 in mkl_serv_domain_set_num_threads ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1  0x00007fffe978026c in _set_vml_num_threads(_object*, _object*) ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numexpr/interpreter.cpython-37m-x86_64-linux-gnu.so
#2  0x00005555556cd660 in _PyMethodDef_RawFastCallKeywords ()
   at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:694
...
(gdb) print $rdi
$1 = 1
```

i.e. we can see that numexpr sets the number of threads to 1, which is later used when the VML sqrt function is called:

```
(gdb) b mkl_serv_domain_get_max_threads
Breakpoint 2 at 0x7fffee65a900
(gdb) c
Continuing.

Thread 1 "python" hit Breakpoint 2, 0x00007fffee65a900 in mkl_serv_domain_get_max_threads ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0  0x00007fffee65a900 in mkl_serv_domain_get_max_threads ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1  0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#2  0x00007fffedf78563 in vdSqrt ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_lp64.so
#3  0x00007ffff5ac04ac in trivial_two_operand_loop ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so
```

So we can see that NumPy uses VML's implementation of vdSqrt, which calls mkl_vml_serv_threader_d_1i_1o to decide whether the calculation should be done in parallel, and that function looks up the number of threads:

```
(gdb) fin
Run till exit from #0  0x00007fffee65a900 in mkl_serv_domain_get_max_threads ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o ()
   from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) print $rax
$2 = 1
```

The register %rax holds the maximal number of threads, and it is 1.

Now we can use numexpr to increase the number of VML threads, i.e.:

```python
import numpy as np
import numexpr as ne
import pandas as pd

df = pd.DataFrame(np.random.random((10, 10)))
df + df

# HERE: reset the number of VML threads
ne.set_vml_num_threads(8)

x = np.random.random(1000000)
for i in range(10000):
    np.sqrt(x)     # now in parallel
```

Now multiple cores are utilized!
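If you would rather not leave the VML thread count permanently raised after pandas has lowered it, one option is to make the change explicit and scoped with a small context manager. This is only a sketch: it assumes numexpr is installed, and `n_threads` is an illustrative parameter name; as far as I can tell numexpr offers no getter for the current VML thread count, so on exit it restores numexpr's own import-time default of 1.

```python
import contextlib
import os

@contextlib.contextmanager
def vml_threads(n_threads):
    """Raise the VML thread count inside the block, then restore
    numexpr's import-time default (1) on exit."""
    try:
        import numexpr as ne
    except ImportError:
        # Without numexpr there is nothing to reset -- NumPy keeps
        # whatever threading its backend chose on its own.
        yield
        return
    ne.set_vml_num_threads(n_threads)
    try:
        yield
    finally:
        ne.set_vml_num_threads(1)  # numexpr's own default (see #39)

# Usage sketch:
# with vml_threads(os.cpu_count()):
#     np.sqrt(x)   # parallel again inside the block
```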

answered Sep 21 '22 by ead