I developed a Python program that does heavy numerical calculations. I run it on a Linux machine with 32 Xeon CPUs, 64 GB RAM, and Ubuntu 14.04 64-bit. I launch multiple Python instances in parallel, each with different model parameters, so that I can use multiple processes without having to worry about the global interpreter lock (GIL). When I monitor the CPU utilization using htop, I see that all cores are used, but most of the time by the kernel. Generally, the kernel time is more than twice the user time. I'm afraid there is a lot of overhead going on at the system level, but I can't find the cause of it.
How would one reduce the high kernel CPU usage?
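For reference, the way I launch the instances looks roughly like the sketch below (simplified; run_model.py and the parameter grid are placeholders, not my actual model code):

import itertools
import subprocess

# One independent python3 process per parameter combination; no threads,
# so the GIL should not be an issue.
param_grid = itertools.product([0.1, 0.5, 1.0], [10, 100, 1000])
procs = [subprocess.Popen(["python3", "run_model.py", str(a), str(b)])
         for a, b in param_grid]
for p in procs:
    p.wait()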
Here are some observations I made:

- I tried sys.setcheckinterval(10000) (for Python 2) and sys.setswitchinterval(10) (for Python 3), but neither of these helped.
- I tried schedtool -B PID, but this didn't help either.

Edit:

Here is a screenshot of htop:

[htop screenshot omitted]

I also ran perf record -a -g, and this is the report produced by perf report -g graph:
Samples: 1M of event 'cycles', Event count (approx.): 1114297095227
-  95.25%  python3  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 95.01% extract_buf
           extract_entropy_user
           urandom_read
           vfs_read
           sys_read
           system_call_fastpath
           __GI___libc_read
-   2.06%  python3  [kernel.kallsyms]   [k] sha_transform
   - sha_transform
      - 2.06% extract_buf
           extract_entropy_user
           urandom_read
           vfs_read
           sys_read
           system_call_fastpath
           __GI___libc_read
-   0.74%  python3  [kernel.kallsyms]   [k] _mix_pool_bytes
   - _mix_pool_bytes
      - 0.74% __mix_pool_bytes
           extract_buf
           extract_entropy_user
           urandom_read
           vfs_read
           sys_read
           system_call_fastpath
           __GI___libc_read
    0.44%  python3  [kernel.kallsyms]                            [k] extract_buf
    0.15%  python3  python3.4                                    [.] 0x000000000004b055
    0.10%  python3  [kernel.kallsyms]                            [k] memset
    0.09%  python3  [kernel.kallsyms]                            [k] copy_user_generic_string
    0.07%  python3  multiarray.cpython-34m-x86_64-linux-gnu.so   [.] 0x00000000000b4134
    0.06%  python3  [kernel.kallsyms]                            [k] _raw_spin_unlock_irqrestore
    0.06%  python3  python3.4                                    [.] PyEval_EvalFrameEx
It seems as if most of the time is spent calling _raw_spin_lock_irqsave. I have no idea what this means, though.
If the problem is in the kernel, you should narrow it down using a profiler such as OProfile or perf. That is, run perf record -a -g and then read the profiling data saved in perf.data with perf report. See also: linux perf: how to interpret and find hotspots.
In your case, the high kernel CPU usage is caused by contention for /dev/urandom: only one thread may read from it at a time, but multiple Python processes are doing so. The Python random module uses it only for initialization (seeding). That is:
$ strace python -c 'import random;
while True:
    random.random()'
open("/dev/urandom", O_RDONLY)      = 4
read(4, "\16\36\366\36}"..., 2500)  = 2500
close(4)                            <--- /dev/urandom is closed
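To make the difference concrete (a minimal sketch, not taken from the question's code): the module-level random functions only touch /dev/urandom once, at seeding time, whereas os.urandom and SystemRandom go to /dev/urandom on every call:

import os
import random

x = random.random()        # Mersenne Twister, seeded once from /dev/urandom
sr = random.SystemRandom()
y = sr.random()            # reads /dev/urandom on every call
b = os.urandom(16)         # likewise one /dev/urandom read per call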
You may also explicitly read from /dev/urandom by using os.urandom or the SystemRandom class, so check the parts of your code that deal with random numbers.
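If SystemRandom or os.urandom does show up in your hot path, one possible workaround (a sketch, assuming the simulation does not need cryptographic-quality randomness and that NumPy is available, as the perf output suggests) is to seed an ordinary pseudo-random generator once per process and draw everything from it:

import os
import random

import numpy as np

# One /dev/urandom read per process at start-up; all subsequent draws are
# pure user-space computation.
seed = int.from_bytes(os.urandom(4), "little")
rng = np.random.RandomState(seed)       # or random.Random(seed) without NumPy

samples = rng.standard_normal(1000000)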