Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to troubleshoot/bypass locking problem which appears to be GIL related

An obscure lock-up between two threads seems to be related to the Global Interpreter Lock or some other "behind-the-scenes-lock", and I am at a loss how to continue troubleshooting. Any tips how to eliminate the lock-up would be appreciated.

The issue reproduces (erratically and somewhat random) in a larger set of code. The code is strictly python. Python version is 2.6.5 (on Linux). Hours of troubleshooting has reduced the problem when the lock-up occurs to the following:

  1. The program has only two running threads
  2. The threads concurrently call a method protected by a single threading.RLock
  3. Thread 1 has acquired the lock [plus some other locks] via acquire()
  4. Thread 2 has called acquire() and is confirmed to be waiting for the lock
  5. Thread 1 is able to do print to the console with print(), however it is getting locked up by a simple non-blocking library call

The offensive call in #5 is the function unicode.encode, which should be non-blocking. The following code in Thread 1 at the position where the thread locks will (as expected) print 'A' and 'B':

print('A')
print('B')

However, the following will just print 'A' and block the thread:

print('A')
u'hello'.encode('utf8') # This dummy (non-blocking) call locks up Thread 1
print('B')

This makes no sense to me at all. There is no logical dead-lock condition existing between the two threads. Thread 1 is being blocked by a non-blocking library call that is not in any way interfering with Thread 2, which is just silently waiting to acquire an RLock. The only cause I can think of for Thread 1 being blocked is that it is waiting for the GIL.

Any thoughts how to troubleshoot this further, or any mechanisms to somehow control or manipulate GIL operation as a workaround?

Edit: Some additional information in response to samplebias (and thank you for the reply). I had problems getting a trace as the issue seems to be very sensitive to anything that may disturb the timing between the two threads. However running strace with only the -f option, after a few iterations I was able to get a trace.

Thread 1 contains these three debug statements which should print two lines 'CHECK_IN' and 'CHECK_TEST' to the console:

print('CHECK IN')#DEBUG
u'hello'.encode('utf8')
print('CHECK TEST')#DEBUG

Here is the last page of the strace:

8605  mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb753d000
8605  mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb6d3c000
8605  mprotect(0xb6d3c000, 4096, PROT_NONE) = 0
8605  clone(child_stack=0xb753c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb753cbd8, {entry_number:6, base_addr:0xb753cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb753cbd8) = 8606
8606  set_robust_list(0xb753cbe0, 0xc <unfinished ...>
8605  futex(0xa239138, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... set_robust_list resumed> )   = 0
8606  futex(0xa239138, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  gettimeofday( <unfinished ...>
8605  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... gettimeofday resumed> {1301528807, 326496}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605  gettimeofday( <unfinished ...>
8606  <... futex resumed> )             = 0
8605  <... gettimeofday resumed> {1301528807, 326821}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8605  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8606  <... futex resumed> )             = -1 EAGAIN (Resource temporarily unavailable)
8605  <... futex resumed> )             = 0
8606  gettimeofday( <unfinished ...>
8605  futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... gettimeofday resumed> {1301528807, 326908}, NULL) = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605  <... futex resumed> )             = 0
8606  futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605  futex(0xa1b0d70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606  <... futex resumed> )             = 0
8606  stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
8606  fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
8606  mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6d3b000
8606  write(1, "CHECK IN\n", 9)         = 9
8606  futex(0xa115270, FUTEX_WAIT_PRIVATE, 0, NULL

And the output of the three lines of code is just the following, before the program locks up:

CHECK IN

So the strace shows how Thread 1 (#8606) writes the 'CHECK_IN' string, and when reaching the unicode.encode call enters a waiting state which never returns.

By the way I am making a couple future imports in all modules to keep with some newer python conventions ...

from __future__ import print_function, unicode_literals

... but I cannot see they should make any difference - especially as the u'hello' string is called out explicitly as a unicode string.

like image 978
CptJeanLuc Avatar asked Mar 30 '11 15:03

CptJeanLuc


People also ask

What is a global interpreter lock GIL and why is it an issue?

The Python Global Interpreter Lock or GIL, in simple words, is a mutex (or a lock) that allows only one thread to hold the control of the Python interpreter. This means that only one thread can be in a state of execution at any point in time.

Why does the GIL exist?

In other words, the GIL is prevents corruption of state. Python programs should never produce a segmentation fault, because only memory safe operations are permitted. The GIL extends this assurance to multi-threaded programs.

What is an alternative to GIL?

ParallelPython: if you really need to scale, ParallelPython provides a mechanism for parallel execution of python code for multiple cores and clusters. Use a different Python implementation (both Jython and IronPython run without a GIL and the PyPy STM branch may also work for your use case).


1 Answers

I can find nothing in the Python source which would cause unicode.encode() to block, and the dummy program I wrote to try to reproduce this runs as expected. You mentioned that Thread 1 has acquired more than 1 lock - have you eliminated those locks as the source of the lock-up?

Does the test case below show the same lock-up in your environment?

import time
import threading

def worker(tid):
    _lock.acquire()
    if not tid:
        # wait for rest of threads to enter acquire
        time.sleep(0.5)    
    print('%d: A' % tid)
    u'hello'.encode('utf-8')
    print('%d: B' % tid)
    _lock.release()

def start(tid):
    th = threading.Thread(target=worker, args=(tid,))
    th.start()
    return th

_num = 2
_lock = threading.RLock()
workers = [start(n) for n in range(_num)]
while all(w.isAlive() for w in workers):
    time.sleep(1)

Output:

0: A
0: B
1: A
1: B

You can also run strace on your program to work out where the process is blocked. For example with the above script:

% strace -fTr -o trace.out python lockup.py

The -o trace.out flag tells strace to write the output to a file. You can omit that and strace will print to stderr.

The contents of trace.out should show you all of the system calls made by the program, with each line prefixed with thread id and the relative time between system calls. The end of the line will have the time spent inside that system call. I've annotated the last few syscalls with the corresponding Python code:

# thread 0 time.sleep(0.5) completes
24778      0.500124 <... select resumed> ) = 0 (Timeout) <0.500599>
# preparing to print()
24778      0.000071 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000017>
24778      0.000058 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8fe90a6000 <0.000018>
# print("0: A\n")..
24778      0.000079 write(1, "0: A\n", 5) = 5 <0.000023>
24778      0.000106 write(1, "0: B\n", 5) = 5 <0.000056>
# thread 0 _lock.release()
24778      0.000114 futex(0xe0f3c0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000024>
24778      0.000108 madvise(0x7f8fe7266000, 8368128, MADV_DONTNEED) = 0 <0.000030>
# thread 0 exit
24778      0.000072 _exit(0)            = ?
# thread 1 _lock.acquire()
24779      0.000050 <... futex resumed> ) = 0 <0.500774>
# thread 1 print("1: A\n") and so on..
24779      0.000052 write(1, "1: A\n", 5) = 5 <0.000026>
24779      0.000086 write(1, "1: B\n", 5) = 5 <0.000026>
24779      0.000099 madvise(0x7f8fe6a65000, 8368128, MADV_DONTNEED) = 0 <0.000024>
24779      0.000064 _exit(0)            = ?
24777      0.499956 <... select resumed> ) = 0 (Timeout) <1.001138>
24777      0.000132 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f8fe8c7c8f0}, {0x4d9a90, [], SA_RESTORER, 0x7f8fe8c7c8f0}, 8) = 0 <0.000025>
# main thread process exit
24777      0.002349 exit_group(0)       = ?
like image 95
samplebias Avatar answered Oct 18 '22 22:10

samplebias