According to https://github.com/joblib/joblib/issues/180, and Is there a safe way to create a subprocess from a thread in python? the Python multiprocessing module does not allow use from within threads. Is this true?
My understanding is that its fine to fork from threads, as long as you aren't holding a threading.Lock when you do so (in the current thread? anywhere in the process?). However, Python's documentation is silent on whether threading.Lock objects are safely shared after a fork.
There's also this: locks shared from the logging module causes issues with fork. https://bugs.python.org/issue6721
I'm not sure how this issue arises. It sounds like the state of any locks in the process are copied into the child process when the current thread forks (which seems like a design error and certain to deadlock). If so, does using multiprocessing really provide any protection against this (since I'm free to create my multiprocessing.Pool after threading.Lock is created and entered by other threads, and after threads have started that using the not-fork-safe logging module) -- the multiprocessing module docs are also silent about whether multiprocessing.Pools should be allocated before Locks.
Does replacing threading.Lock with multiprocessing.Lock everywhere avoid this issue and allow us to safely combine threads and forks?
Yes, it is. From https://docs.python.org/3/library/multiprocessing.html#exchanging-objects-between-processes: Queues are thread and process safe.
Both multithreading and multiprocessing allow Python code to run concurrently. Only multiprocessing will allow your code to be truly parallel. However, if your code is IO-heavy (like HTTP requests), then multithreading will still probably speed up your code.
Python is not by its self thread safe. But there are moves to change this: NoGil, etc. Removing the GIL does not make functions thread-safe.
If your program is IO-bound, both multithreading and multiprocessing in Python will work smoothly. However, If the code is CPU-bound and your machine has multiple cores, multiprocessing would be a better choice.
It sounds like the state of any locks in the process are copied into the child process when the current thread forks (which seems like a design error and certain to deadlock).
It is not a design error, rather, fork()
predates single-process multithreading. The state of all locks is copied into the child process because they're just objects in memory; the entire address-space of the process is copied as is in fork. There are only bad alternatives: either copy all threads over fork, or deny forking in multithreaded application.
Therefore, fork()
ing in a multithreading program was never the safe thing to do, unless then followed by execve()
or exit()
in the child process.
Does replacing threading.Lock with multiprocessing.Lock everywhere avoid this issue and allow us to safely combine threads and forks?
No. Nothing makes it safe to combine threads and forks, it cannot be done.
The problem is that when you have multiple threads in a process, after fork()
system call you cannot continue safely running the program in POSIX systems.
For example, Linux manuals fork(2)
:
- After a
fork(2)
in a multithreaded program, the child can safely call only async-signal-safe functions (seesignal(7)
) until such time as it callsexecve(2)
.
I.e. it is OK to fork()
in a multithreaded program and then only call async-signal-safe C functions (which is a rather limited subset of C functions), until the child process has been replaced with another executable!
Unsafe C function calls in child processes are then for example
malloc
for dynamic memory allocation<stdio.h>
functions for formatted inputpthread_*
functions required for thread state handling, including creation of new threads...Thus there is very little what the child process can actually safely do. Unfortunately CPython core developers have been downplaying the problems caused by this. Even now the documentation says:
Note that safely forking a multithreaded process is problematic.
Quite an euphemism for "impossible".
It is safe to use multiprocessing from a Python process that has multiple threads of control provided that you're not using the fork
start method; in Python 3.4+ it is now possible to change the start method. In previous Python versions including all of Python 2, the POSIX systems always behaved as if fork
was specified as the start method; this would result in undefined behaviour.
The problems are not limited to just threading.Lock
objects but all locks held by the C standard library, the C extensions etc. What is worse that most of the time people would say "it works for me"... until it stops from working.
There were even a cases where a seemingly single-threading Python program is actually multithreading in MacOS X, causing failures and deadlocks upon using multiprocessing.
Another problem is that all opened file handles, their use, shared sockets might behave oddly in programs that forks, but that would be the case even in single-threaded programs.
TL;DR: using multiprocessing
in multithreaded programs, with C extensions, with opened sockets etc:
fork
, If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With