ThreadPoolExecutor, ProcessPoolExecutor and global variables

I am new to parallelization in general and concurrent.futures in particular. I want to benchmark my script and compare the differences between using threads and processes, but I found that I couldn't even get that running because when using ProcessPoolExecutor I cannot use my global variables.

The following code will output Hello, as I expect, but when you replace ThreadPoolExecutor with ProcessPoolExecutor, it will output None.

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

greeting = None

def process():
    print(greeting)

    return None


def main():
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(process)

    return None


def init():
    global greeting
    greeting = 'Hello'

    return None

if __name__ == '__main__':
    init()
    main()

I don't understand why this is the case. In my real program, init is used to set the global variables to CLI arguments, and there are a lot of them. Hence, passing them as arguments does not seem recommended. So how do I pass those global variables to each process/thread correctly?

I know that I can change things around, which will work, but I don't understand why. E.g. the following works for both Executors, but it also means that the globals initialisation has to happen for every instance.

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

greeting = None

def init():
    global greeting
    greeting = 'Hello'

    return None


def main():
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(process)

    return None

def process():
    init()
    print(greeting)

    return None

if __name__ == '__main__':
    main()

So my main question is: what is actually happening? Why does this code work with threads and not with processes? And how do I correctly pass globals that have already been set to each process/thread without having to re-initialise them for every instance?

(Side note: because I have read that concurrent.futures might behave differently on Windows, I have to note that I am running Python 3.6 on Windows 10 64 bit.)

asked Jun 15 '18 08:06 by Bram Vanroy


1 Answer

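I'm not sure of all the limitations of this approach, but you can pass objects from your main process/thread to the worker function as arguments. With ProcessPoolExecutor the arguments must be picklable, since they are serialized and sent to the worker process. This would also help you get rid of the reliance on global vars: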

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def process(opts):
    opts["process"] = "got here"
    print("In process():", opts)

    return None


def main(opts):
    opts["main"] = "got here"
    executor_cls = [ProcessPoolExecutor, ThreadPoolExecutor][1]  # 0: processes, 1: threads
    with executor_cls(max_workers=1) as executor:
        executor.submit(process, opts)

    return None


def init(opts):                         # Gather CLI opts and populate dict
    opts["init"] = "got here"

    return None


if __name__ == '__main__':
    cli_opts = {"__main__": "got here"} # Initialize dict
    init(cli_opts)                      # Populate dict
    main(cli_opts)                      # Use dict

Works with both executor types.
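
On the "why" part of the question: threads run inside the parent process, so they see the greeting that init() has already set. On Windows, ProcessPoolExecutor starts each worker by spawning a fresh interpreter that re-imports the module. During that import __name__ is not '__main__', so init() never runs in the worker and greeting keeps its module-level value of None.

If you would rather keep the globals, here is a minimal sketch. It assumes Python 3.7 or later, where both executors accept initializer and initargs (the Python 3.6 mentioned in the question does not have them). The names init_worker and run_with are made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

greeting = None  # module-level global, set once per worker by init_worker()


def init_worker(value):
    """Runs once in each worker thread/process when that worker starts."""
    global greeting
    greeting = value


def process():
    # Reads the global that init_worker() set in this worker.
    return greeting


def run_with(executor_cls, value):
    """Submit one task to the given executor type and return its result."""
    with executor_cls(max_workers=1,
                      initializer=init_worker,  # requires Python 3.7+
                      initargs=(value,)) as executor:
        return executor.submit(process).result()


if __name__ == '__main__':
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        print(executor_cls.__name__, run_with(executor_cls, 'Hello'))
```

The initializer runs once per worker rather than once per task, and initargs must be picklable when processes are used. Run as a script, this should report Hello for both executor types.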

Edit: Even though it sounds like it won't be a problem for your use case, I'll point out that with ProcessPoolExecutor, the opts dict you get inside process will be an independent (pickled and unpickled) copy, so mutations to it will not be visible in other processes, nor will they be visible once you return to the __main__ block. ThreadPoolExecutor, on the other hand, will share the same dict object between threads.
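
To make the copy-versus-share difference concrete, here is a small sketch. It does not start a real worker process. Instead it imitates what a process pool does to arguments with a pickle round trip, which is an assumption made to keep the example simple. The helper names run_in_thread and run_on_pickled_copy are hypothetical.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def process(opts):
    opts["process"] = "got here"  # mutate whatever object we were handed
    return None


def run_in_thread(opts):
    # A thread receives a reference to the very same dict object.
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(process, opts).result()
    return opts


def run_on_pickled_copy(opts):
    # Stand-in for a process pool: arguments are pickled in the parent
    # and unpickled in the worker, so the worker mutates a separate copy.
    worker_copy = pickle.loads(pickle.dumps(opts))
    process(worker_copy)
    return opts


if __name__ == '__main__':
    print(run_in_thread({"main": "got here"}))        # includes the "process" key
    print(run_on_pickled_copy({"main": "got here"}))  # does not include it
```

If the worker needs to hand something back from a real process, return it from the function and read it with the future's result() method instead of relying on mutation.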

answered Oct 22 '22 09:10 by jedwards