 

Workaround for using __name__=='__main__' in Python multiprocessing

As we all know, when running code with multiprocessing in Python we need to protect the entry point with if __name__ == '__main__'.

I understand that this is necessary in some cases to give the subprocesses access to functions defined in the main module, but I do not understand why it is necessary in this case:

file2.py

import numpy as np
from multiprocessing import Pool
class Something(object):
    def get_image(self):
        return np.random.rand(64,64)

    def mp(self):
        image = self.get_image()
        p = Pool(2)
        res1 = p.apply_async(np.sum, (image,))
        res2 = p.apply_async(np.mean, (image,))
        print(res1.get())
        print(res2.get())
        p.close()
        p.join()

main.py

from file2 import Something
s = Something()
s.mp()

All of the functions and imports necessary for Something to work are part of file2.py. Why does the subprocess need to re-run main.py?

I think the __name__ solution is not very nice, as it prevents me from distributing the code of file2.py: I can't make sure that users protect their main module. Isn't there a workaround for Windows? How do packages solve this? (I never encountered any problem with any package despite not protecting my main; are they just not using multiprocessing?)

edit: I know that this is because fork() is not implemented on Windows. I was just asking whether there is a hack to let the interpreter start at file2.py instead of main.py, as I can be sure that file2.py is self-sufficient.

asked Jul 14 '17 by skjerns


3 Answers

As others have mentioned, the spawn start method used on Windows re-imports the code in each new interpreter instance. This import executes your code again in the child process (and without the guard, each child would create its own children, and so on).

A workaround is to pull the multiprocessing script into a separate file and then use subprocess to launch it from the main script.

I pass variables into the script by pickling them into a temporary directory, and I pass the temporary directory path to the subprocess as a command-line argument (parsed with argparse in the worker script).

I then pickle the results into the temporary directory, where the main script retrieves them.
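The handoff is just a pickle round-trip through a scratch directory; sketched here in isolation (the filenames are placeholders, and both "sides" run in one process for illustration):

```python
import os, pickle, shutil, tempfile

# parent side: serialize the arguments into a scratch directory
subprocess_directory = tempfile.mkdtemp()
arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
with open(arguments_file, 'wb') as f:
    pickle.dump(['a.txt', 'b.txt'], f)

# child side: the worker receives the directory path and reads the arguments back
with open(arguments_file, 'rb') as f:
    filenames = pickle.load(f)

shutil.rmtree(subprocess_directory)
print(filenames)
```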

Here is an example file_hasher() function that I wrote:

main_program.py

import os, pickle, shutil, subprocess, sys, tempfile

def file_hasher(filenames):
    # scratch directory used to pass pickled data to and from the subprocess
    subprocess_directory = tempfile.mkdtemp()
    try:
        input_arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
        with open(input_arguments_file, 'wb') as func_inputs:
            pickle.dump(filenames, func_inputs)
        current_path = os.path.dirname(os.path.realpath(__file__))
        hasher_script = os.path.join(current_path, 'file_hasher.py')
        # launch the worker with the same interpreter as the current process
        subprocess.call([sys.executable, hasher_script, subprocess_directory],
                        timeout=60)
        output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
        with open(output_file, 'rb') as func_outputs:
            hashlist = pickle.load(func_outputs)
    finally:
        shutil.rmtree(subprocess_directory)
    return hashlist

file_hasher.py

#! /usr/bin/env python
import argparse, hashlib, os, pickle
from multiprocessing import Pool

def file_hasher(input_file):
    with open(input_file, 'rb') as f:
        data = f.read()
        md5_hash = hashlib.md5(data)
    hashval = md5_hash.hexdigest()
    return hashval

if __name__ == '__main__':
    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('subprocess_directory', type=str)
    subprocess_directory = argument_parser.parse_args().subprocess_directory

    arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
    with open(arguments_file, 'rb') as func_inputs:
        filenames = pickle.load(func_inputs)

    hashlist = []
    with Pool() as p:
        for r in p.imap(file_hasher, filenames):
            hashlist.append(r)

    output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
    with open(output_file, 'wb') as func_outputs:
        pickle.dump(hashlist, func_outputs)

There must be a better way...

answered Oct 15 '22 by Chris Hubley


When using the "spawn" start method, new processes are Python interpreters that are started from scratch. It's not possible for the new Python interpreters in the subprocesses to figure out what modules need to be imported, so they import the main module again, which in turn will import everything else. This means it must be possible to import the main module without any side effects.
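A minimal sketch of a spawn-safe main module: the module body only defines things, and everything with side effects sits under the guard.

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    # under "spawn", children re-import this module; because the guard is
    # here, the re-import only defines square() and creates no new pool
    with Pool(2) as p:
        print(p.map(square, [1, 2, 3]))
```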

If you are on a different platform than Windows, you can use the "fork" start method instead, and you won't have this problem.
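On POSIX systems the start method can also be selected explicitly; a sketch (requesting "fork" raises ValueError on Windows, where it is unavailable):

```python
import multiprocessing as mp

def double(x):
    return 2 * x

if __name__ == '__main__':
    # "fork" copies the parent process in place, so the main module is
    # never re-imported in the children (POSIX only)
    ctx = mp.get_context('fork')
    with ctx.Pool(2) as p:
        print(p.map(double, [0, 1, 2]))
```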

That said, what's wrong with using if __name__ == "__main__":? It has a lot of additional benefits: documentation tools can process your main module, unit testing is easier, and so on, so you should use it in any case.

answered Oct 15 '22 by Sven Marnach


The main module is imported (but with __name__ != '__main__', because Windows is trying to simulate forking-like behavior on a system that doesn't have forking). multiprocessing has no way to know that you didn't do anything important in your main module, so the import is done "just in case" to create an environment similar to the one in your main process. If it didn't do this, all sorts of stuff that happens by side effect in main (e.g. imports, configuration calls with persistent side effects, etc.) might not be properly performed in the child processes.

As such, if they're not protecting their __main__, the code is not multiprocessing safe (nor is it unittest safe, import safe, etc.). The if __name__ == '__main__': protective wrapper should be part of all correct main modules. Go ahead and distribute it, with a note about requiring multiprocessing-safe main module protection.
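Applied to the question, the fix is a two-line change to main.py: move the calls under the guard. Sketched here as a single self-contained file, with a stdlib stand-in for file2's numpy-based Something (the flat list of random floats is illustrative, not the original 64x64 array):

```python
import random
from multiprocessing import Pool

class Something(object):
    def get_image(self):
        # stand-in for np.random.rand(64, 64): a flat list of floats in [0, 1)
        return [random.random() for _ in range(64 * 64)]

    def mp(self):
        image = self.get_image()
        with Pool(2) as p:
            res1 = p.apply_async(sum, (image,))
            res2 = p.apply_async(max, (image,))
            print(res1.get())
            print(res2.get())

if __name__ == '__main__':
    # the guard means a re-import by spawned children only defines the
    # class and never starts another pool
    Something().mp()
```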

answered Oct 15 '22 by ShadowRanger