My code looks something like this:
from glob import glob
from joblib import Parallel, delayed

# prediction model - 10s of megabytes on disk
LARGE_MODEL = load_model('path/to/model')
file_paths = glob('path/to/files/*')

def do_thing(file_path):
    pred = LARGE_MODEL.predict(load_image(file_path))
    return pred

Parallel(n_jobs=2)(delayed(do_thing)(fp) for fp in file_paths)
My question is whether LARGE_MODEL will be pickled/unpickled with each iteration of the loop. And if so, how can I make sure each worker caches it instead (if that's possible)?
joblib is basically a wrapper library that uses other libraries for running code in parallel. It also lets us choose between multi-threading and multi-processing. joblib is ideal for a situation where you have loops and each iteration through the loop calls some function that can take time to complete.
The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax. Under Windows, the use of multiprocessing.Pool requires protecting the main loop of code (the if __name__ == '__main__' guard) to avoid recursive spawning of subprocesses when using joblib.
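A minimal sketch of both points, where square is just a stand-in for a slow per-item function:

import numpy as np
from joblib import Parallel, delayed

def square(x):
    # stand-in for a slow, per-item function
    return x * x

# delayed(square)(3) does not call square; it just builds the tuple
# (square, (3,), {}) that a worker will execute later.
task = delayed(square)(3)

if __name__ == '__main__':
    # protecting the entry point avoids recursive spawning of workers on Windows
    results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(10))
    print(results)  # [0, 1, 4, 9, ...]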
Joblib is a set of tools to provide lightweight pipelining in Python. In particular: transparent disk-caching of functions and lazy re-evaluation (memoize pattern), and easy simple parallel computing.
TLDR
The parent process pickles the large model once. That can be made more performant by ensuring the large model is a numpy array backed by a memmapped file. Workers can then load_temporary_memmap it much faster than loading it from disk.
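As a sketch of what that can look like, continuing from the question's snippet (LARGE_MODEL.weights and predict_from_weights are hypothetical stand-ins; the point is that the array is dumped once and each worker opens it as a read-only memory map):

import numpy as np
from glob import glob
from joblib import Parallel, delayed, dump, load

# Hypothetical: expose the model parameters as a plain numpy array and dump them once.
weights = np.asarray(LARGE_MODEL.weights)
dump(weights, '/tmp/model_weights.joblib')

def do_thing(file_path):
    # Each worker opens the dumped array as a read-only memmap instead of
    # receiving a multi-megabyte pickle from the parent.
    w = load('/tmp/model_weights.joblib', mmap_mode='r')
    return predict_from_weights(w, load_image(file_path))  # hypothetical helper

results = Parallel(n_jobs=2)(delayed(do_thing)(fp) for fp in glob('path/to/files/*'))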
Your job is parallelized and likely to be using joblib._parallel_backends.LokyBackend.
In joblib.parallel.Parallel.__call__, joblib tries to initialize the backend to use LokyBackend when n_jobs is set to a count greater than 1.
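The backend can also be requested explicitly instead of relying on that default; a minimal sketch:

from joblib import Parallel, delayed, parallel_backend

def work(x):
    return x + 1

# Explicitly select the process-based loky backend ('threading' is the other common choice).
with parallel_backend('loky', n_jobs=2):
    results = Parallel()(delayed(work)(i) for i in range(4))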
LokyBackend uses a shared temporary folder for the same Parallel object. This is relevant for reducers that modify the default pickling behavior.
Now, LokyBackend configures a MemmappingExecutor that shares this folder with the reducers.
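That shared folder and the memmapping threshold can be tuned directly on Parallel; a sketch (the temp_folder path is just an example):

import numpy as np
from joblib import Parallel, delayed

big = np.zeros((2000, 2000))  # ~32 MB, well above the default 1 MB memmapping threshold

Parallel(
    n_jobs=2,
    temp_folder='/tmp/joblib_memmaps',  # where the shared memmapped files are written
    max_nbytes='1M',                    # arrays larger than this get memmapped (the default)
    mmap_mode='r',                      # workers receive read-only memory maps
)(delayed(np.sum)(big) for _ in range(4))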
If you have numpy installed and your model is a clean numpy array, you are guaranteed to have it pickled once as a memmapped file using the ArrayMemmapForwardReducer and passed from parent to child processes.
Otherwise it is pickled using the default pickling as a bytes object.
You can see how your model is pickled in the parent process by reading the debug logs from joblib.
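For example (assuming, as in the joblib versions I have looked at, that a verbose value above 50 also surfaces the reducer messages such as "Memmapping ... to new file ..." or "Pickling array ..."):

import numpy as np
from joblib import Parallel, delayed

big = np.ones(10_000_000)  # large enough to be memmapped rather than pickled as bytes

# With a high verbose value, the parent process logs how each argument is
# transferred to the workers (memmapped file vs. plain pickle).
Parallel(n_jobs=2, verbose=60)(delayed(np.sum)(big) for _ in range(2))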
Each worker 'unpickles' the large model, so there is really no point in caching it there.
You can only improve the source from which the pickled large model is loaded in the workers, by backing your model with a memory-mapped file as described above.