
How does joblib.Parallel deal with global variables?

My code looks something like this:

from glob import glob
from joblib import Parallel, delayed

# prediction model - 10s of megabytes on disk
# (load_model and load_image are my own helpers)
LARGE_MODEL = load_model('path/to/model')

file_paths = glob('path/to/files/*')

def do_thing(file_path):
    pred = LARGE_MODEL.predict(load_image(file_path))
    return pred

Parallel(n_jobs=2)(delayed(do_thing)(fp) for fp in file_paths)

My question is whether LARGE_MODEL will be pickled/unpickled with each iteration of the loop. And if so, how can I make sure each worker caches it instead (if that's possible)?

Alexander Soare asked Nov 04 '20 12:11

People also ask

Does joblib parallel preserve order?

TL;DR - it preserves order for both backends: results are returned in the same order the inputs were submitted.
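A minimal sketch of this (using `sqrt` as a stand-in for any per-item function) shows both the process-based `loky` backend and the `threading` backend returning results in input order:

```python
from math import sqrt
from joblib import Parallel, delayed

# Results come back in the same order as the inputs, for both backends.
res_loky = Parallel(n_jobs=2, backend="loky")(delayed(sqrt)(i) for i in range(10))
res_threads = Parallel(n_jobs=2, backend="threading")(delayed(sqrt)(i) for i in range(10))
```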

What is joblib parallel?

joblib is basically a wrapper library that uses other libraries for running code in parallel. It also lets us choose between multi-threading and multi-processing. joblib is ideal for situations where you have a loop and each iteration through the loop calls some function that can take time to complete.

What does joblib delayed do?

The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax. Under Windows, the use of multiprocessing.Pool requires protecting the main loop of code (with an `if __name__ == '__main__'` guard) to avoid recursive spawning of subprocesses when using joblib.
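That tuple structure is easy to see directly (a minimal sketch; `add` is a made-up example function):

```python
from joblib import delayed

def add(a, b=0):
    return a + b

# delayed(add)(1, b=2) does not call add; it returns a (function, args, kwargs) tuple.
func, args, kwargs = delayed(add)(1, b=2)
result = func(*args, **kwargs)  # calling it later performs the actual work
```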

Why is joblib used?

Joblib is a set of tools to provide lightweight pipelining in Python. In particular:

- transparent disk-caching of functions and lazy re-evaluation (memoize pattern)
- easy simple parallel computing
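The disk-caching side is exposed through the `joblib.Memory` class (a minimal sketch; the temporary cache directory and the `square` function are illustrative):

```python
import tempfile
from joblib import Memory

memory = Memory(tempfile.mkdtemp(), verbose=0)
calls = []

@memory.cache
def square(x):
    calls.append(x)  # record real executions, to show the cache hit below
    return x * x

first = square(3)   # computed and written to the on-disk cache
second = square(3)  # served from the cache; square's body does not run again
```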


1 Answer

TLDR

The parent process pickles the large model once. That can be made more performant by ensuring the large model is a numpy array backed by a memory-mapped file. Workers can then `load_temporary_memmap` it much faster than loading it from disk.

Your job is parallelized and likely to be using joblib._parallel_backends.LokyBackend.

In joblib.parallel.Parallel.__call__, joblib tries to initialize the backend to use LokyBackend when n_jobs is set to a count greater than 1.

LokyBackend uses a shared temporary folder for the same Parallel object. This is relevant for reducers that modify the default pickling behavior.

Now, LokyBackend configures a MemmappingExecutor that shares this folder to the reducers.

If you have numpy installed and your model is a clean numpy array, you are guaranteed to have it pickled once as a memmapped file using the ArrayMemmapForwardReducer and passed from parent to child processes.
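You can observe this reducer at work: with the default loky backend, numpy array arguments larger than `Parallel`'s `max_nbytes` threshold (1 MB by default) arrive in the workers as `numpy.memmap` views rather than in-memory copies (a minimal sketch; the array stands in for a model's weights):

```python
import numpy as np
from joblib import Parallel, delayed

big = np.zeros(500_000)  # ~4 MB of float64, above the default 1 MB max_nbytes threshold

def array_type(arr):
    # Inside a loky worker, memmapped arguments show up as np.memmap.
    return type(arr).__name__

types = Parallel(n_jobs=2)(delayed(array_type)(big) for _ in range(2))
```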

Otherwise it is pickled using the default pickling as a bytes object.

You can see how your model is pickled in the parent process by reading joblib's debug logs (for example, by passing a high `verbose` value to `Parallel`).

Each worker 'unpickles' the large model, so there is really no point in caching it there.

You can only improve the source from which the pickled large model is loaded in the workers, by backing your model with a memory-mapped file.
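A minimal sketch of that, assuming the model's heavy state can be stored as a numpy array: dump it once with `joblib.dump`, then `joblib.load(..., mmap_mode='r')` gives every process a read-only view onto the same file instead of a private in-memory copy:

```python
import os
import tempfile
import numpy as np
from joblib import dump, load, Parallel, delayed

# Hypothetical stand-in for the large model's weights.
weights = np.arange(1_000_000, dtype=np.float64)

path = os.path.join(tempfile.mkdtemp(), "weights.joblib")
dump(weights, path)

# mmap_mode='r' memory-maps the file instead of reading it into RAM.
mmapped = load(path, mmap_mode="r")

def lookup(arr, i):
    # Workers receive the memmap (by filename) rather than a pickled copy.
    return float(arr[i])

results = Parallel(n_jobs=2)(delayed(lookup)(mmapped, i) for i in (0, 10, 100))
```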

Oluwafemi Sule answered Sep 22 '22 02:09