Is there a simple way to track the overall progress of a joblib.Parallel execution?
I have a long-running execution composed of thousands of jobs, which I want to track and record in a database. However, to do that, whenever Parallel finishes a task, I need it to execute a callback reporting how many jobs are left.
I've accomplished a similar task before with Python's stdlib multiprocessing.Pool, by launching a thread that records the number of pending jobs in Pool's job list.
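Roughly, that earlier approach looked like this (a minimal sketch; it leans on Pool._cache, an undocumented internal dict of pending jobs, so it may break across Python versions):

    import time
    import threading
    import multiprocessing

    def work(x):
        time.sleep(0.1)
        return x * x

    def monitor(pool, total):
        # Poll the internal job cache until everything has completed.
        while pool._cache:
            done = total - len(pool._cache)
            print(f"{done}/{total} jobs finished")
            time.sleep(0.5)

    if __name__ == "__main__":
        with multiprocessing.Pool(4) as pool:
            results = [pool.apply_async(work, (i,)) for i in range(100)]
            t = threading.Thread(target=monitor, args=(pool, len(results)))
            t.start()
            values = [r.get() for r in results]
            t.join()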
Looking at the code, Parallel inherits Pool, so I thought I could pull off the same trick, but it doesn't seem to use that list, and I haven't been able to figure out any other way to "read" its internal status.
TL;DR - Parallel preserves the order of results for both the multiprocessing and threading backends.
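A quick way to convince yourself (a minimal sketch; the sleep times are arbitrary and chosen so that later inputs finish first):

    import time
    from joblib import Parallel, delayed

    def slow_identity(i):
        time.sleep((9 - i) * 0.01)  # earlier inputs take longer to finish
        return i

    if __name__ == "__main__":
        for backend in ("multiprocessing", "threading"):
            out = Parallel(n_jobs=4, backend=backend)(
                delayed(slow_identity)(i) for i in range(10)
            )
            assert out == list(range(10))  # results follow input order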
Parallel provides special handling for large arrays: it automatically dumps them to the filesystem and passes the workers a reference so they can open them as a memory map on that file, using the numpy.memmap subclass of numpy.ndarray. This makes it possible to share a segment of data between all the worker processes.
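For example (a sketch, not taken from the docs; the threshold is controlled by Parallel's max_nbytes parameter, which defaults to "1M", and the array here is sized just large enough to cross it):

    import numpy as np
    from joblib import Parallel, delayed

    def column_mean(shared, j):
        # With a process-based backend, `shared` arrives in the worker
        # as a read-only numpy.memmap rather than a pickled copy.
        return shared[:, j].mean()

    if __name__ == "__main__":
        big = np.random.rand(2000, 100)  # ~1.6 MB, above the 1M threshold
        means = Parallel(n_jobs=2, max_nbytes="1M")(
            delayed(column_mean)(big, j) for j in range(big.shape[1])
        )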
Parameters

n_jobs: int, default: None
    The maximum number of concurrently running jobs, such as the number of Python worker processes when backend="multiprocessing" or the size of the thread-pool when backend="threading".
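For instance (a minimal sketch using only the parameters described above):

    from math import sqrt
    from joblib import Parallel, delayed

    if __name__ == "__main__":
        # 4 Python worker processes
        Parallel(n_jobs=4, backend="multiprocessing")(delayed(sqrt)(i) for i in range(8))
        # thread-pool of size 4
        Parallel(n_jobs=4, backend="threading")(delayed(sqrt)(i) for i in range(8))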
For many problems, parallel computing can substantially increase speed. As PCs have grown more powerful, we can often get that speedup simply by running parallel code on our own machines.
Yet another step ahead from dano's and Connor's answers is to wrap the whole thing as a context manager:
    import contextlib
    import joblib
    from tqdm import tqdm

    @contextlib.contextmanager
    def tqdm_joblib(tqdm_object):
        """Context manager to patch joblib to report into tqdm progress bar given as argument"""
        class TqdmBatchCompletionCallback(joblib.parallel.BatchCompletionCallBack):
            def __call__(self, *args, **kwargs):
                tqdm_object.update(n=self.batch_size)
                return super().__call__(*args, **kwargs)

        old_batch_callback = joblib.parallel.BatchCompletionCallBack
        joblib.parallel.BatchCompletionCallBack = TqdmBatchCompletionCallback
        try:
            yield tqdm_object
        finally:
            joblib.parallel.BatchCompletionCallBack = old_batch_callback
            tqdm_object.close()
Then you can use it like this, and no monkey-patched code is left behind once you're done:
    from math import sqrt
    from tqdm import tqdm
    from joblib import Parallel, delayed

    with tqdm_joblib(tqdm(desc="My calculation", total=10)) as progress_bar:
        Parallel(n_jobs=16)(delayed(sqrt)(i**2) for i in range(10))
which I think is awesome, and it looks similar to tqdm's pandas integration.