
Tracking progress of joblib.Parallel execution

Is there a simple way to track the overall progress of a joblib.Parallel execution?

I have a long-running execution composed of thousands of jobs, which I want to track and record in a database. To do that, whenever Parallel finishes a task, I need it to execute a callback reporting how many jobs are left.

I've accomplished a similar task before with Python's stdlib multiprocessing.Pool, by launching a thread that records the number of pending jobs in Pool's job list.

Looking at the code, Parallel inherits Pool, so I thought I could pull off the same trick, but it doesn't seem to use that list, and I haven't been able to figure out how else to "read" its internal status.
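The Pool-monitoring trick mentioned above can be sketched as follows. Note that `pool._cache` is an undocumented CPython implementation detail (the internal dict of pending results), so this is a sketch of the approach rather than a stable API; the `square` worker is just a stand-in job:

```python
import threading
import time
from multiprocessing import Pool


def track_progress(pool, total, interval=0.5):
    # Poll the pool's internal cache of pending results until it drains.
    # pool._cache is undocumented and may change between Python versions.
    while pool._cache:
        print(f"{len(pool._cache)} of {total} jobs still pending")
        time.sleep(interval)


def square(x):
    return x * x


if __name__ == "__main__":
    with Pool(4) as pool:
        results = [pool.apply_async(square, (i,)) for i in range(100)]
        monitor = threading.Thread(target=track_progress,
                                   args=(pool, len(results)))
        monitor.start()
        values = [r.get() for r in results]  # block until all jobs finish
        monitor.join()
    print(values[:5])
```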

Asked by Cerin, Jul 27 '14 17:07


1 Answer

Yet another step ahead from dano's and Connor's answers is to wrap the whole thing as a context manager:

import contextlib
import joblib
from tqdm import tqdm

@contextlib.contextmanager
def tqdm_joblib(tqdm_object):
    """Context manager to patch joblib to report into tqdm progress bar given as argument"""
    class TqdmBatchCompletionCallback(joblib.parallel.BatchCompletionCallBack):
        def __call__(self, *args, **kwargs):
            tqdm_object.update(n=self.batch_size)
            return super().__call__(*args, **kwargs)

    old_batch_callback = joblib.parallel.BatchCompletionCallBack
    joblib.parallel.BatchCompletionCallBack = TqdmBatchCompletionCallback
    try:
        yield tqdm_object
    finally:
        joblib.parallel.BatchCompletionCallBack = old_batch_callback
        tqdm_object.close()

Then you can use it like this, without leaving monkey-patched code behind once you're done:

from math import sqrt
from joblib import Parallel, delayed

with tqdm_joblib(tqdm(desc="My calculation", total=10)) as progress_bar:
    Parallel(n_jobs=16)(delayed(sqrt)(i**2) for i in range(10))

which I think is a great pattern; it looks similar to tqdm's pandas integration.
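The same monkey-patching idea can serve the original goal of recording progress in a database: instead of updating a tqdm bar, the patched callback invokes an arbitrary function once per completed batch. This is a sketch under the same assumptions as above (joblib's `BatchCompletionCallBack` with its `batch_size` attribute is an internal, not a public API); the callback here just counts completions, standing in for a hypothetical database write:

```python
import contextlib
import joblib
from joblib import Parallel, delayed


@contextlib.contextmanager
def callback_joblib(on_batch_complete):
    """Patch joblib so on_batch_complete(batch_size) runs after each batch.

    Relies on the internal joblib.parallel.BatchCompletionCallBack class,
    same as the tqdm_joblib recipe; not a public API.
    """
    class Callback(joblib.parallel.BatchCompletionCallBack):
        def __call__(self, *args, **kwargs):
            on_batch_complete(self.batch_size)  # e.g. UPDATE a progress row
            return super().__call__(*args, **kwargs)

    old_callback = joblib.parallel.BatchCompletionCallBack
    joblib.parallel.BatchCompletionCallBack = Callback
    try:
        yield
    finally:
        joblib.parallel.BatchCompletionCallBack = old_callback


# Usage: count completed tasks; a real handler would write to the database.
done = []
with callback_joblib(lambda n: done.append(n)):
    Parallel(n_jobs=2)(delayed(abs)(-i) for i in range(8))
print(sum(done))  # total number of completed tasks
```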

Answered by featuredpeow, Oct 01 '22 02:10