Difference between dask.distributed LocalCluster with threads vs. processes

What is the difference between the following LocalCluster configurations for dask.distributed?

Client(n_workers=4, processes=False, threads_per_worker=1)

versus

Client(n_workers=1, processes=True, threads_per_worker=4)

They both have four threads working on the task graph, but the first has four workers. What, then, would be the benefit of having multiple workers acting as threads as opposed to a single worker with multiple threads?

Edit: just a clarification, I'm aware of the difference between processes, threads and shared memory, so this question is oriented more towards the configurational differences of these two Clients.
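(Aside for readers who do want the background the asker already has: one concrete consequence of `processes=False` is that all workers live inside the Client's process and can read shared objects with no serialization, whereas separate worker processes must pickle data between them. A minimal non-dask sketch of the shared-memory side, using `ThreadPoolExecutor` as a stand-in for threaded workers:)

```python
from concurrent.futures import ThreadPoolExecutor

# Module-level data, standing in for an object the client process holds.
DATA = list(range(8))

def partial_sum(bounds):
    lo, hi = bounds
    # Threads read DATA directly: no pickling, no copies --
    # analogous to workers running with processes=False.
    return sum(DATA[lo:hi])

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, [(0, 2), (2, 4), (4, 6), (6, 8)]))

print(total)  # 28 == sum(range(8))
```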

asked Sep 02 '19 by jrinker


1 Answer

I was inspired by both Victor's and Martin's answers to dig a little deeper, so here's an in-depth summary of my understanding (it wouldn't fit in a comment).

First, note that the scheduler printout in this version of dask isn't quite intuitive: processes is actually the number of workers, and cores is the total number of threads across all workers.

Secondly, Victor's comments about the TCP address and adding/connecting more workers are good to point out. I'm not sure if more workers could be added to a cluster with processes=False, but I think the answer is probably yes.

Now, consider the following script:

from dask.distributed import Client

if __name__ == '__main__':
    with Client(processes=False) as client:  # Config 1
        print(client)
    with Client(processes=False, n_workers=4) as client:  # Config 2
        print(client)
    with Client(processes=False, n_workers=3) as client:  # Config 3
        print(client)
    with Client(processes=True) as client:  # Config 4
        print(client)
    with Client(processes=True, n_workers=3) as client:  # Config 5
        print(client)
    with Client(processes=True, n_workers=3,
                threads_per_worker=1) as client:  # Config 6
        print(client)

This produces the following output with dask version 2.3.0 on my laptop (4 cores):

<Client: scheduler='inproc://90.147.106.86/14980/1' processes=1 cores=4>
<Client: scheduler='inproc://90.147.106.86/14980/9' processes=4 cores=4>
<Client: scheduler='inproc://90.147.106.86/14980/26' processes=3 cores=6>
<Client: scheduler='tcp://127.0.0.1:51744' processes=4 cores=4>
<Client: scheduler='tcp://127.0.0.1:51788' processes=3 cores=6>
<Client: scheduler='tcp://127.0.0.1:51818' processes=3 cores=3>

Here's my understanding of the differences between the configurations:

  1. The scheduler and all workers are run as threads within the Client process. (As Martin said, this is useful for introspection.) Because neither the number of workers nor the number of threads per worker is given, dask calls its function nprocesses_nthreads() to set the defaults (with processes=False: 1 process, with threads equal to the available cores).
  2. Same as 1, but since n_workers was given, dask chooses threads per worker such that the total number of threads equals the number of cores (i.e., 1 thread per worker here). Again, processes in the printout is not exactly correct -- it's actually the number of workers (which in this case are really threads).
  3. Same as 2, but since n_workers doesn't divide equally into the number of cores, dask chooses 2 threads/worker to overcommit instead of undercommit.
  4. The Client, Scheduler and all workers are separate processes. Dask chooses the default number of workers (equal to cores because it's <= 4) and the default number of threads/worker (1).
  5. Same as 4 (Client, Scheduler and workers are separate processes), but with n_workers=3; since 3 doesn't divide evenly into 4 cores, the total threads are overprescribed for the same reason as in 3.
  6. This behaves as expected.
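The default worker/thread split seen in items 2-5 can be sketched as a simplified pure-Python function (an approximation of the behavior observed above, not dask's actual nprocesses_nthreads code):

```python
import math

def default_split(cores, n_workers):
    """Approximate dask's choice of threads_per_worker when only
    n_workers is given: round up (overcommit) rather than leave
    cores idle (undercommit)."""
    threads_per_worker = max(1, math.ceil(cores / n_workers))
    return threads_per_worker, n_workers * threads_per_worker

print(default_split(4, 4))  # (1, 4) -- matches Configs 2 and 4
print(default_split(4, 3))  # (2, 6) -- matches Configs 3 and 5
```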
answered Nov 05 '22 by jrinker