
How to specify the number of threads/processes for the default dask scheduler

Tags: python, dask

Is there a way to limit the number of cores used by the default threaded scheduler (default when using dask dataframes)?

With compute, you can specify it by using:

df.compute(get=dask.threaded.get, num_workers=20)

But I was wondering if there is a way to set this as the default, so you don't need to specify it for each compute call?

This would for example be interesting in the case of a small cluster (e.g. 64 cores) that is shared with other people (without a job system), where I don't necessarily want to take up all cores when starting computations with dask.
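For reference, in newer dask versions the get= keyword has been replaced by scheduler=, so a minimal per-call sketch would look like the following (the data-*.csv input is a hypothetical placeholder):

import dask.dataframe as dd

df = dd.read_csv("data-*.csv")  # hypothetical input files

# limit just this computation to 20 threads on the threaded scheduler
result = df.compute(scheduler="threads", num_workers=20)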

asked Nov 15 '16 by joris


1 Answer

You can specify a default ThreadPool:

from multiprocessing.pool import ThreadPool
import dask

# every subsequent compute() on the threaded scheduler will use this 20-thread pool
dask.config.set(pool=ThreadPool(20))
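A usage sketch (not from the original answer): dask.config.set also works as a context manager, which keeps the limit scoped to your own computations on a shared machine; the data-*.csv input is again a hypothetical placeholder.

from multiprocessing.pool import ThreadPool

import dask
import dask.dataframe as dd

df = dd.read_csv("data-*.csv")  # hypothetical input files

# only computations inside this block are limited to the 20-thread pool;
# outside it, dask falls back to its default of one thread per logical core
with dask.config.set(pool=ThreadPool(20)):
    result = df.compute()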
answered Sep 24 '22 by MRocklin