Is there a way to limit the number of cores used by the default threaded scheduler (default when using dask dataframes)?
With compute, you can specify it per call:

df.compute(scheduler="threads", num_workers=20)

(older Dask versions spelled this as get=dask.threaded.get instead of the scheduler keyword)
But I was wondering if there is a way to set this as the default, so you don't need to specify it for each compute call?
This would, for example, be useful on a small cluster (say, 64 cores) that is shared with other people without a job system, where I don't necessarily want to take up all the cores when starting computations with dask.
The Dask scheduler acts as a middle layer between the client and the workers, instructing workers to execute the actual computations requested by the client. It also helps the workers coordinate with each other, deciding who should do which tasks.
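If you use the distributed scheduler, you can cap the resources when the cluster is created instead. This is a minimal sketch assuming dask.distributed is installed; the worker and thread counts are only illustrative:

from dask.distributed import Client, LocalCluster

# start a local cluster that uses at most 4 workers x 5 threads = 20 cores
cluster = LocalCluster(n_workers=4, threads_per_worker=5)
client = Client(cluster)  # subsequent dask computations are routed through this client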
dask.bag uses the multiprocessing scheduler by default.
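The same num_workers limit applies there as well. A minimal sketch, assuming a recent Dask version where the scheduler is selected with the scheduler keyword:

import dask.bag as db

b = db.from_sequence(range(1000), npartitions=8)
# cap the process-based scheduler at 4 worker processes for this call
result = b.map(lambda x: x ** 2).sum().compute(scheduler="processes", num_workers=4)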
You can specify a default ThreadPool:
from multiprocessing.pool import ThreadPool
import dask

# every subsequent computation on the threaded scheduler draws from this 20-thread pool
dask.config.set(pool=ThreadPool(20))
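dask.config.set can also be used as a context manager if you only want to restrict a particular block of work rather than change the global default. A minimal sketch; df stands for whatever dask dataframe you already have:

import dask
from multiprocessing.pool import ThreadPool

# only computations inside this block are limited to 4 threads
with dask.config.set(pool=ThreadPool(4)):
    df.sum().compute()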