Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dask distributed workers stall when reaching 80% of memory limit

I am having troubles with memory leaks on Dask workers. Everytime one of the workers reach 80% of their memory limit, they stall and do not compute anything any more:

Three workers stalling at 6.0GB memory usage Here you can see four panels "Bytes stored", "Task stream", "Progress" and "Task processing". The "Bytes stored" panel shows the amount of memory occupied (x-axis) by each of the workers (y-axis). The "Task Stream" panel is a visualization of the threads (y-axis) and the runtime needed to process a task (x-axis). Note that every worker has two threads. The "Task Processing" panel shows a visualization of the task distribution across workers. Dask balances the amount of work to do, i.e. it makes sure that workers always have similar amounts of tasks to process. The "Progress" panel simply shows the processing stages and how much of the stages' tasks are already completed/in memory/waiting to be computed.

Memory and cpu profile of workers This is a simple top-like overview of the workers and their memory limits, etc. As you can see, workers 1, 2 and 3 have low CPU usage (~ 5%) and store 6GB of memory. I.e. they hit their memory limit of 80% and do not accept any new tasks.

Setting lifetime="20 mintues", lifetime_restart=True helps as it restarts the worker s from time to time. However, when a worker reaches the memory limit very fast, it just stalls for ~ 20min until it gets restarted.

Is there some better way to restart workers earlier? I do not want to lower the lifetime too much since long-running tasks might not be able to finish then.

The best strategy would be IMHO the following:

  1. Worker completes (long-running) task
  2. Worker checks whether size of stored items << total memory usage
  3. Worker gracefully restarts itself
like image 201
Hoeze Avatar asked Sep 01 '25 03:09

Hoeze


1 Answers

The policy that you're looking for is described here: https://distributed.dask.org/en/latest/worker.html#memory-management

You can remove the 80% freeze limit and cause things to restart more quickly by changing configuration. These configuration values are documented here: https://docs.dask.org/en/latest/configuration-reference.html#distributed.worker.memory.target

like image 109
MRocklin Avatar answered Sep 02 '25 16:09

MRocklin