I am parallelising some code using dask.distributed (embarrassingly parallel task).
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit=8e9)
client = Client(cluster)
While the tasks are running, the workers repeatedly print warnings like:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.
Perhaps some other process is leaking memory? Process memory: 6.21 GB -- Worker memory limit: 8.00 GB
suggesting that part of the RAM used by the worker is not freed between the different files (I guess these are leftover filtering intermediates).
Is there a way to free the worker's memory before it starts processing the next image? Do I have to run a garbage-collection cycle between tasks?
I included a gc.collect() call at the end of the function run by the worker, but it didn't eliminate the warnings.
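Roughly, the pattern looks like this (a minimal sketch; process_image and the file names are stand-ins for the actual filtering code):

import gc
from dask.distributed import Client, LocalCluster

def process_image(path):
    # load the image, apply the filters, write the result to disk (placeholder)
    ...
    gc.collect()   # explicit collection at the end of each task, as described above
    return path

cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit=8e9)
client = Client(cluster)

futures = client.map(process_image, ["img_0.tif", "img_1.tif"])  # stand-in file names
client.gather(futures)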
Thanks a lot for the help!
As long as a client holds a reference to a distributed value, the cluster won't purge it from memory. This is expounded on in the Managing Memory documentation, specifically the "Clearing data" section.
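For instance, a minimal sketch of releasing those client-side references (process_image and paths are placeholders for your own task function and inputs):

from dask.distributed import Client

client = Client()  # or reuse the client created from your LocalCluster above

futures = client.map(process_image, paths)  # placeholders for your own task and inputs
results = client.gather(futures)

# As long as the client holds these Future objects, the scheduler keeps the
# corresponding results in worker memory. Releasing the references lets the
# cluster free them:
client.cancel(futures)  # explicitly forget the results on the cluster
del futures             # and drop the local references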
The "Memory use is high" error message could be pointing to a few potential culprits. I found this article by one of the core Dask maintainers helpful in diagnosing and fixing the issue.
Quick summary of the fixes suggested there: manually trim memory with malloc_trim (especially if working with non-NumPy data or small NumPy chunks). Make sure you can see the Dask dashboard while your computations are running to figure out which approach is working.
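A minimal sketch of the manual-trimming approach (Linux/glibc only; trim_memory is just a helper name, not part of the Dask API):

import ctypes

def trim_memory() -> int:
    # Ask glibc to return freed memory arenas to the operating system.
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# Run it on every worker, e.g. between batches of images:
client.run(trim_memory)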