I have a dask cluster with n workers and want the workers to run queries against a database. But the database can only handle m queries in parallel, where m < n. How can I model that in dask.distributed? At most m workers should be working on such a task at any one time.
I have seen that distributed supports locks (http://distributed.readthedocs.io/en/latest/api.html#distributed.Lock). But with a lock I could only run one query at a time, not m.
I have also seen that I could define resources per worker (https://distributed.readthedocs.io/en/latest/resources.html). But that does not fit either, as the database is independent of the workers. I would either have to define 1 database resource per worker (which allows n parallel queries, too many for the database), or I would have to distribute m database resources across n workers, which is awkward when setting up the cluster and suboptimal in execution.
Is it possible to define something like semaphores in dask to solve that?
You could probably hack something together with Locks and Variables.
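As a rough illustration of that kind of hack, here is a minimal sketch that emulates a counting semaphore with a pool of m locks: a task scans the pool, takes the first free lock, and releases it when its query is done. It is only a local stand-in under stated assumptions: it uses threading.Lock and threads in place of distributed.Lock and dask workers, and the names LockPoolSemaphore, M_SLOTS, fake_query and run_demo are made up for this example. On a real cluster the slots would presumably be named locks (for example one distributed.Lock per slot name), which is not shown or tested here.

```python
import threading
import time

M_SLOTS = 3  # hypothetical database limit m


class LockPoolSemaphore:
    """Counting semaphore emulated with a pool of plain locks (one per slot)."""

    def __init__(self, slots, lock_factory=threading.Lock):
        # lock_factory is a stand-in; on a cluster each slot would need to be
        # a lock that is shared between workers.
        self._locks = [lock_factory() for _ in range(slots)]

    def acquire(self, poll_interval=0.005):
        """Block until one of the slots is free and return the lock held."""
        while True:
            for lock in self._locks:
                if lock.acquire(blocking=False):
                    return lock
            # Every slot is taken: wait a little and scan the pool again.
            time.sleep(poll_interval)


def run_demo(n_tasks=8, slots=M_SLOTS):
    """Run n_tasks fake queries and record how many ran at the same time."""
    sem = LockPoolSemaphore(slots)
    state = {"active": 0, "peak": 0, "done": 0}
    guard = threading.Lock()  # protects the bookkeeping in `state`

    def fake_query():
        slot = sem.acquire()
        try:
            with guard:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.02)  # placeholder for the real database query
        finally:
            with guard:
                state["active"] -= 1
                state["done"] += 1
            slot.release()

    threads = [threading.Thread(target=fake_query) for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


if __name__ == "__main__":
    print(run_demo())
```

The polling loop is the weak point of this approach: waiting tasks wake up repeatedly instead of being notified, and no fairness is guaranteed.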
A cleaner solution would be to implement Semaphores in much the same way that Locks are implemented. Depending on your experience this may not be that hard (the lock implementation is 150 lines), and it would be a welcome pull request.
https://github.com/dask/distributed/blob/master/distributed/lock.py