I have been seeing below error message for quite some time now but could not figure out what leads to the failure.
Error:
concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)
On restart of dask cluster it runs successfully.
If running a dask-cloudprovider
ECSCluster
or FargateCluster
the concurrent.futures._base.CancelledError
can result from a long-running step in computation where there is no output (logging or otherwise) to the Client
. In these cases, due to the lack of interaction with the client, the scheduler regards itself as "idle" and times out after the configured cloudprovider.ecs.scheduler_timeout
period, which defaults to 5 minutes. The CancelledError error message is misleading, but if you look in the logs for the scheduler task itself it will record the idle timeout.
The solution is to set scheduler_timeout
to a higher value, either via config or by passing directly to the ECSCluster
/FargateCluster
constructor.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With