What causes dask job failure with CancelledError exception

Question

I have been seeing below error message for quite some time now but could not figure out what leads to the failure.

Error:

concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)

On restart of dask cluster it runs successfully.

elukem · Accepted Answer

If running a dask-cloudprovider ECSCluster or FargateCluster the concurrent.futures._base.CancelledError can result from a long-running step in computation where there is no output (logging or otherwise) to the Client. In these cases, due to the lack of interaction with the client, the scheduler regards itself as "idle" and times out after the configured cloudprovider.ecs.scheduler_timeout period, which defaults to 5 minutes. The CancelledError error message is misleading, but if you look in the logs for the scheduler task itself it will record the idle timeout.

The solution is to set scheduler_timeout to a higher value, either via config or by passing directly to the ECSCluster/FargateCluster constructor.

What causes dask job failure with CancelledError exception

Tags:

dask

dask-distributed

Santosh Kumar

1 Answers

elukem

Recent Activity

Donate For Us

What causes dask job failure with CancelledError exception

Tags:

dask

dask-distributed

Santosh Kumar

1 Answers

elukem

Related questions

Recent Activity

Donate For Us