Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What causes dask job failure with CancelledError exception

I have been seeing below error message for quite some time now but could not figure out what leads to the failure.

Error:

concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)

On restart of dask cluster it runs successfully.

like image 288
Santosh Kumar Avatar asked Oct 19 '17 19:10

Santosh Kumar


1 Answers

If running a dask-cloudprovider ECSCluster or FargateCluster the concurrent.futures._base.CancelledError can result from a long-running step in computation where there is no output (logging or otherwise) to the Client. In these cases, due to the lack of interaction with the client, the scheduler regards itself as "idle" and times out after the configured cloudprovider.ecs.scheduler_timeout period, which defaults to 5 minutes. The CancelledError error message is misleading, but if you look in the logs for the scheduler task itself it will record the idle timeout.

The solution is to set scheduler_timeout to a higher value, either via config or by passing directly to the ECSCluster/FargateCluster constructor.

like image 185
elukem Avatar answered Oct 19 '22 16:10

elukem