We have a service that orchestrates starting Fargate ECS tasks in response to messages from a RabbitMQ queue. Sometimes tasks inexplicably fail to start.
Info:
The problem:
Error messages I have gotten the last 36 hours:
'Timeout waiting for network interface provisioning to complete.'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message'
'CannotPullContainerError: ref pull has been retried 5 time(s): failed to resolve reference <image that exists in repository>: failed to do request: Head https:<account-id>.dkr.ecr.eu-west-1.amazonaws.com/v2/k1-d...'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: context deadline exceeded'
Thoughts:
Questions:
Thanks very much in advance. I have been chasing this problem for months and feel like I am at least closing in on it, but I fear this is as far as I can get on my own.
Tasks can fail to start for a number of reasons. Some are transient and sit on the AWS side; others are structural, rooted in your configuration, and sit on your side. For example, the network timeout is often due to a network misconfiguration where the task ENI does not have a proper route to the registry (e.g. Docker Hub). In the other cases it may be a transient one-off issue in the Fargate internals.
These problems may be transparent to you, or you may need to take action, depending on how you use Fargate. For example, if you run Fargate tasks as part of an ECS service or an EKS deployment, the ECS/EKS routines will retry instantiating the task until the service/deployment target configuration is met.
If you are launching the Fargate task with a one-off RunTask API call (i.e. not as part of an orchestrator control loop that can monitor its failure), then it depends on how you are calling that API. Tools such as AWS Step Functions, AWS Batch and possibly others all have retry mechanisms, so if a task fails to launch they can re-launch it.
However, if you are launching the task from an imperative line of code (or a CLI command etc.), then it is up to your code to verify that the task launched properly and to re-launch it if it did not.
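As a rough illustration of that last case, here is a minimal Python sketch of a launch-and-verify loop. It assumes `ecs` behaves like `boto3.client("ecs")` (only `run_task` and `describe_tasks` are used); the function name `run_task_with_retry` and its parameters are made up for this example, and the attempt count and polling interval are placeholders you would tune. It treats a task that stops with `stopCode` `TaskFailedToStart` (which is what the ENI and image pull errors above should surface as) as worth re-launching.

```python
import time


def run_task_with_retry(ecs, run_task_kwargs, max_attempts=3,
                        poll_seconds=5.0, max_polls=60, sleep=time.sleep):
    """Launch a task and re-launch it if it fails to start.

    `ecs` is assumed to behave like boto3.client("ecs").
    Returns the ARN of a task that got past the launch phase.
    """
    cluster = run_task_kwargs.get("cluster", "default")
    last_reason = "unknown"
    for attempt in range(1, max_attempts + 1):
        response = ecs.run_task(**run_task_kwargs)
        if not response.get("tasks"):
            # RunTask itself reported a failure instead of creating a task.
            failures = response.get("failures", [])
            last_reason = (failures[0].get("reason", "unknown")
                           if failures else "no task returned")
        else:
            task_arn = response["tasks"][0]["taskArn"]
            for _ in range(max_polls):
                described = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
                tasks = described.get("tasks", [])
                if tasks:
                    task = tasks[0]
                    if task["lastStatus"] == "RUNNING":
                        return task_arn
                    if task["lastStatus"] == "STOPPED":
                        if task.get("stopCode") != "TaskFailedToStart":
                            # The task started and then stopped; that is
                            # not a launch failure, so do not re-launch.
                            return task_arn
                        last_reason = task.get("stoppedReason", "unknown")
                        break
                sleep(poll_seconds)
            else:
                # Still pending: re-launching here could create a duplicate.
                raise TimeoutError(f"{task_arn} did not reach RUNNING in time")
        if attempt < max_attempts:
            sleep(poll_seconds * attempt)  # simple linear backoff
    raise RuntimeError(
        f"task failed to start after {max_attempts} attempts: {last_reason}")
```

With boto3 this would be called along the lines of `run_task_with_retry(boto3.client("ecs"), {"cluster": ..., "taskDefinition": ..., "launchType": "FARGATE", "networkConfiguration": ...})`. Retrying only helps with the transient failures; if every attempt ends with the same network or image pull error, the configuration is the more likely culprit.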