Occasional failure on Amazon ECS with different error messages when starting task

We have a service that starts Fargate ECS tasks in response to messages from a RabbitMQ queue. Sometimes tasks inexplicably fail to start.

Info:

  • It starts a task roughly once every two to ten minutes.
  • It uses a fixed set of task definitions, which it re-uses.
  • It consistently uses the same subnet in the same VPC.

The problem:

  • The vast majority of tasks start fine, say 98%. Sometimes tasks fail to start, and I get error messages. The error messages are not always the same, but they all seem to be network-related.

Error messages I have gotten in the last 36 hours:

  • 'Timeout waiting for network interface provisioning to complete.'
  • 'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message'
  • 'CannotPullContainerError: ref pull has been retried 5 time(s): failed to resolve reference <image that exists in repository>: failed to do request: Head https:<account-id>.dkr.ecr.eu-west-1.amazonaws.com/v2/k1-d...'
  • 'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: context deadline exceeded'

Thoughts:

  • It looks to me like there is a network-connectivity error of some sort.
  • My Googling tells me that at least some of these errors can arise from a misconfigured VPC or misconfigured route tables.
  • I assume that is not the case here, since starting the exact same task with the exact same task definition in the same subnet works fine most of the time.
  • The ENI problem could maybe arise from running out of ENIs (?) on an EC2 instance, but since these tasks are started through Fargate I feel like that should not be the problem.
  • It seems like at least the network provisioning error can sometimes be an AWS issue.

Questions:

  • Why is this happening? Is it me or AWS?
  • Depending on the answer to the first question, is there something I can do to avoid this?
  • If there is nothing I can do to avoid it, is there something I can do to mitigate it while it's happening? Should I simply retry starting the task and hope that solves it?

Thanks very much in advance. I have been chasing this problem for months and feel like I am at least closing in on it, but I fear this is as far as I can get on my own.

uggl asked Jan 27 '26 20:01


1 Answer

Tasks can fail to start for a number of reasons. Some are transient and sit more on the "AWS" side; others are structural to your configuration and sit more on the "you" side. For example, the network timeout is often caused by a network misconfiguration in which the task ENI does not have a proper route to the registry (e.g. Docker Hub). In all other cases, it may be a transient, one-off issue in the Fargate internals.

These problems may be transparent to you, or you may need to take action, depending on how you use Fargate. For example, if you run Fargate tasks as part of an ECS service or an EKS deployment, the ECS/EKS routines will keep retrying to instantiate the task until the service/deployment target configuration is met.

If you are launching the Fargate task with a one-off RunTask API call (i.e. not as part of an orchestrator control loop that can monitor its failure), then it depends on how you are calling that API. If you are calling it from tools such as AWS Step Functions, AWS Batch and possibly others, they all have retry mechanisms, so if a task fails to launch they are smart enough to re-launch it.

However, if you are launching the task from an imperative line of code (or a CLI command, etc.), then it's on your code to check that the task has launched properly and to re-launch it if you get an error message.
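To make that last case concrete, here is a minimal sketch of what such a retry could look like in Python. It is only an illustration under some assumptions, not a drop-in fix: `launch_with_retry` and `make_fargate_launcher` are hypothetical names, the attempt count and backoff delays are arbitrary, and the boto3 part assumes you already have working `run_task` arguments (cluster, task definition, network configuration). It also retries on any failure, so it will not help with errors caused by a real misconfiguration.

```python
import time
from typing import Callable, Optional


def launch_with_retry(
    launch: Callable[[], Optional[str]],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call launch() until it succeeds or the attempts run out.

    launch() must return None on success or an error string on failure.
    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        last_error = launch()
        if last_error is None:
            return attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"task failed to start after {max_attempts} attempts: {last_error}"
    )


def make_fargate_launcher(run_task_kwargs: dict) -> Callable[[], Optional[str]]:
    """Hypothetical helper: wraps a boto3 RunTask call as a launch() callable.

    run_task_kwargs is whatever you already pass to ecs.run_task().
    API errors raised by boto3 itself (e.g. throttling) are not handled here.
    """
    import boto3  # imported lazily so the retry logic above has no AWS dependency
    from botocore.exceptions import WaiterError

    ecs = boto3.client("ecs")
    cluster = run_task_kwargs.get("cluster", "default")

    def launch() -> Optional[str]:
        resp = ecs.run_task(**run_task_kwargs)
        if resp["failures"]:
            # RunTask itself refused to place the task.
            return resp["failures"][0].get("reason", "unknown RunTask failure")
        task_arn = resp["tasks"][0]["taskArn"]
        try:
            # Blocks until the task is RUNNING; raises if it stops first.
            ecs.get_waiter("tasks_running").wait(cluster=cluster, tasks=[task_arn])
        except WaiterError:
            desc = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
            return desc["tasks"][0].get("stoppedReason", "task stopped before RUNNING")
        return None

    return launch
```

You would then call something like `launch_with_retry(make_fargate_launcher(my_run_task_args))` from the code that consumes the queue message. Note that a task that exits very quickly may reach STOPPED before the waiter ever sees RUNNING, so for short-lived tasks you may want to look at the exit code as well before treating it as a failed start.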

mreferre answered Jan 29 '26 12:01
