
AWS Step Functions and Fargate task: container runtime error does not cause pipeline to fail

I have a pipeline defined in AWS Step Functions. One step is defined as a Fargate task, which pulls a Docker image and runs some Python code. To my surprise, if the container running in the Fargate task encounters a runtime error, Step Functions doesn't catch the failed task and continues the pipeline as normal (marking the Fargate task as successful). According to the documentation, the pipeline should fail as soon as this happens.

This is the step function definition:

{
  "Comment": "My state machine",
  "StartAt": "MyFargateTask",
  "States": {
    "MyFargateTask": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "InputPath": "$",
      "Parameters": {
        "Cluster": "my-cluster",
        "TaskDefinition": "arn:aws:ecs:us-east-1:617090640476:task-definition/my-task:1",
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
          "AwsvpcConfiguration": {
            "Subnets": [
              "subnet-xxxxxxxxxxxxxxxxx",
              "subnet-yyyyyyyyyyyyyyyyy"
            ],
            "AssignPublicIp": "ENABLED"
          }
        }
      },
      "Next": "Done"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}

I've tried the following simple python code for the Fargate container:

def main():
    raise Exception("foobar")

if __name__ == '__main__':
    main()

In the container logs on CloudWatch I can see the program failing as expected, but the pipeline in the Step Function succeeds (all green). What am I missing? Is this a bug?

revy asked Sep 09 '20 15:09



2 Answers

AWS Step Functions does not know whether an ECS job has succeeded or failed. To find out, it would need to peek into the ECS job's container log and try to determine whether the process running inside the Docker container exited with a failure code. That's not something Step Functions does. As you have it configured, Step Functions simply assumes that whenever the container exits, the task has succeeded.

If you change arn:aws:states:::ecs:runTask.sync to arn:aws:states:::ecs:runTask.waitForTaskToken, then instead of just waiting for the ECS container to exit, Step Functions will wait for the ECS container to send a success or failure code back to the Step Functions API. You will also need to pass the task token into the ECS container, which can be done with a ContainerOverrides setting, like so:

{
  "Comment": "My state machine",
  "StartAt": "MyFargateTask",
  "States": {
    "MyFargateTask": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.waitForTaskToken",
      "InputPath": "$",
      "Parameters": {
        "Cluster": "my-cluster",
        "TaskDefinition": "arn:aws:ecs:us-east-1:617090640476:task-definition/my-task:1",
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
          "AwsvpcConfiguration": {
            "Subnets": [
              "subnet-xxxxxxxxxxxxxxxxx",
              "subnet-yyyyyyyyyyyyyyyyy"
            ],
            "AssignPublicIp": "ENABLED"
          }
        },
        "Overrides": {
          "ContainerOverrides": [{
            "Environment": [{
              "Name": "TASK_TOKEN",
              "Value.$": "$$.Task.Token"
            }]
          }]
        }
      },
      "Next": "Done"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}

Now inside your Python script you can grab the TASK_TOKEN environment variable, and issue a success or failure message back to Step Functions like so:

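import os

import boto3

# The state machine passes the task token in via ContainerOverrides.
# .get() returns None instead of raising KeyError when the variable is unset.
token = os.environ.get('TASK_TOKEN')


def step_success():
    """Tell Step Functions the task finished successfully."""
    if token is not None:
        stfn = boto3.client('stepfunctions')
        stfn.send_task_success(taskToken=token, output='{"Status": "Success"}')


def step_fail():
    """Tell Step Functions the task failed."""
    if token is not None:
        stfn = boto3.client('stepfunctions')
        stfn.send_task_failure(taskToken=token, error="An error occurred")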
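
One way to wire these calls around the actual job is sketched below. This is only an illustration: `run_and_report`, `RecordingClient` and `failing_job` are hypothetical names that do not come from the answer, and `RecordingClient` is a stand-in that lets the sketch run without AWS credentials. In a real container you would pass `boto3.client('stepfunctions')` and the `TASK_TOKEN` value instead.

```python
import json


def run_and_report(work, client, token):
    """Run work() and report the outcome through the given client."""
    try:
        result = work()
    except Exception as exc:
        # Report the failure, then re-raise so the process still exits non-zero.
        client.send_task_failure(
            taskToken=token,
            error=type(exc).__name__,
            cause=str(exc),
        )
        raise
    client.send_task_success(
        taskToken=token,
        output=json.dumps({"Status": "Success"}),
    )
    return result


class RecordingClient:
    """Offline stand-in (assumption) that records calls instead of hitting AWS."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("failure", kwargs))


def failing_job():
    # Same failure as the script in the question.
    raise Exception("foobar")


client = RecordingClient()
try:
    run_and_report(failing_job, client, token="example-token")
except Exception:
    pass  # swallowed here only so the demonstration keeps running

print(client.calls)
```

Re-raising after `send_task_failure` keeps the failure visible in the container logs as well as in the state machine.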

More details on this approach

I also recommend configuring a timeout in the state machine, in case your Python script fails to run inside the container and never reports back. You will also need to add the appropriate IAM permissions to the Fargate task's IAM role to allow it to issue these status calls back to the Step Functions API.
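
As a rough sketch of those two pieces, the Python below adds a `TimeoutSeconds` field to a Task state and builds an IAM policy document for the callback calls. The helper name `with_timeout`, the 900-second value and the wildcard `Resource` are assumptions chosen for illustration, so adjust them to your own setup.

```python
import json


def with_timeout(task_state: dict, seconds: int) -> dict:
    """Return a copy of a Task state with TimeoutSeconds set."""
    updated = dict(task_state)
    updated["TimeoutSeconds"] = seconds
    return updated


# Trimmed-down version of the Task state from the definition above.
task_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::ecs:runTask.waitForTaskToken",
    "Next": "Done",
}

# 900 seconds is an arbitrary example value, not a recommendation.
state_with_timeout = with_timeout(task_state, 900)

# Policy for the Fargate task role. "Resource": "*" is an assumption made
# for brevity; narrow it if your account setup allows.
callback_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "states:SendTaskSuccess",
                "states:SendTaskFailure",
            ],
            "Resource": "*",
        }
    ],
}

print(json.dumps(state_with_timeout, indent=2))
print(json.dumps(callback_policy, indent=2))
```

With the timeout in place, a task that never calls back ends in a `States.Timeout` error instead of leaving the execution waiting.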

Mark B answered Oct 24 '22 09:10


I think using arn:aws:states:::ecs:runTask.sync will make your Step Function fail, because ECS will show the container status as STOPPED with the detail message "Essential container in task exited".

If you use arn:aws:states:::ecs:runTask (without .sync), Step Functions will not track the ECS task's outcome and the step will result in Success.

Given your case, you should use arn:aws:states:::ecs:runTask.waitForTaskToken and call sendTaskSuccess or sendTaskFailure with the token provided by Step Functions.
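
Since both answers turn on how the container process ends, it can help to confirm locally which exit code the script from the question produces. A minimal sketch, assuming a standard CPython interpreter; `exit_code_of` is a hypothetical helper and not part of any AWS API:

```python
import subprocess
import sys


def exit_code_of(source: str) -> int:
    """Run a Python snippet in a child interpreter and return its exit code."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


# An uncaught exception ends the interpreter with a non-zero exit code.
failing = exit_code_of("raise Exception('foobar')")
passing = exit_code_of("print('ok')")

print(failing, passing)
```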

speedysinghs. answered Oct 24 '22 09:10