AWS Step Functions and Fargate task: container runtime error does not cause pipeline to fail

I have a pipeline defined in AWS Step Functions. One step is defined as a Fargate task, which pulls a Docker image and runs some Python code. To my surprise, if the container running in the Fargate task encounters a runtime error, Step Functions doesn't catch the failed task and continues the pipeline as normal (marking the Fargate task as successful). According to the documentation, the pipeline should fail as soon as this happens.

This is the step function definition:

{
  "Comment": "My state machine",
  "StartAt": "MyFargateTask",
  "States": {
    "MyFargateTask": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "InputPath": "$",
      "Parameters": {
        "Cluster": "my-cluster",
        "TaskDefinition": "arn:aws:ecs:us-east-1:617090640476:task-definition/my-task:1",
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
          "AwsvpcConfiguration": {
            "Subnets": [
              "subnet-xxxxxxxxxxxxxxxxx",
              "subnet-yyyyyyyyyyyyyyyyy"
            ],
            "AssignPublicIp": "ENABLED"
          }
        }
      },
      "Next": "Done"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}

I've tried the following simple Python code for the Fargate container:

def main():
    raise Exception("foobar")

if __name__ == '__main__':
    main()

In the container logs in CloudWatch I can see the program failing as expected, but the Step Functions execution succeeds (all green). What am I missing? Is this a bug?

asked Sep 09 '20 15:09

revy

2 Answers

AWS Step Functions does not know whether an ECS job has succeeded or failed. To find out, it would need to look into the ECS job's container logs and work out whether the process running inside the Docker container exited with a failure code, and that is not something Step Functions does. As you have it configured, Step Functions simply assumes that the task has succeeded whenever the container exits.

If you change arn:aws:states:::ecs:runTask.sync to arn:aws:states:::ecs:runTask.waitForTaskToken, then instead of just waiting for the ECS container to exit, Step Functions will wait for the ECS container to send a success or failure result back to the Step Functions API. You will also need to pass the task token into the ECS container, which can be done with a ContainerOverrides setting, like so:

{
  "Comment": "My state machine",
  "StartAt": "MyFargateTask",
  "States": {
    "MyFargateTask": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.waitForTaskToken",
      "InputPath": "$",
      "Parameters": {
        "Cluster": "my-cluster",
        "TaskDefinition": "arn:aws:ecs:us-east-1:617090640476:task-definition/my-task:1",
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
          "AwsvpcConfiguration": {
            "Subnets": [
              "subnet-xxxxxxxxxxxxxxxxx",
              "subnet-yyyyyyyyyyyyyyyyy"
            ],
            "AssignPublicIp": "ENABLED"
          }
        },
        "Overrides": {
          "ContainerOverrides": [{
            "Environment": [{
              "Name": "TASK_TOKEN",
              "Value.$": "$$.Task.Token"
            }]
          }]
        }
      },
      "Next": "Done"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}

Now inside your Python script you can grab the TASK_TOKEN environment variable, and issue a success or failure message back to Step Functions like so:

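import os

import boto3

# Task token injected by Step Functions through the ContainerOverrides environment.
# .get() returns None when the variable is absent, so the checks below are meaningful.
token = os.environ.get('TASK_TOKEN')


def step_success():
    # Report success to Step Functions; does nothing when no token was provided.
    if token is not None:
        stfn = boto3.client('stepfunctions')
        stfn.send_task_success(taskToken=token, output='{"Status": "Success"}')


def step_fail():
    # Report failure so the waiting state fails instead of waiting for a timeout.
    if token is not None:
        stfn = boto3.client('stepfunctions')
        stfn.send_task_failure(taskToken=token, error="An error occurred")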
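
To make sure one of the two calls is always issued, the script's entry point can be wrapped so that any uncaught exception is reported as a failure. This is only a minimal sketch: report_outcome is a hypothetical helper name, and the client is passed in as an argument (anything exposing send_task_success and send_task_failure, such as boto3.client('stepfunctions')) so the logic can be exercised without AWS access.

```python
import json
import traceback


def report_outcome(work, client, token):
    """Run work() and report its outcome to Step Functions using the task token.

    `client` is assumed to expose send_task_success / send_task_failure,
    for example boto3.client('stepfunctions').
    Returns True if work() completed, False if it raised.
    """
    try:
        result = work()
    except Exception as exc:
        # Both fields are truncated defensively because the API limits their length.
        client.send_task_failure(
            taskToken=token,
            error=type(exc).__name__[:256],
            cause=traceback.format_exc()[-32768:],
        )
        return False
    # The output must be a JSON string; the result is assumed to be JSON-serializable.
    client.send_task_success(taskToken=token, output=json.dumps({"result": result}))
    return True
```

In the container, the entry point would then call something like report_outcome(main, boto3.client('stepfunctions'), os.environ['TASK_TOKEN']) and exit with a non-zero status when it returns False.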

More details on this approach

I also recommend configuring a timeout in the state machine, in case your Python script never starts inside the container or exits without reporting back; otherwise the state would keep waiting for a token response that never arrives. You will also need to add the appropriate IAM permissions to the Fargate task's IAM role so that it can issue these status calls to the Step Functions API.
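
As a rough illustration of those two points, the sketch below builds an IAM policy document for the callback actions and adds a TimeoutSeconds field to a Task state. The helper names are hypothetical, the "*" resource is a placeholder you would probably want to narrow, and the 3600-second value is an arbitrary example rather than a recommendation.

```python
import json


def task_token_policy(resource="*"):
    """Build an IAM policy document allowing the task-token callback calls.

    `resource` defaults to "*" as a placeholder (assumption); tighten as needed.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "states:SendTaskSuccess",
                    "states:SendTaskFailure",
                    "states:SendTaskHeartbeat",
                ],
                "Resource": resource,
            }
        ],
    }


def with_timeout(task_state, seconds):
    """Return a copy of a Task state definition with TimeoutSeconds set."""
    state = dict(task_state)
    state["TimeoutSeconds"] = seconds
    return state


# Print both fragments so they can be pasted into the role and the state machine.
print(json.dumps(task_token_policy(), indent=2))
print(json.dumps(with_timeout({"Type": "Task"}, 3600), indent=2))
```

The policy would be attached to the role the container runs under, and the TimeoutSeconds field goes on the MyFargateTask state next to "Resource".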

answered Oct 24 '22 09:10

Mark B

I think using arn:aws:states:::ecs:runTask.sync will make your Step Functions execution fail, because ECS will show the container status as STOPPED with the detail message "Essential container in task exited".

If you use arn:aws:states:::ecs:runTask (without .sync), Step Functions will not wait for ECS and the step will result in success.

Given your case, you should use arn:aws:states:::ecs:runTask.waitForTaskToken and call sendTaskSuccess or sendTaskFailure with the token provided by Step Functions.
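
If you want to check what ECS actually recorded for the stopped task, one option is to look at the container exit codes in the response of the ECS describe_tasks call. The following is a minimal sketch: failed_containers is a hypothetical helper, and the sample response is made up for illustration and only contains the fields the helper reads.

```python
def failed_containers(describe_tasks_response):
    """List (name, exit code, reason) for containers that did not exit with 0.

    A missing exitCode (container never started or still running) is also
    reported, since it is not a clean exit.
    """
    failures = []
    for task in describe_tasks_response.get("tasks", []):
        for container in task.get("containers", []):
            exit_code = container.get("exitCode")
            if exit_code != 0:
                failures.append(
                    (container.get("name"), exit_code, container.get("reason"))
                )
    return failures


# Hypothetical, trimmed-down response for illustration only.
sample_response = {
    "tasks": [
        {
            "lastStatus": "STOPPED",
            "stoppedReason": "Essential container in task exited",
            "containers": [
                {"name": "my-container", "exitCode": 1},
                {"name": "sidecar", "exitCode": 0},
            ],
        }
    ]
}

print(failed_containers(sample_response))
```

With boto3, the real response would come from boto3.client('ecs').describe_tasks(cluster='my-cluster', tasks=[task_arn]), where task_arn is the ARN of the stopped task.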

answered Oct 24 '22 09:10

speedysinghs.

Tags:

debugging

amazon-web-services

aws-fargate

aws-step-functions
