I have a pipeline defined in AWS Step Functions. One step is defined as a Fargate task, which pulls a Docker image and runs some Python code. I was surprised to find that if the container running in the Fargate task encounters a runtime error, Step Functions doesn't catch the failed task: it marks the Fargate task as successful and continues the pipeline as normal, even though according to the documentation the pipeline should fail as soon as this happens.
This is the step function definition:
{
  "Comment": "My state machine",
  "StartAt": "MyFargateTask",
  "States": {
    "MyFargateTask": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.sync",
      "InputPath": "$",
      "Parameters": {
        "Cluster": "my-cluster",
        "TaskDefinition": "arn:aws:ecs:us-east-1:617090640476:task-definition/my-task:1",
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
          "AwsvpcConfiguration": {
            "Subnets": [
              "subnet-xxxxxxxxxxxxxxxxx",
              "subnet-yyyyyyyyyyyyyyyyy"
            ],
            "AssignPublicIp": "ENABLED"
          }
        }
      },
      "Next": "Done"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}
I've tried the following simple python code for the Fargate container:
def main():
    raise Exception("foobar")

if __name__ == '__main__':
    main()
In the container logs on CloudWatch I can see the program failing as expected, but the Step Functions pipeline succeeds (all green). What am I missing? Is this a bug?
AWS Step Functions does not know whether an ECS job has succeeded or failed. Step Functions would need to peek into the ECS job's container log and try to determine whether the process running inside the Docker container exited with a failure code. That's not something Step Functions does. As you have it configured, Step Functions simply assumes that whenever the container exits, the task has succeeded.
If you change arn:aws:states:::ecs:runTask.sync to arn:aws:states:::ecs:runTask.waitForTaskToken, then instead of just waiting for the ECS container to exit, Step Functions will wait for the ECS container to send a success or failure code back to the Step Functions API. You will also need to pass the task token into the ECS container, which can be done with a ContainerOverrides setting, like so:
{
  "Comment": "My state machine",
  "StartAt": "MyFargateTask",
  "States": {
    "MyFargateTask": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ecs:runTask.waitForTaskToken",
      "InputPath": "$",
      "Parameters": {
        "Cluster": "my-cluster",
        "TaskDefinition": "arn:aws:ecs:us-east-1:617090640476:task-definition/my-task:1",
        "LaunchType": "FARGATE",
        "NetworkConfiguration": {
          "AwsvpcConfiguration": {
            "Subnets": [
              "subnet-xxxxxxxxxxxxxxxxx",
              "subnet-yyyyyyyyyyyyyyyyy"
            ],
            "AssignPublicIp": "ENABLED"
          }
        },
        "Overrides": {
          "ContainerOverrides": [{
            "Environment": [{
              "Name": "TASK_TOKEN",
              "Value.$": "$$.Task.Token"
            }]
          }]
        }
      },
      "Next": "Done"
    },
    "Done": {
      "Type": "Succeed"
    }
  }
}
Now inside your Python script you can grab the TASK_TOKEN
environment variable, and issue a success or failure message back to Step Functions like so:
import os

import boto3

# Use .get() so the script also runs outside Step Functions, where TASK_TOKEN is unset
token = os.environ.get('TASK_TOKEN')

def step_success():
    if token is not None:
        stfn = boto3.client('stepfunctions')
        stfn.send_task_success(taskToken=token, output='{"Status": "Success"}')

def step_fail():
    if token is not None:
        stfn = boto3.client('stepfunctions')
        stfn.send_task_failure(taskToken=token, error="An error occurred")
More details on this approach are in the AWS Step Functions documentation on the wait-for-callback (.waitForTaskToken) service integration pattern.
I recommend also configuring a timeout in the state machine (a TimeoutSeconds field on the task state) in case your Python script hangs or is killed before it can send a result. Also, you will need to add the appropriate IAM permissions to the Fargate task's IAM role to allow it to issue these status calls back to the Step Functions API.
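For reference, the task role's extra policy might look something like this (a minimal sketch; the states: action names are the real API names, SendTaskHeartbeat is only needed if you also use HeartbeatSeconds, and Resource is left as "*" here since token callbacks are not tied to a specific resource ARN):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "states:SendTaskSuccess",
        "states:SendTaskFailure",
        "states:SendTaskHeartbeat"
      ],
      "Resource": "*"
    }
  ]
}
```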
I think using arn:aws:states:::ecs:runTask.sync will make your step function fail, because ECS will show the container status as STOPPED with the detail message "Essential container in task exited".
If you use arn:aws:states:::ecs:runTask (without .sync), the step function will not wait on the ECS task at all and will report success.
Given your case, you should use arn:aws:states:::ecs:runTask.waitForTaskToken and call SendTaskSuccess or SendTaskFailure with the token provided by Step Functions.