
Auto Scale Fargate Service Based On SQS ApproximateNumberOfMessagesVisible

I would like to scale out my AWS Fargate containers based on the size of the SQS queue. It appears that I can only scale based on the container's CPU or memory usage. Is there a way to create a policy that would scale out and in based on queue size? Has anyone been able to scale based on other CloudWatch metrics?

asked Oct 09 '18 by quasar


People also ask

Does SQS have Auto Scaling?

As you can read in the Scaling Based on Amazon SQS tutorial in the Auto Scaling documentation, you can use the number of messages stored in an SQS queue as an indicator of the amount of work that is waiting in line for eventual processing within an Auto Scaling Group comprised of a variable number of EC2 instances.

Does AWS fargate auto scale?

You can increase or decrease your desired task count by integrating Amazon ECS on Fargate with Amazon CloudWatch alarms and Application Auto Scaling.

Can SQS trigger fargate?

You can use an SQS-triggered Lambda function that starts a Fargate task.

What is an appropriate metric for Auto Scaling with SQS in AWS?

The number of instances in your Auto Scaling group can be driven by multiple factors, including how long it takes to process a message and the acceptable amount of latency (queue delay). The solution is to use a backlog-per-instance metric, with the target value being the acceptable backlog per instance to maintain. For example, if processing a message takes about 0.1 seconds and 10 seconds of queue delay is acceptable, the acceptable backlog per instance is 10 / 0.1 = 100 messages.


2 Answers

Yes, you can do this. You have to use a step scaling policy, and you need to have an alarm already created for your SQS queue depth (ApproximateNumberOfMessagesVisible).

Go to CloudWatch and create a new alarm. We'll call this alarm sqs-queue-depth-high, and have it trigger when ApproximateNumberOfMessagesVisible reaches 1000.
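If you'd rather define that alarm as code instead of clicking through the console, a minimal CloudFormation sketch could look like the following (the queue name, statistic, and period are assumptions to adapt to your setup):

SqsQueueDepthHighAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: sqs-queue-depth-high
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: my-queue # assumption: the name of your queue
      Statistic: Maximum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 1000
      ComparisonOperator: GreaterThanOrEqualToThreshold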

With that done, go to ECS and open the service you want to autoscale. Click Update for the service, add a scaling policy, and choose the step scaling variety. You'll see there's an option to create a new alarm (which only lets you choose between CPU and memory utilization) or to use an existing alarm.

Type sqs-queue-depth-high in the "Use existing alarm" field and press Enter; you should see a green checkmark that lets you know the name is valid (i.e. the alarm exists). New dropdowns will appear where you can adjust the step policy.

This works for any metric alarm and any ECS service. If you're going to scale this setup out (to multiple environments, for example) or make it any more sophisticated than 2 steps, do yourself a favor and manage it with CloudFormation or Terraform. Nothing is worse than having to adjust a 5-step alarm across 10 services.
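For reference, the equivalent step scaling policy in CloudFormation might look roughly like this (the scalable target reference and the step boundaries are assumptions; the bounds are relative to the alarm threshold of 1000):

ServiceStepScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: sqs-queue-depth-step-scaling
      PolicyType: StepScaling
      ScalingTargetId: !Ref ServiceScalableTarget # assumption: a scalable target registered for the ECS service
      StepScalingPolicyConfiguration:
        AdjustmentType: ChangeInCapacity
        Cooldown: 60
        MetricAggregationType: Maximum
        StepAdjustments:
          - MetricIntervalLowerBound: 0 # 1000-2000 visible messages: add one task
            MetricIntervalUpperBound: 1000
            ScalingAdjustment: 1
          - MetricIntervalLowerBound: 1000 # more than 2000 visible messages: add two tasks
            ScalingAdjustment: 2

The alarm then needs this policy's ARN in its AlarmActions to actually trigger the steps, and you would add a second alarm and policy with negative adjustments to scale back in.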

answered Oct 24 '22 by bluescores


AWS provides a solution for scaling based on an SQS queue: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html

Main idea

  1. Create a CloudWatch custom metric sqs-backlog-per-task using the formula: sqs-backlog-per-task = sqs-messages-number / running-task-number.
  2. Create a Target Tracking Scaling Policy based on the sqs-backlog-per-task metric.

Implementation details

Custom Metric

In my case all the infrastructure (Fargate, SQS, and other resources) is described in a CloudFormation stack. So for calculating and logging the custom metric I decided to use an AWS Lambda function, which is also described in the CloudFormation stack and deployed together with the entire infrastructure.

Below you can find code snippets for the AWS Lambda function that logs the following custom metrics:

  • sqs-backlog-per-task - used for scaling
  • running-task-number - used for scaling optimization and debugging

The AWS Lambda function, described in AWS SAM syntax in the CloudFormation stack (infrastructure.yml):

CustomMetricLoggerFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: custom-metric-logger
      Handler: custom-metric-logger.handler
      Runtime: nodejs8.10
      MemorySize: 128
      Timeout: 3
      Role: !GetAtt CustomMetricLoggerFunctionRole.Arn
      Environment:
        Variables:
          ECS_CLUSTER_NAME: !Ref Cluster
          ECS_SERVICE_NAME: !GetAtt Service.Name
          SQS_URL: !Ref Queue
      Events:
        Schedule:
          Type: Schedule
          Properties:
            Schedule: 'cron(0/1 * * * ? *)' # every one minute
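The CustomMetricLoggerFunctionRole referenced above is not shown here; a minimal sketch of such a role, granting only what the function calls below need, could be the following (the broad Resource: '*' scoping is an assumption you would normally tighten to the specific queue and service):

CustomMetricLoggerFunctionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole # CloudWatch Logs access
      Policies:
        - PolicyName: custom-metric-logger-permissions
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - sqs:GetQueueAttributes
                  - ecs:DescribeServices
                  - cloudwatch:PutMetricData
                Resource: '*' # assumption: restrict where possible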

The AWS Lambda JavaScript code for calculating and logging the metrics (custom-metric-logger.js):

var AWS = require('aws-sdk');

exports.handler = async () => {
  try {
    var sqsMessagesNumber = await getSqsMessagesNumber();
    var runningContainersNumber = await getRunningContainersNumber();

    // Backlog per task: how many visible messages each running task would have
    // to process. Fall back to the raw queue size when no tasks are running,
    // so the metric can still trigger an initial scale-out.
    var backlogPerInstance = sqsMessagesNumber;
    if (runningContainersNumber > 0) {
      backlogPerInstance = Math.floor(sqsMessagesNumber / runningContainersNumber);
    }

    await putRunningTaskNumberMetricData(runningContainersNumber);
    await putSqsBacklogPerTaskMetricData(backlogPerInstance);

    return {
      statusCode: 200
    };
  } catch (err) {
    console.log(err);

    return {
      statusCode: 500
    };
  }
};

// Reads the approximate number of visible messages from the SQS queue.
function getSqsMessagesNumber() {
  return new Promise((resolve, reject) => {
    var data = {
      QueueUrl: process.env.SQS_URL,
      AttributeNames: ['ApproximateNumberOfMessages']
    };

    var sqs = new AWS.SQS();
    sqs.getQueueAttributes(data, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(parseInt(data.Attributes.ApproximateNumberOfMessages, 10));
      }
    });
  });
}

// Reads the number of currently running tasks for the ECS service.
function getRunningContainersNumber() {
  return new Promise((resolve, reject) => {
    var data = {
      services: [
        process.env.ECS_SERVICE_NAME
      ],
      cluster: process.env.ECS_CLUSTER_NAME
    };

    var ecs = new AWS.ECS();
    ecs.describeServices(data, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data.services[0].runningCount);
      }
    });
  });
}

// Publishes the running-task-number custom metric to CloudWatch.
function putRunningTaskNumberMetricData(value) {
  return new Promise((resolve, reject) => {
    var data = {
      MetricData: [{
        MetricName: 'running-task-number',
        Value: value,
        Unit: 'Count',
        Timestamp: new Date()
      }],
      Namespace: 'fargate-sqs-service'
    };

    var cloudwatch = new AWS.CloudWatch();
    cloudwatch.putMetricData(data, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
  });
}

// Publishes the sqs-backlog-per-task custom metric to CloudWatch.
function putSqsBacklogPerTaskMetricData(value) {
  return new Promise((resolve, reject) => {
    var data = {
      MetricData: [{
        MetricName: 'sqs-backlog-per-task',
        Value: value,
        Unit: 'Count',
        Timestamp: new Date()
      }],
      Namespace: 'fargate-sqs-service'
    };

    var cloudwatch = new AWS.CloudWatch();
    cloudwatch.putMetricData(data, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
  });
}

Target Tracking Scaling Policy

Then, based on the sqs-backlog-per-task metric, I created a Target Tracking Scaling Policy in my CloudFormation template.

Target Tracking Scaling Policy based on the sqs-backlog-per-task metric (infrastructure.yml):

ServiceScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: service-scaling-policy
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref ServiceScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        ScaleInCooldown: 60
        ScaleOutCooldown: 60
        CustomizedMetricSpecification:
          Namespace: fargate-sqs-service
          MetricName: sqs-backlog-per-task
          Statistic: Average
          Unit: Count
        TargetValue: 2000
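The ServiceScalableTarget referenced by ScalingTargetId is not shown here; a sketch of such a scalable target for the ECS service could look like the following (the capacity limits and the service-linked role ARN are assumptions):

ServiceScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      ServiceNamespace: ecs
      ScalableDimension: ecs:service:DesiredCount
      ResourceId: !Sub 'service/${Cluster}/${Service.Name}'
      MinCapacity: 1 # assumption
      MaxCapacity: 10 # assumption
      RoleARN: !Sub 'arn:aws:iam::${AWS::AccountId}:role/aws-service-role/ecs.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_ECSService' # assumption: the Application Auto Scaling service-linked role for ECS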

As a result, AWS Application Auto Scaling creates and manages the CloudWatch alarms that trigger the scaling policy and calculates the scaling adjustment based on the metric and the target value. The scaling policy adds or removes capacity as required to keep the metric at, or close to, the specified target value. In addition to keeping the metric close to the target value, a target tracking scaling policy also adjusts to changes in the metric caused by a changing load pattern.

answered Oct 24 '22 by Volodymyr Machula