
Best practices to manage Docker containers in the AWS ECS service

Tech Stack

Python (monolith API) with the Flask framework, backed by PostgreSQL

We have deployed the Docker containers as follows:

  • The Docker image is stored in ECR
  • The Docker containers are deployed in ECS
  • In total, 25 Docker containers are deployed across 3 r5.large EC2 instances (2 vCPU, 16 GB each)
  • 1024 MB minimum / 3072 MB maximum memory is allocated to each container, so each EC2 instance holds up to 15 containers (see the task-definition sketch after this list)
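For reference, here is a minimal sketch of roughly how those 1024/3072 MB soft/hard limits might be expressed in an ECS task definition registered with boto3; the family name, image URI, and port are placeholders, not our actual values:

```python
# Hypothetical sketch of the 1024 MB soft / 3072 MB hard memory limits
# in an ECS task definition (all names below are placeholders).
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="flask-api",                      # placeholder family name
    networkMode="bridge",                    # typical for the EC2 launch type
    containerDefinitions=[
        {
            "name": "flask-api",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/flask-api:latest",  # placeholder ECR image
            "memoryReservation": 1024,       # soft limit (memory reserved on the instance)
            "memory": 3072,                  # hard limit (container is killed if it exceeds this)
            "essential": True,
            "portMappings": [{"containerPort": 5000, "hostPort": 0}],  # dynamic host port
        }
    ],
)
```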

These days we are facing downtime caused by OOM (Out of Memory) errors: the containers on the affected EC2 instance start moving to another EC2 instance. When, for some reason, the 3rd EC2 instance is not available, we face downtime for that set of containers until the 2nd instance is up and running again.

So we want to check whether the strategy we are using is the correct one.

We are also now planning to use smaller EC2 instances holding fewer containers, so that if an issue happens, only a small number of sites go down instead of all 15. Are we going in the right direction?

Should we move to Fargate? What would the cost implication be compared to running ECS on EC2?

It would be great if somebody could help me find the right solution for this kind of issue.

In the near future we will have containers in the hundreds, possibly reaching 500, so we have to decide on the best strategy for deployment, failover, and high availability.

asked Feb 03 '21 by Manish Joisar

2 Answers

If you're getting OOM errors, it means that your EC2 machines are over-committed: you're running too many containers on them. At 15 containers per instance and a maximum of 3072 MB for each container, you're talking about 46 GB of possible memory usage on machines that cap out at 16 GB. Once enough containers use the memory they're allocated, your machines are going to fall over, taking all the other tasks out with them.

So the first thing you can do right now is lower the number of tasks per machine or lower the max memory so the tasks have less memory to use overall. Since you only have 2 vCPUs on each machine, I would suggest tuning it so each machine runs two tasks total, with the memory split between them, making sure other settings (max connections, workers, etc.) are raised accordingly.
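As a rough illustration of that sizing (the numbers and names here are assumptions, not a drop-in config), splitting roughly 14 GB between two tasks on a 16 GB r5.large might look like this with boto3:

```python
# Rough sketch of the "two tasks per r5.large" sizing: ~7 GB per task,
# leaving headroom for the OS and the ECS agent. Names are placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="flask-api",
    containerDefinitions=[
        {
            "name": "flask-api",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/flask-api:latest",
            "cpu": 1024,        # 1 vCPU of the instance's 2 vCPUs
            "memory": 7168,     # hard limit: ~7 GB per task, two tasks per 16 GB host
            "essential": True,
            # Remember to raise app-level settings (gunicorn workers,
            # DB pool size, etc.) to match the larger memory allotment.
        }
    ],
)

# Roll the new revision out to the service.
ecs.update_service(
    cluster="production",       # placeholder cluster name
    service="flask-api",
    taskDefinition="flask-api", # latest ACTIVE revision of the family
)
```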

You also asked about Fargate. My company uses both EC2 and Fargate for our containers, and we have a policy that if there isn't a specific reason to run things on EC2 (such as needing GPUs) we put it on Fargate. While it costs a bit more money (not as much with Compute Savings Plans, but still more), the benefits are really nice. It means each task runs separately, reducing the chance of one task taking out a bunch of others. It also means a faster scale-up period, because we don't have to wait for EC2 instances to scale up and join the cluster, which is really important if you're using application auto scaling to respond to a sudden influx of traffic.
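For what it's worth, service auto scaling is configured the same way for both launch types; a hedged sketch using boto3's Application Auto Scaling client (the cluster and service names are placeholders) might look like this:

```python
# Sketch of ECS service auto scaling (target tracking on average CPU),
# identical for Fargate and EC2 launch types. Names are placeholders.
import boto3

aas = boto3.client("application-autoscaling")

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/production/flask-api",   # cluster/service placeholder
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

aas.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/production/flask-api",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,   # keep average CPU around 60%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```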

The biggest benefit of Fargate is decreased complexity, which in turn means our team has less to worry about; the time and stress savings for the devs can be far more valuable than the extra money spent. The simple fact that we never have to worry about things like upgrading the ECS agent or integrating with Patch Manager for security updates, and that we don't need to cycle machines regularly to replace them with new builds, means we can spend time on other parts of our infrastructure instead.

As I mentioned above, though, there are cases where Fargate isn't appropriate. For us, the biggest use case for running on EC2 instances is being able to select the GPU types we use for running ML. For this we built our own AMI (Amazon Machine Image) that works with the various GPU instances AWS offers. This is basically the only place where I'm not using Fargate, as those models need the EC2 instance GPUs.

answered Nov 11 '22 by Robert Hafner


Joisar,

We also faced the same thing, so here is some info on how I see it.

After reading your specs, I can draw some numbers. As you mentioned, you are using 3 EC2 instances of type r5.large (2 vCPU, 16 GB memory). This means you have:

Total CPU = 6 vCPUs (6144 CPU units) and Total Memory = 48 GB

The max memory specified in your configuration is 3072 MB, and you mentioned 25 containers deployed over these 3 instances. [Not sure how, unless some of the containers have less memory.]

First of all, with these specs you cannot have more than 5 containers on a single EC2 instance. See the calculation below:

16 GB = 1024 * 16 = 16384 MB.

16384 / 3072 = 5.3, which means at most 5 containers in a single EC2 instance.
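(A quick way to sanity-check that arithmetic in Python, with no AWS calls involved:)

```python
# Capacity check: how many containers fit if every one hits its hard limit.
instance_memory_mb = 16 * 1024   # r5.large: 16 GB = 16384 MB
hard_limit_mb = 3072             # per-container hard limit

max_containers = instance_memory_mb // hard_limit_mb
print(max_containers)            # 5 -> at most 5 containers per instance
```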

But remember that you are launching containers on ECS's EC2 instances, and the EC2 instance needs free memory of its own for the operating system and the ECS agent. You are NOT giving EC2 much free memory, because you have allocated nearly all of it to your containers. [I am assuming the worst case, when all 5 containers utilize 3072 MB of memory.] That is where you run out of memory. You have to choose the max memory number such that the EC2 instance keeps some free memory for its own operation.

The advantages of reducing max memory are:

  1. There is more free memory left for the EC2 instance itself
  2. You can run 2 tasks (with the reduced size) for each service in ECS; in this way you achieve high availability (see the sketch below)
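One possible reading of point 2, sketched with boto3 (the cluster and service names are placeholders): keep the desired count of each service at 2, so a single task or instance failure does not take the whole site down.

```python
# Keep two copies (tasks) of each service running for high availability.
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="production",    # placeholder cluster name
    service="flask-api",     # placeholder service name
    desiredCount=2,          # two tasks of the (smaller) task definition per service
)
```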

Try to analyze which containers use more memory; allocate more to those and less to the others. You have to balance the memory across the containers. That balancing can be a pain point for many, and this is where Fargate can save us.
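To find out which services actually use the memory, you can pull the ECS MemoryUtilization metric from CloudWatch; a hedged sketch (cluster and service names are placeholders):

```python
# Pull per-service memory utilization from CloudWatch to decide which
# services need a bigger allocation. Names below are placeholders.
import boto3
from datetime import datetime, timedelta

cw = boto3.client("cloudwatch")

stats = cw.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="MemoryUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "production"},
        {"Name": "ServiceName", "Value": "flask-api"},
    ],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=3600,                       # hourly datapoints
    Statistics=["Average", "Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), round(point["Maximum"], 1))
```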

You also mention that you are planning to change the EC2 instance size. Go for memory-optimised instances. And yes, Fargate can be best, but it comes at a higher cost.

Then, for high scalability, define auto scaling policies. The policies should also account for the fact that nights usually have less traffic, so you can reduce the number of EC2 machines in the cluster at night. With this you save cost, and you can spend the saved cost on more EC2 capacity during peak hours. A sketch of such a scheduled scale-down follows.
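One way to do the night-time scale-down is with scheduled actions on the cluster's EC2 Auto Scaling group; a sketch, assuming a group name and UTC schedule that are placeholders:

```python
# Scheduled scale-down/scale-up of the ECS cluster's EC2 Auto Scaling group.
# The group name and cron times below are assumptions, not real values.
import boto3

asg = boto3.client("autoscaling")

# Scale the cluster down at 22:00 UTC every night...
asg.put_scheduled_update_group_action(
    AutoScalingGroupName="ecs-cluster-asg",
    ScheduledActionName="night-scale-down",
    Recurrence="0 22 * * *",
    MinSize=1,
    DesiredCapacity=1,
)

# ...and bring it back up before the morning peak.
asg.put_scheduled_update_group_action(
    AutoScalingGroupName="ecs-cluster-asg",
    ScheduledActionName="morning-scale-up",
    Recurrence="0 6 * * *",
    MinSize=3,
    DesiredCapacity=3,
)
```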

In the end, you have to come up with your own numbers and monitor them, and no, it's not a one-day process. It is an evolving process.

answered Nov 11 '22 by bhavuk bhardwaj