
AWS ECS service Tasks getting replaced with (reason Request timed out)

We have been running ECS as our container orchestration layer for more than two years, but there is one problem we have not been able to figure out. In a few of our (Node.js) services we have started observing errors in the ECS service events such as:

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

This causes our dependent services to start experiencing 504 Gateway Timeout errors, which impacts them in a big way.

Things we have tried so far:

  1. Upgraded the Docker storage driver from devicemapper to overlay2.

  2. Increased the resources (CPU, RAM and EBS storage) for all ECS instances, after seeing resource pressure on a few containers.

  3. Increased the health check grace period for the service from 0 to 240 seconds.

  4. Increased KeepAliveTimeout and SocketTimeout to 180 seconds (see the sketch after this list).

  5. Enabled awslogs on the containers instead of stdout, but there was no unusual behavior in the logs.

  6. Enabled ECSMetaData at the container level and pipelined all of it into our application logs. This let us inspect the logs for the problematic container only.

  7. Enabled Container Insights for better container-level debugging.
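
To illustrate step 4, here is a minimal sketch of the kind of timeout configuration we mean, assuming a plain Node.js http server (the values and port are illustrative, not our production code):

const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('ok');
});

// Keep-alive must outlive the ALB idle timeout (60s by default),
// otherwise the ALB may reuse a socket the server just closed.
server.keepAliveTimeout = 180 * 1000; // the 180 seconds from step 4
// headersTimeout must be greater than keepAliveTimeout.
server.headersTimeout = 185 * 1000;

server.listen(3000);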

Of all of these, what helped the most was upgrading from devicemapper to the overlay2 storage driver and increasing the health check grace period.

The number of errors has come down dramatically with these two changes, but we still hit this issue once in a while.

We have reviewed all the instance- and container-level graphs for the task that went down; below are the logs for it:

ECS Container Insights logs for the victim container:

Query:

fields CpuUtilized, MemoryUtilized, @message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"

Example log returned:

{
  "Version": "0",
  "Type": "Container",
  "ContainerName": "example-service",
  "TaskId": "dac7a872-5536-482f-a2f8-d2234f9db6df",
  "TaskDefinitionFamily": "example-service",
  "TaskDefinitionRevision": "2048",
  "ContainerInstanceId": "74306e00-e32a-4287-a201-72084d3364f6",
  "EC2InstanceId": "i-016b0a460d9974567",
  "ServiceName": "example-service",
  "ClusterName": "example-service-cluster",
  "Timestamp": 1569227760000,
  "CpuUtilized": 1024.144923245614,
  "CpuReserved": 1347.0,
  "MemoryUtilized": 871,
  "MemoryReserved": 1857,
  "StorageReadBytes": 0,
  "StorageWriteBytes": 577536,
  "NetworkRxBytes": 14441583,
  "NetworkRxDropped": 0,
  "NetworkRxErrors": 0,
  "NetworkRxPackets": 17324,
  "NetworkTxBytes": 6136916,
  "NetworkTxDropped": 0,
  "NetworkTxErrors": 0,
  "NetworkTxPackets": 16989
}

None of the logs showed ridiculously high CPU or memory utilization; the log above, for example, shows roughly 1024 of 1347 reserved CPU units (~76%) and 871 of 1857 reserved memory (~47%).

We stopped getting responses from the victim container at, say, t1; dependent services started seeing errors at t1+2min, and the container was taken away by ECS at t1+3min.

Our health check configuration is below:

Protocol HTTP
Path /healthcheck
Port traffic port
Healthy threshold 10
Unhealthy threshold 2
Timeout 5 seconds
Interval 10 seconds
Success codes 200
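
A rough sanity check on this timeline, assuming nothing else intervenes: with an interval of 10 seconds, a timeout of 5 seconds and an unhealthy threshold of 2, a target that stops responding at t1 should fail two consecutive checks and be marked unhealthy about 20 to 30 seconds later (2 × 10s interval, plus up to 5s for each timed-out check). The remaining gap before ECS replaced the task at t1+3min is presumably the target group's deregistration delay plus the scheduler's replacement cycle, though we have not traced those events.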

Let me know if you need any more information; I will be happy to provide it. The configuration we are running is:

docker info
Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this, but as mentioned, we found nothing that pointed to the cause.

asked Sep 23 '19 09:09 by mohit3081989

1 Answer

Your steps 1 to 7 have almost nothing to do with the error.

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

The error is very clear: your ECS service is not reachable by the load balancer's health check.

Target Group Unhealthy

When this is the case, go straight to checking the container's security group (SG), port, application status, and health check status code.

Possible reasons

  • There might be no route for the path /healthcheck in the backend service (a minimal sketch of such a route follows below)
  • The status code returned from /healthcheck is not 200
  • The target port may be configured incorrectly; if the application runs on port 8080 or 3000, the target group port should be 8080 or 3000 respectively
  • The security group is not allowing traffic on the target group port
  • The application is not running in the container

These are the possible reasons for a timeout from the health check.
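
For the first two bullets, a minimal sketch of a /healthcheck route that would satisfy this target group configuration, assuming a plain Node.js http server (the port is illustrative and must match the target group's traffic port):

const http = require('http');

http.createServer((req, res) => {
  if (req.url === '/healthcheck') {
    res.writeHead(200);   // the target group's configured success code
    return res.end('OK');
  }
  res.writeHead(404);
  res.end('Not Found');
}).listen(3000);          // must be the port the target group routes to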

answered Oct 03 '22 22:10 by Adiii