We have been running ECS as our container orchestration layer for more than two years, but there is one problem we have not been able to find the cause of. In a few of our (Node.js) services we have started observing errors in ECS events such as:
service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)
This causes dependent services to start experiencing 504 Gateway Timeout errors, which impacts them badly. Things we have tried so far:
Upgraded Docker storage driver from devicemapper to overlay2
Increased CPU, RAM, and EBS storage for all ECS instances, since a few containers appeared resource-constrained
Increased the health check grace period for the service from 0 to 240 seconds
Increased KeepAliveTimeout and SocketTimeout to 180 seconds
Enabled the awslogs log driver on containers instead of stdout, but it showed nothing unusual
Enabled ECS metadata at the container level and piped that information into our application logs, which let us filter logs down to the problematic container only
Enabled container insights for better container level debugging
Of these, the changes that helped most were upgrading from devicemapper to the overlay2 storage driver and increasing the health check grace period. The number of errors has come down dramatically with these two, but we still hit the issue once in a while.
We have looked at all the instance- and container-level graphs for the tasks that went down; below are the logs for one of them.
ECS Container Insights logs for the victim container:
Query :
fields CpuUtilized, MemoryUtilized, @message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"
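Since we run this same query for every incident, it can be parameterized per instance/task pair; a small helper of our own (the function name is hypothetical) that builds the identical query string:

```javascript
// Build the Container Insights query shown above for any instance/task pair,
// so the same investigation can be repeated per incident. Helper name is ours.
function buildContainerQuery(ec2InstanceId, taskId) {
  return [
    'fields CpuUtilized, MemoryUtilized, @message',
    `| filter Type = "Container" and EC2InstanceId = "${ec2InstanceId}"` +
      ` and TaskId = "${taskId}"`,
  ].join('\n');
}

// The resulting string can be passed as the query to CloudWatch Logs Insights
// against the Container Insights performance log group.
```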
Example log entry returned:
{
  "Version": "0",
  "Type": "Container",
  "ContainerName": "example-service",
  "TaskId": "dac7a872-5536-482f-a2f8-d2234f9db6df",
  "TaskDefinitionFamily": "example-service",
  "TaskDefinitionRevision": "2048",
  "ContainerInstanceId": "74306e00-e32a-4287-a201-72084d3364f6",
  "EC2InstanceId": "i-016b0a460d9974567",
  "ServiceName": "example-service",
  "ClusterName": "example-service-cluster",
  "Timestamp": 1569227760000,
  "CpuUtilized": 1024.144923245614,
  "CpuReserved": 1347.0,
  "MemoryUtilized": 871,
  "MemoryReserved": 1857,
  "StorageReadBytes": 0,
  "StorageWriteBytes": 577536,
  "NetworkRxBytes": 14441583,
  "NetworkRxDropped": 0,
  "NetworkRxErrors": 0,
  "NetworkRxPackets": 17324,
  "NetworkTxBytes": 6136916,
  "NetworkTxDropped": 0,
  "NetworkTxErrors": 0,
  "NetworkTxPackets": 16989
}
None of the logs showed CPU or memory utilization that was ridiculously high.
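For reference, the arithmetic behind that claim, using the record above (CpuUtilized/CpuReserved are in CPU units, MemoryUtilized/MemoryReserved in MiB); the helper is just our illustration:

```javascript
// Utilization ratios from a Container Insights record: utilized vs reserved.
function utilizationPercent(record) {
  return {
    cpu: (record.CpuUtilized / record.CpuReserved) * 100,
    memory: (record.MemoryUtilized / record.MemoryReserved) * 100,
  };
}

const sample = {
  CpuUtilized: 1024.144923245614,
  CpuReserved: 1347.0,
  MemoryUtilized: 871,
  MemoryReserved: 1857,
};
// For the record above: roughly 76% CPU and 47% memory -- busy, but not
// saturated, which is why we do not suspect simple resource exhaustion.
```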
We stopped getting responses from the victim container at, say, t1; dependent services started seeing errors at t1+2 min, and the container was taken away by ECS at t1+3 min.
Our health check configuration is below:
Protocol HTTP
Path /healthcheck
Port traffic port
Healthy threshold 10
Unhealthy threshold 2
Timeout 5
Interval 10
Success codes 200
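For context, the timings implied by that configuration (a rough worst-case calculation of ours, not numbers the load balancer reports):

```javascript
// Rough worst-case timings implied by the target-group settings above.
function healthCheckTimings({ interval, timeout, unhealthyThreshold, healthyThreshold }) {
  return {
    // consecutive failures, probes spaced `interval` seconds apart, each
    // allowed up to `timeout` seconds before it counts as failed
    secondsToUnhealthy: unhealthyThreshold * interval + timeout,
    // a replacement task needs this many consecutive passes to take traffic
    secondsToHealthy: healthyThreshold * interval,
  };
}

const t = healthCheckTimings({ interval: 10, timeout: 5, unhealthyThreshold: 2, healthyThreshold: 10 });
// ~25s to mark the target unhealthy, but ~100s before a fresh task passes
// 10 checks -- which roughly matches the 2-3 minute window of 504s we see.
```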
Let me know if you need any more information; I will be happy to provide it. The configuration we are running is:
docker info
Containers: 11
Running: 11
Paused: 0
Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this, but as mentioned, we found nothing that pointed to the cause.
Your steps 1 through 7 have almost nothing to do with the error.
service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)
The error is very clear: your ECS service is not reachable by the load balancer's health check.
Target Group Unhealthy
When this is the case, go straight to checking the container's security group, port mapping, application status, and the health check status code.
Possible reasons:
The /healthcheck path in the backend service is not returning 200
The container is listening on a different port (for example 3000 or 8080) than the one registered with the target group
These are the most common reasons when there is a timeout from the health check.