
How to fix intermittent 503 Service Unavailable after idling/redeployments on AWS HTTP API Gateway & Fargate/ECS?

We have a fairly simple setup that is causing us major headaches:

  1. HTTP API Gateway with an S3 integration for our static HTML/JS and an ANY /api/{proxy+} route to a Fargate service whose tasks are reachable via Cloud Map
  2. ECS cluster with an "API service" running on Fargate; the container task exposes port 8080 via awsvpc. No autoscaling. Min healthy: 100%, max: 200%.
  3. Service discovery using an SRV DNS record with a TTL of 60 seconds
  4. The ECS service and its tasks are almost completely idle and always ready to accept requests, which they log.

Problem:

We receive intermittent HTTP 503 Service Unavailable responses for some of our requests. A new deployment (with task redeployment) increases the error rate, but even 10-15 minutes after a deployment the errors still occur intermittently.

In CloudWatch we see the failing 503 requests:

2020-06-05T14:19:01.810+02:00 xx.117.163.xx - - [05/Jun/2020:12:19:01 +0000] "GET ANY /api/{proxy+} HTTP/1.1" 503 33 Np24bwmwsiasJDQ=

but they do not seem to reach a live backend instance.
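To quantify how the 503 rate changes with idling and deployments, the access log lines can be parsed and counted per minute. The following is only a minimal sketch: it assumes every line ends with the same fields as the sample above (bracketed time, quoted request, status, size, request ID), and the names `parse_line` and `count_503_per_minute` are hypothetical helpers, not part of any AWS tooling.

```python
import re
from collections import Counter

# Assumption: log lines follow the layout of the sample entry in the question.
LINE_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+) (?P<request_id>\S+)\s*$'
)

SAMPLE = (
    '2020-06-05T14:19:01.810+02:00 xx.117.163.xx - - '
    '[05/Jun/2020:12:19:01 +0000] "GET ANY /api/{proxy+} HTTP/1.1" '
    '503 33 Np24bwmwsiasJDQ='
)


def parse_line(line):
    """Return the parsed fields of one access log line, or None if it does not match."""
    match = LINE_RE.search(line)
    return match.groupdict() if match else None


def count_503_per_minute(lines):
    """Count 503 responses per minute, keyed by 'DD/Mon/YYYY:HH:MM'."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["status"] == "503":
            # Keep the first 17 characters of the timestamp, which drops seconds and zone.
            counts[entry["time"][:17]] += 1
    return counts


if __name__ == "__main__":
    print(count_503_per_minute([SAMPLE]))
```

Comparing the per-minute counts with deployment times and idle periods could show whether the errors cluster around task replacement.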

We enabled VPC Flow Logs, and it seems that the HTTP API Gateway tries to route some requests to stopped tasks long after they are gone for good (far exceeding the 60-second TTL).
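One way to confirm this observation is to compare the destination IPs seen in the flow logs with the IPs currently registered in Cloud Map. The sketch below makes several assumptions: the flow log destination addresses have already been extracted into a list, boto3 and credentials are available, and the service ID passed in is a placeholder. `find_stale_targets` and `registered_ips_from_cloud_map` are hypothetical helper names.

```python
def find_stale_targets(flow_log_dst_ips, registered_ips):
    """Return destination IPs from the flow logs that are no longer registered."""
    return sorted(set(flow_log_dst_ips) - set(registered_ips))


def registered_ips_from_cloud_map(service_id):
    """Collect the IPv4 addresses currently registered for a Cloud Map service.

    Assumption: boto3 is installed and credentials are configured; boto3 is
    imported lazily so the comparison helper above works without it.
    """
    import boto3

    client = boto3.client("servicediscovery")
    paginator = client.get_paginator("list_instances")
    ips = set()
    for page in paginator.paginate(ServiceId=service_id):
        for instance in page["Instances"]:
            ip = instance.get("Attributes", {}).get("AWS_INSTANCE_IPV4")
            if ip:
                ips.add(ip)
    return ips


if __name__ == "__main__":
    # Illustrative addresses only, not taken from the original post.
    seen = ["10.0.1.5", "10.0.1.9", "10.0.1.5"]
    live = {"10.0.1.5"}
    print(find_stale_targets(seen, live))
```

Any address this reports that belongs to a task stopped more than 60 seconds earlier would support the suspicion that stale targets are still being used.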

More puzzling: if we keep the system busy, the error rate drops to nearly zero. After a longer idle period, however, the intermittent errors reappear.

Questions

  1. How can we fix this issue?
  2. Are there options to further pinpoint the root issue?
Asked by bentolor, Jun 05 '20 14:06



1 Answer

I faced this issue and solved it by setting my ALB's scheme to internal instead of internet-facing. I hope this helps someone with the same issue.

Context: The environment is API Gateway + ALB (ECS).
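To check which scheme an existing ALB uses, the `Scheme` field returned by the ELBv2 `describe_load_balancers` call can be inspected. This is a hedged sketch, assuming boto3 and credentials are available; the function names are hypothetical and the load balancer name in the example is made up.

```python
def non_internal_load_balancers(describe_response):
    """Return the names of load balancers whose scheme is not 'internal'."""
    return [
        lb["LoadBalancerName"]
        for lb in describe_response.get("LoadBalancers", [])
        if lb.get("Scheme") != "internal"
    ]


def fetch_load_balancers(names):
    """Fetch load balancer descriptions by name.

    Assumption: boto3 is installed and credentials are configured; it is
    imported lazily so the check above can run on a saved response.
    """
    import boto3

    return boto3.client("elbv2").describe_load_balancers(Names=names)


if __name__ == "__main__":
    # Hand-written response fragment for illustration, not real output.
    response = {
        "LoadBalancers": [
            {"LoadBalancerName": "backend-alb", "Scheme": "internet-facing"},
        ]
    }
    print(non_internal_load_balancers(response))
```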

Update: The first ALB I configured manages my backend services. Recently I set up another ALB for my front-end instances, and in this case I exposed a public IP instead of only a private one by changing the scheme to internet-facing. At first I thought this would bring back the same problem, but the fix turned out to be simple: I just needed to add a policy allowing traffic from the internet to the new ALB.

Answered by xalves, Nov 15 '22 19:11