Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

500 on Google Cloud Run: The request failed because the instance could not start successfully

I'm doing load testing on an ExpressJS app hosted on Google Cloud Run, upon spike increase in traffic, there is a period where I see many 500 errors in Stackdriver with the message "The request failed because the instance could not start successfully." - which effectively leads to server downtime.

Seeing that this error occurs more frequently as the app scales up, I'm thinking this is caused by the Cloud Run load balancer assigning traffic prematurely to new instances, before these instances are ready to accept requests.

As I continue to run the load test, the instances are continuously and repeatedly killed and restarted, so there is no mechanism for recovery while the load is on.

I don't see any error logs from my NodeJS application, suggesting none of the failed requests actually reached my app.

What can I do to avoid these errors?

How does Cloud Run determine that a port is ready to accept requests?

Is it something I misconfigured in my ExpressJS app or can I somehow delay Cloud Run a bit before sending requests to a new instance?

like image 419
Hans Avatar asked Nov 12 '19 03:11

Hans


People also ask

What is an error 500 internal server error?

The HTTP status code 500 is a generic error response. It means that the server encountered an unexpected condition that prevented it from fulfilling the request. This error is usually returned by the server when no other error code is suitable.

What is Cloudrun?

Cloud Run is a managed compute platform that enables you to run containers that are invocable via requests or events. Cloud Run is serverless: it abstracts away all infrastructure management, so you can focus on what matters most — building great applications.

Does not have permission to access apps instance or it may not exist ): The caller does not have permission?

The caller does not have permission to access project This error occurs if the account that you used to deploy your app does not have permission to deploy apps for the current project. To resolve this issue, grant the App Engine Deployer ( roles/appengine. deployer ) role to the account.


1 Answers

This turned out to be caused by a combination of Cloud Run auto-scaling maximum instance limit and Cloud SQL's connection limit.

I was running a small Cloud SQL Postgres instance (3.75 GB / 1 vCPU) which comes with a default connection limit of 100. (https://cloud.google.com/sql/docs/quotas)

By default, Cloud Run assigns a maximum instance count of 1000 for auto-scaling. During the load test, the sudden spike in request count pushed the auto-scaling to create hundreds of instances, which quickly exhausted the Cloud SQL connection limit of 100.

This exact scenario is documented for Cloud SQL: https://cloud.google.com/sql/docs/postgres/connect-run#connection_limits_3 (it would be nice if this is also documented on Cloud Run, it did not immediately occur to me to look for documentation on Cloud SQL when this issue occurred)

The solution is a combination of limiting the maximum instance count on Cloud Run to a number that is tolerable, and adjusting resource allocation / maximum connection limit on Cloud SQL. The exact configuration would obviously depend on the expected level of load.

like image 99
Hans Avatar answered Sep 30 '22 18:09

Hans