500 on Google Cloud Run: The request failed because the instance could not start successfully

Tags:

google-cloud-run

I'm load testing an ExpressJS app hosted on Google Cloud Run. When traffic spikes, there is a period where I see many 500 errors in Stackdriver with the message "The request failed because the instance could not start successfully.", which effectively leads to server downtime.

Seeing that this error occurs more frequently as the app scales up, I'm thinking this is caused by the Cloud Run load balancer assigning traffic prematurely to new instances, before these instances are ready to accept requests.

As I continue to run the load test, the instances are continuously and repeatedly killed and restarted, so there is no mechanism for recovery while the load is on.

I don't see any error logs from my NodeJS application, suggesting none of the failed requests actually reached my app.

What can I do to avoid these errors?

How does Cloud Run determine that a port is ready to accept requests?

Is it something I misconfigured in my ExpressJS app or can I somehow delay Cloud Run a bit before sending requests to a new instance?


asked Nov 12 '19 03:11

Hans

1 Answer

This turned out to be caused by a combination of Cloud Run's auto-scaling maximum instance limit and Cloud SQL's connection limit.

I was running a small Cloud SQL Postgres instance (3.75 GB / 1 vCPU), which comes with a default connection limit of 100 (https://cloud.google.com/sql/docs/quotas).

By default, Cloud Run assigns a maximum instance count of 1000 for auto-scaling. During the load test, the sudden spike in request count pushed the auto-scaling to create hundreds of instances, which quickly exhausted the Cloud SQL connection limit of 100.

This exact scenario is documented for Cloud SQL: https://cloud.google.com/sql/docs/postgres/connect-run#connection_limits_3 (it would be nice if this were also documented on Cloud Run; it did not immediately occur to me to look at the Cloud SQL documentation when this issue occurred).

The solution is a combination of limiting the maximum instance count on Cloud Run to a number that is tolerable, and adjusting resource allocation / maximum connection limit on Cloud SQL. The exact configuration would obviously depend on the expected level of load.
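As a rough way to pick those numbers, here is a minimal sketch of the arithmetic, assuming each Cloud Run instance holds its own connection pool of a fixed size. The function names, the pool size of 5, and the 10 reserved connections are all hypothetical values chosen for illustration; they are not part of any Cloud Run or Cloud SQL API.

```javascript
// Hypothetical sizing helpers: these names are illustrative only and do not
// come from Cloud Run, Cloud SQL, or any database client library.

// Largest instance count whose combined pools still fit under the database
// connection limit, after setting aside some connections for other clients.
function maxSafeInstances(dbMaxConnections, poolSizePerInstance, reservedConnections = 0) {
  if (poolSizePerInstance <= 0) {
    throw new RangeError("poolSizePerInstance must be positive");
  }
  const usable = dbMaxConnections - reservedConnections;
  return Math.max(0, Math.floor(usable / poolSizePerInstance));
}

// Worst-case number of connections if every instance fills its pool.
function peakConnections(instanceCount, poolSizePerInstance) {
  return instanceCount * poolSizePerInstance;
}

// With the 100-connection limit from above, an assumed pool of 5 per
// instance, and 10 connections assumed reserved for admin/maintenance use:
console.log(maxSafeInstances(100, 5, 10)); // 18

// An uncapped spike to 300 instances with the same assumed pool size:
console.log(peakConnections(300, 5)); // 1500
```

Under these assumptions, the first result would be the value to set as the Cloud Run maximum instance count, and the second shows how far a few hundred instances can overshoot a limit of 100. If the expected load needs more instances than that, the connection limit or the instance size on the Cloud SQL side has to go up instead.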

answered Sep 30 '22 18:09

Hans
