Some 502 errors in GCP HTTP Load Balancing


Our load balancer is returning 502 errors for some requests. It is only a small fraction of the total: we handle around 36,000 requests per hour and see about 40 errors per hour, so roughly 0.1% of requests return an error.

The instances are healthy when the errors occur, and we have added this rule to the firewall for the load balancer: tcp:1-5000, applied to all targets.

It is not a very serious problem because the application tolerates such errors, but I would like to know why they occur.

Any help will be appreciated.

2 Answers

It seems that there is no easy solution for this.

As Mike Fotinakis explains in this blog (thank you for this info JasonG :)):

It turns out that there is a race condition between the Google Cloud HTTP(S) Load Balancer and NGINX’s default keep-alive timeout of 65 seconds. The NGINX timeout might be reached at the same time the load balancer tries to re-use the connection for another HTTP request, which breaks the connection and results in a 502 Bad Gateway response from the load balancer.

In my case I'm using Apache with the mpm_prefork module. The proposed solution is to increase the connection keep-alive timeout to 650s, but that is not practical here: under prefork each kept-alive connection occupies its own process, so a long timeout would waste a lot of resources.

It seems that there is now some new documentation about this problem on the official load balancer documentation page (search for "Timeouts and retries"): https://cloud.google.com/compute/docs/load-balancing/http/

They recommend setting the keep-alive timeout to 620 seconds in both cases (KeepAliveTimeout in Apache, keepalive_timeout in nginx).
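As a rough sketch of what that recommendation looks like for Apache, the fragment below sets the timeout to 620 seconds. It assumes a stock httpd install; which file it belongs in depends on your distribution. The idea, as I understand it, is that the backend should hold idle connections longer than the load balancer does (600 seconds), so the backend is never the side that closes first.

```apache
# Sketch only: keep idle backend connections open longer than the load
# balancer's 600-second keep-alive timeout.
KeepAlive On
KeepAliveTimeout 620
```

The nginx counterpart would be `keepalive_timeout 620s;` in the `http` block. With mpm_prefork the caveat above still applies, since every idle kept-alive connection holds a process for the whole timeout.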

I had an issue with 502s that I could not explain after recreating a load balancer and backend config. Recreating my backend and the instance group for unmanaged instances seemed to fix it for me. I wasn't able to identify any problems in my GCP configuration :(

But I had far more errors, about 1 in 10 requests. The load balancer logs will tell you what the cause is, and the docs explain each cause.

E.g. mine were: jsonPayload: { statusDetails: "failed_to_pick_backend" @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry" }
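If you have many entries, tallying the statusDetails values for the 502s shows quickly which cause dominates. This is only a minimal sketch: it assumes the log entries have been exported as one JSON object per line, with the status code under httpRequest.status and the cause under jsonPayload.statusDetails. The function name and the sample entries are made up for illustration.

```python
import json
from collections import Counter


def tally_status_details(lines, status=502):
    """Count statusDetails values among entries with the given HTTP status.

    Assumes one JSON log entry per line (hypothetical export format).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        entry = json.loads(line)
        if entry.get("httpRequest", {}).get("status") != status:
            continue  # only look at the status we care about
        details = entry.get("jsonPayload", {}).get("statusDetails", "unknown")
        counts[details] += 1
    return counts


# Made-up sample entries, shaped like the log entry quoted above.
sample = [
    '{"httpRequest": {"status": 502}, "jsonPayload": {"statusDetails": "failed_to_pick_backend"}}',
    '{"httpRequest": {"status": 200}, "jsonPayload": {"statusDetails": "response_sent_by_backend"}}',
    '{"httpRequest": {"status": 502}, "jsonPayload": {"statusDetails": "backend_connection_closed_before_data_sent_to_client"}}',
    '{"httpRequest": {"status": 502}, "jsonPayload": {"statusDetails": "failed_to_pick_backend"}}',
]

for details, n in tally_status_details(sample).most_common():
    print(n, details)
```

To run it against a real export, pass an open file instead of the sample list, e.g. `tally_status_details(open("lb-logs.jsonl"))`, where the file name is whatever you exported to.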

If you're using nginx, the errors occur on POST requests, and the error is reported as "backend_connection_closed_before_data_sent_to_client", it may be fixed by changing your nginx timeouts. See this excellent blog post:


