
Bad gateway errors at load on nginx + Unicorn (Rails 3 app)

I have a Rails (3.2) app that runs on nginx and unicorn on a cloud platform. The "box" runs Ubuntu 12.04.

When the system load is at about 70% or above, nginx abruptly (and seemingly randomly) starts throwing 502 Bad Gateway errors; below that load there is nothing of the sort. I have experimented with various numbers of cores (4, 6, 10 - I can "change hardware" since it's a cloud platform), and the situation is always the same. (CPU load is similar to system load; userland is around 55%, the rest being system and stolen, with plenty of free memory and no swapping.)

502s usually come in batches, but not always.

(I run one unicorn worker per core, and one or two nginx workers. See the relevant parts of the configs below when running on 10 cores.)

I don't really know how to track down the cause of these errors. I suspect it may have something to do with the unicorn workers not being able to serve requests (in time?), but that looks odd because they do not seem to saturate the CPU, and I see no reason why they would be waiting on IO (though I don't know how to make sure of that either).

Can you please help me work out how to go about finding the cause?
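
(For context, the sort of generic checks one could run here - these are standard Linux tools, mentioned only as a starting point:)

vmstat 1      # run queue length, CPU breakdown and swap activity under load
iostat -x 1   # per-device utilization, to see whether the workers are waiting on IO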


Unicorn config (unicorn.rb):

worker_processes 10
working_directory "/var/www/app/current"
listen "/var/www/app/current/tmp/sockets/unicorn.sock", :backlog => 64
listen 2007, :tcp_nopush => true
timeout 90
pid "/var/www/app/current/tmp/pids/unicorn.pid"
stderr_path "/var/www/app/shared/log/unicorn.stderr.log"
stdout_path "/var/www/app/shared/log/unicorn.stdout.log"
preload_app true
GC.respond_to?(:copy_on_write_friendly=) and
  GC.copy_on_write_friendly = true
check_client_connection false

before_fork do |server, worker|
  ... I believe the stuff here is irrelevant ...
end
after_fork do |server, worker|
  ... I believe the stuff here is irrelevant ...
end

And the nginx config:

/etc/nginx/nginx.conf:

worker_processes 2;
worker_rlimit_nofile 2048;
user www-data www-admin;
pid /var/run/nginx.pid;
error_log /var/log/nginx/nginx.error.log info;

events {
  worker_connections 2048;
  accept_mutex on; # "on" if nginx worker_processes > 1
  use epoll;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
    access_log  /var/log/nginx/access.log  main;
    # optimization efforts
    client_max_body_size        2m;
    client_body_buffer_size     128k;
    client_header_buffer_size   4k;
    large_client_header_buffers 10 4k;  # one for each core or one for each unicorn worker?
    client_body_temp_path       /tmp/nginx/client_body_temp;

    include /etc/nginx/conf.d/*.conf;
}

/etc/nginx/conf.d/app.conf:

sendfile on;
tcp_nopush on;
tcp_nodelay off;
gzip on;
gzip_http_version 1.0;
gzip_proxied any;
gzip_min_length 500;
gzip_disable "MSIE [1-6]\.";
gzip_types text/plain text/css text/javascript application/x-javascript;

upstream app_server {
  # fail_timeout=0 means we always retry an upstream even if it failed
  # to return a good HTTP response (in case the Unicorn master nukes a
  # single worker for timing out).
  server unix:/var/www/app/current/tmp/sockets/unicorn.sock fail_timeout=0;
}

server {
  listen 80 default deferred;
  server_name _;
  client_max_body_size 1G;
  keepalive_timeout 5;
  root /var/www/app/current/public;

  location ~ "^/assets/.*" {
      ...
  }

  # Prefer to serve static files directly from nginx to avoid unnecessary
  # data copies from the application server.
  try_files $uri/index.html $uri.html $uri @app;

  location @app {
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;
    proxy_redirect off;

    proxy_pass http://app_server;

    proxy_connect_timeout      90;
    proxy_send_timeout         90;
    proxy_read_timeout         90;

    proxy_buffer_size          128k;
    proxy_buffers              10 256k;  # one per core or one per unicorn worker?
    proxy_busy_buffers_size    256k;
    proxy_temp_file_write_size 256k;
    proxy_max_temp_file_size   512k;
    proxy_temp_path            /mnt/data/tmp/nginx/proxy_temp;

    open_file_cache max=1000 inactive=20s; 
    open_file_cache_valid    30s; 
    open_file_cache_min_uses 2;
    open_file_cache_errors   on;
  }
}
asked Mar 18 '13 by fastcatch


1 Answer

After googling for expressions found in the nginx error log, it turned out to be a known issue that has nothing to do with nginx, little to do with unicorn, and is rooted in OS (Linux) settings.
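
(As an illustration, the kind of search meant here - the log path is the one set in nginx.conf above, and the exact wording of the messages varies:)

grep upstream /var/log/nginx/nginx.error.log | tail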

The core of the problem is that the socket backlog is too short. There are various considerations as to how long it should be (whether you want to detect cluster member failure as soon as possible, or keep pushing the application to its load limits), but in any case the listen :backlog needs tweaking.
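
For illustration, the change amounts to raising :backlog on the Unix-socket listener in unicorn.rb, e.g.:

listen "/var/www/app/current/tmp/sockets/unicorn.sock", :backlog => 2048  # up from the original 64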

I found that in my case a listen ... :backlog => 2048 was sufficient. (I did not experiment much; there is a neat hack if you want to, though: have nginx and unicorn communicate over two sockets with different backlogs, the longer one being the backup, and then watch the nginx log to see how often the shorter queue fails.) Please note that this is not the result of a scientific calculation and YMMV.
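
(A minimal sketch of that hack; the name of the second socket is purely illustrative:)

unicorn.rb:

listen "/var/www/app/current/tmp/sockets/unicorn.sock", :backlog => 64        # short queue, overflows first
listen "/var/www/app/current/tmp/sockets/unicorn_long.sock", :backlog => 2048 # long queue, only used as backup

nginx upstream:

upstream app_server {
  server unix:/var/www/app/current/tmp/sockets/unicorn.sock fail_timeout=0;
  server unix:/var/www/app/current/tmp/sockets/unicorn_long.sock fail_timeout=0 backup;
}

Entries for the backup socket in the nginx error log then tell you how often the short queue overflowed.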

Note, however, that many OSes (most Linux distros, Ubuntu 12.04 included) have much lower OS-level default limits on socket backlog sizes (as low as 128).
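
(You can check the current values like this:)

sysctl net.core.somaxconn
sysctl net.core.netdev_max_backlog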

You can change the OS limits as follows (as root):

sysctl -w net.core.somaxconn=2048
sysctl -w net.core.netdev_max_backlog=2048

Add these to /etc/sysctl.conf to make the changes permanent. (/etc/sysctl.conf can be reloaded without rebooting with sysctl -p.)
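
For example, the corresponding entries in /etc/sysctl.conf look like this:

net.core.somaxconn = 2048
net.core.netdev_max_backlog = 2048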

There are mentions that you may also have to increase the maximum number of files a process can open (use ulimit -n, and /etc/security/limits.conf to make it permanent). I had already done that for other reasons, so I cannot tell whether it makes a difference or not.
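
(A sketch of what that could look like; the user name and the values here are illustrative, not taken from my setup:)

ulimit -n   # show the current per-process open-file limit

/etc/security/limits.conf:

www-data  soft  nofile  4096
www-data  hard  nofile  8192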

answered Oct 25 '22 by fastcatch