 

Unicorn workers timeout after "zero downtime" deploy with capistrano

I'm running a Rails 3.2.21 app and deploy to an Ubuntu 12.04.5 box using capistrano (nginx and unicorn).

I have my app set for a zero-downtime deploy (at least I thought), with my config files looking more or less like these.

Here's the problem: when the deploy is nearly done and unicorn restarts, I can watch unicorn.log as it fires up the new workers and reaps the old ones... but then my app just hangs for 2-3 minutes. Any request during that window hits the timeout (which I set to 40 seconds) and returns my app's 500 error page.

Here is the first part of the output from unicorn.log as unicorn is restarting (I have 5 unicorn workers):

I, [2015-04-21T23:06:57.022492 #14347]  INFO -- : master process ready
I, [2015-04-21T23:06:57.844273 #15378]  INFO -- : worker=0 ready
I, [2015-04-21T23:06:57.944080 #15381]  INFO -- : worker=1 ready
I, [2015-04-21T23:06:58.089655 #15390]  INFO -- : worker=2 ready
I, [2015-04-21T23:06:58.230554 #14541]  INFO -- : reaped #<Process::Status: pid 15551 exit 0> worker=4
I, [2015-04-21T23:06:58.231455 #14541]  INFO -- : reaped #<Process::Status: pid 3644 exit 0> worker=0
I, [2015-04-21T23:06:58.249110 #15393]  INFO -- : worker=3 ready
I, [2015-04-21T23:06:58.650007 #15396]  INFO -- : worker=4 ready
I, [2015-04-21T23:07:01.246981 #14541]  INFO -- : reaped #<Process::Status: pid 32645 exit 0> worker=1
I, [2015-04-21T23:07:01.561786 #14541]  INFO -- : reaped #<Process::Status: pid 15534 exit 0> worker=2
I, [2015-04-21T23:07:06.657913 #14541]  INFO -- : reaped #<Process::Status: pid 16821 exit 0> worker=3
I, [2015-04-21T23:07:06.658325 #14541]  INFO -- : master complete

Afterwards, as the app hangs for those 2-3 minutes, here is what's happening:

E, [2015-04-21T23:07:38.069635 #14347] ERROR -- : worker=0 PID:15378 timeout (41s > 40s), killing
E, [2015-04-21T23:07:38.243005 #14347] ERROR -- : reaped #<Process::Status: pid 15378 SIGKILL (signal 9)> worker=0
E, [2015-04-21T23:07:39.647717 #14347] ERROR -- : worker=3 PID:15393 timeout (41s > 40s), killing
E, [2015-04-21T23:07:39.890543 #14347] ERROR -- : reaped #<Process::Status: pid 15393 SIGKILL (signal 9)> worker=3
I, [2015-04-21T23:07:40.727755 #16002]  INFO -- : worker=0 ready
I, [2015-04-21T23:07:43.212395 #16022]  INFO -- : worker=3 ready
E, [2015-04-21T23:08:24.511967 #14347] ERROR -- : worker=3 PID:16022 timeout (41s > 40s), killing
E, [2015-04-21T23:08:24.718512 #14347] ERROR -- : reaped #<Process::Status: pid 16022 SIGKILL (signal 9)> worker=3
I, [2015-04-21T23:08:28.010429 #16234]  INFO -- : worker=3 ready

Eventually, after 2 or 3 minutes, the app starts being responsive again, but everything is more sluggish. You can see this very clearly in New Relic (the horizontal line marks the deploy, and the light blue area indicates Ruby):

[New Relic graph during and after deploy]

I have an identical staging server, and I cannot replicate the issue in staging... granted, staging is under no load (I'm the only person trying to make page requests).

Here is my config/unicorn.rb file:

root = "/home/deployer/apps/myawesomeapp/current"
working_directory root

pid "#{root}/tmp/pids/unicorn.pid"
stderr_path "#{root}/log/unicorn.log"
stdout_path "#{root}/log/unicorn.log"

shared_path = "/home/deployer/apps/myawesomeapp/shared"

listen "/tmp/unicorn.myawesomeapp.sock"
worker_processes 5
timeout 40

preload_app true

before_exec do |server|
  ENV['BUNDLE_GEMFILE'] = "#{root}/Gemfile"
end

before_fork do |server, worker|
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
  end

  old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin"
  if File.exist?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH

    end
  end
end

after_fork do |server, worker|
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.establish_connection
  end
end

And just to paint a complete picture, in my capistrano deploy.rb, the unicorn restart task looks like this:

namespace :deploy do
  task :restart, roles: :app, except: { no_release: true } do
    run "kill -s USR2 `cat #{release_path}/tmp/pids/unicorn.pid`"
  end
end
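For anyone unfamiliar with the signal dance that task kicks off: USR2 asks the running master to re-exec a new master from the updated code, and the before_fork hook above then QUITs the old master as the new workers come up. Here is a rough sketch of the plumbing, using a sleeping process as a stand-in for the unicorn master (a real master re-execs on USR2; a plain process just dies, so this only demonstrates the pid-file mechanics):

```shell
# Stand-in for unicorn.pid: spawn a process and record its pid.
pid_file=$(mktemp)
sleep 60 &
echo $! > "$pid_file"

# Equivalent of the capistrano restart task: signal the pid on file.
kill -USR2 "$(cat "$pid_file")"

# A real unicorn master would now rename its pid file to unicorn.pid.oldbin
# and start a new master; the before_fork hook later QUITs the pid read
# from that .oldbin file once each new worker is ready.
wait 2>/dev/null || true
echo "old master reaped"
```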

Any ideas why the unicorn workers time out right after the deploy? I thought the point of a zero-downtime deploy was to keep the old workers around until the new ones are spun up and ready to serve?

Thanks!

UPDATE

I did another deploy, and this time kept an eye on production.log to see what was going on there. The only suspicious thing was the following lines, which were mixed in with normal requests:

Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de

UPDATE #2

As suggested by some of the answers below, I changed the before_fork block to add sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU so the workers would be killed off incrementally.

Same result: a terribly slow deploy, with the same spike I illustrated in the graph above. For context, out of my 5 worker processes, the hook sent TTOU for the first 4 and QUIT for the 5th. Still, it does not seem to have made a difference.
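To make the TTOU/QUIT split concrete, here is a quick sketch (plain Ruby, no unicorn required) of which signal that one-liner picks for each new worker, given that worker.nr is zero-based:

```ruby
# With 5 workers, the hook sends TTOU for workers 0-3 (the old master
# drops one worker each time) and QUIT for worker 4 (the old master
# shuts down entirely once the last new worker is forking).
worker_processes = 5
signals = (0...worker_processes).map do |nr|
  (nr + 1) >= worker_processes ? :QUIT : :TTOU
end
p signals  # => [:TTOU, :TTOU, :TTOU, :TTOU, :QUIT]
```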

asked Apr 22 '15 by DelPiero


2 Answers

I came across a similar problem recently while trying to set up Rails/Nginx/Unicorn on Digital Ocean. I was able to get zero-downtime deploys to work after tweaking a few things. Here are a few things to try:

  1. Reduce the number of worker processes.
  2. Increase the memory of your server. I was getting timeouts on a 512MB RAM droplet; increasing it to 1GB seemed to fix the issue.
  3. Use the "capistrano3-unicorn" gem.
  4. If preload_app is true, use restart (USR2). If it's false, use reload (HUP).
  5. Ensure "tmp/pids" is included in linked_dirs in deploy.rb.
  6. Use ps aux | grep unicorn to make sure the old processes are being removed.
  7. Use kill [pid] to manually stop any unicorn processes still running.
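For steps 6 and 7, something like this helps spot and stop a stale master (the pid-file path is hypothetical; the [u] in the grep pattern keeps grep from matching its own command line):

```shell
# List any unicorn masters still running; prints a note if there are none.
ps aux | grep '[u]nicorn master' || echo "no unicorn masters running"

# A lingering old master keeps a pid file ending in .oldbin; QUIT it
# gracefully rather than kill -9 (path below is hypothetical):
# kill -QUIT "$(cat /var/www/yourapp/current/tmp/pids/unicorn.pid.oldbin)"
```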

Here's my unicorn config for reference:

working_directory '/var/www/yourapp/current'
pid '/var/www/yourapp/current/tmp/pids/unicorn.pid'
stderr_path '/var/www/yourapp/log/unicorn.log'
stdout_path '/var/www/yourapp/log/unicorn.log'
listen '/tmp/unicorn.yourapp.sock'
worker_processes 2
timeout 30
preload_app true

before_fork do |server, worker|
  old_pid = "/var/www/yourapp/current/tmp/pids/unicorn.pid.oldbin"
  if old_pid != server.pid
    begin
      sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU
      Process.kill(sig, File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
    end
  end
end

deploy.rb

lock '3.4.0'

set :application, 'yourapp'
set :repo_url, '[email protected]:username/yourapp.git'
set :deploy_to, '/var/www/yourapp'
set :linked_files, fetch(:linked_files, []).push('config/database.yml', 'config/secrets.yml', 'config/application.yml')
set :linked_dirs, fetch(:linked_dirs, []).push('log', 'tmp/pids', 'tmp/cache', 'tmp/sockets', 'vendor/bundle', 'public/system')
set :format, :pretty
set :log_level, :info
set :rbenv_ruby, '2.1.3'

namespace :deploy do
  after :restart, :clear_cache do
    on roles(:web), in: :groups, limit: 3, wait: 10 do
    end
  end
end

after 'deploy:publishing', 'deploy:restart'
namespace :deploy do
  task :restart do
    #invoke 'unicorn:reload'
    invoke 'unicorn:restart'
  end
end
answered Oct 11 '22 by travisluong


Are you vendoring unicorn and having cap run a bundle install on deploy? If so, this could be an executable path issue.

When you do a Capistrano deploy, cap creates a new release directory for your revision and moves the current symlink to point to the new release. If you haven't told the running unicorn master to update the path to its executable on re-exec, it will keep launching from the old release directory; adding this line should fix it:

Unicorn::HttpServer::START_CTX[0] = ::File.join(ENV['GEM_HOME'].gsub(/releases\/[^\/]+/, "current"),'bin','unicorn')

You can find some more information here. I think the before_fork block you have looks good, but I would add the sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU line from @travisluong's answer as well; that will incrementally kill off the workers as the new ones spawn.

I would not remove preload_app true, incidentally, as it greatly improves worker spawn time.
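To see what that START_CTX line actually computes, here is the path rewrite in isolation (the GEM_HOME value below is hypothetical, patterned after the question's deploy layout):

```ruby
# gsub swaps the release-specific directory for the "current" symlink,
# so the master's re-exec after USR2 picks up the freshly deployed
# unicorn binary instead of the one in the old release.
gem_home = '/home/deployer/apps/myawesomeapp/releases/20150421/vendor/bundle'
unicorn_bin = File.join(gem_home.gsub(/releases\/[^\/]+/, 'current'),
                        'bin', 'unicorn')
puts unicorn_bin
# => /home/deployer/apps/myawesomeapp/current/vendor/bundle/bin/unicorn
```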

answered Oct 11 '22 by Marcus Walser