I'm running a Rails 3.2.21 app deployed to an Ubuntu 12.04.5 box with Capistrano (nginx and Unicorn).
I have my app set up for zero-downtime deploys (or at least I thought I did), with my config files looking more or less like these.
Here's the problem: when the deploy is nearly done and unicorn restarts, I watch unicorn.log and see it fire up the new workers and reap the old ones... but then my app just hangs for 2-3 minutes. Any request during that window hits the timeout (which I set to 40 seconds) and returns my app's 500 error page.
Here is the first part of the output from unicorn.log as unicorn is restarting (I have 5 unicorn workers):
I, [2015-04-21T23:06:57.022492 #14347] INFO -- : master process ready
I, [2015-04-21T23:06:57.844273 #15378] INFO -- : worker=0 ready
I, [2015-04-21T23:06:57.944080 #15381] INFO -- : worker=1 ready
I, [2015-04-21T23:06:58.089655 #15390] INFO -- : worker=2 ready
I, [2015-04-21T23:06:58.230554 #14541] INFO -- : reaped #<Process::Status: pid 15551 exit 0> worker=4
I, [2015-04-21T23:06:58.231455 #14541] INFO -- : reaped #<Process::Status: pid 3644 exit 0> worker=0
I, [2015-04-21T23:06:58.249110 #15393] INFO -- : worker=3 ready
I, [2015-04-21T23:06:58.650007 #15396] INFO -- : worker=4 ready
I, [2015-04-21T23:07:01.246981 #14541] INFO -- : reaped #<Process::Status: pid 32645 exit 0> worker=1
I, [2015-04-21T23:07:01.561786 #14541] INFO -- : reaped #<Process::Status: pid 15534 exit 0> worker=2
I, [2015-04-21T23:07:06.657913 #14541] INFO -- : reaped #<Process::Status: pid 16821 exit 0> worker=3
I, [2015-04-21T23:07:06.658325 #14541] INFO -- : master complete
Afterwards, as the app hangs for those 2-3 minutes, here is what's happening:
E, [2015-04-21T23:07:38.069635 #14347] ERROR -- : worker=0 PID:15378 timeout (41s > 40s), killing
E, [2015-04-21T23:07:38.243005 #14347] ERROR -- : reaped #<Process::Status: pid 15378 SIGKILL (signal 9)> worker=0
E, [2015-04-21T23:07:39.647717 #14347] ERROR -- : worker=3 PID:15393 timeout (41s > 40s), killing
E, [2015-04-21T23:07:39.890543 #14347] ERROR -- : reaped #<Process::Status: pid 15393 SIGKILL (signal 9)> worker=3
I, [2015-04-21T23:07:40.727755 #16002] INFO -- : worker=0 ready
I, [2015-04-21T23:07:43.212395 #16022] INFO -- : worker=3 ready
E, [2015-04-21T23:08:24.511967 #14347] ERROR -- : worker=3 PID:16022 timeout (41s > 40s), killing
E, [2015-04-21T23:08:24.718512 #14347] ERROR -- : reaped #<Process::Status: pid 16022 SIGKILL (signal 9)> worker=3
I, [2015-04-21T23:08:28.010429 #16234] INFO -- : worker=3 ready
Eventually, after 2 or 3 minutes, the app starts being responsive again, but everything is more sluggish. You can see this very clearly in New Relic (the horizontal line marks the deploy, and the light blue area indicates Ruby):
I have an identical staging server, and I cannot replicate the issue in staging... granted, staging is under no load (I'm the only person trying to make page requests).
Here is my config/unicorn.rb file:
root = "/home/deployer/apps/myawesomeapp/current"
working_directory root
pid "#{root}/tmp/pids/unicorn.pid"
stderr_path "#{root}/log/unicorn.log"
stdout_path "#{root}/log/unicorn.log"
shared_path = "/home/deployer/apps/myawesomeapp/shared"
listen "/tmp/unicorn.myawesomeapp.sock"
worker_processes 5
timeout 40
preload_app true
before_exec do |server|
  ENV['BUNDLE_GEMFILE'] = "#{root}/Gemfile"
end

before_fork do |server, worker|
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
  end

  old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin"
  if File.exists?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
    end
  end
end

after_fork do |server, worker|
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.establish_connection
  end
end
And just to paint a complete picture, in my capistrano deploy.rb, the unicorn restart task looks like this:
namespace :deploy do
  task :restart, roles: :app, except: { no_release: true } do
    run "kill -s USR2 `cat #{release_path}/tmp/pids/unicorn.pid`"
  end
end
Any ideas why the unicorn workers time out right after the deploy? I thought the point of a zero-downtime deploy was to keep the old workers around until the new ones are spun up and ready to serve?
Thanks!
UPDATE
I did another deploy, and this time kept an eye on production.log to see what was going on there. The only suspicious thing was the following lines, which were mixed in with normal requests:
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
Dalli/SASL authenticating as 7510de
Dalli/SASL: 7510de
UPDATE #2
As suggested by some of the answers below, I changed the before_fork block to add sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU so the workers would be incrementally killed off.
Same result, terribly slow deploy, with the same spike I illustrated in the graph above. Just for context, out of my 5 worker processes, the first 4 sent a TTOU signal, and the 5th sent QUIT. Still, does not seem to have made a difference.
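For reference, the signal-selection logic from that one-liner can be sketched in isolation (a standalone snippet, not tied to a running Unicorn master), which shows why 4 of the 5 workers send TTOU and only the last sends QUIT:

```ruby
# Standalone sketch of the TTOU/QUIT selection used in the before_fork block.
# As each new worker forks, it signals the old master: TTOU tells it to drop
# one worker; the final new worker sends QUIT to shut the old master down.
worker_processes = 5
signals = (0...worker_processes).map do |nr|
  (nr + 1) >= worker_processes ? :QUIT : :TTOU
end
p signals  # => [:TTOU, :TTOU, :TTOU, :TTOU, :QUIT]
```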
I came across a similar problem recently while trying to set up Rails/Nginx/Unicorn on DigitalOcean. I was able to get zero-downtime deploys to work after tweaking a few things. Here are a few things to try:
- Run ps aux | grep unicorn to make sure the old processes are being removed.
- Use kill [pid] to manually stop any unicorn processes still running.
Here's my unicorn config for reference:
working_directory '/var/www/yourapp/current'
pid '/var/www/yourapp/current/tmp/pids/unicorn.pid'
stderr_path '/var/www/yourapp/log/unicorn.log'
stdout_path '/var/www/yourapp/log/unicorn.log'
listen '/tmp/unicorn.yourapp.sock'
worker_processes 2
timeout 30
preload_app true
before_fork do |server, worker|
  old_pid = "/var/www/yourapp/current/tmp/pids/unicorn.pid.oldbin"
  if old_pid != server.pid
    begin
      sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU
      Process.kill(sig, File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
    end
  end
end
deploy.rb
lock '3.4.0'
set :application, 'yourapp'
set :repo_url, '[email protected]:username/yourapp.git'
set :deploy_to, '/var/www/yourapp'
set :linked_files, fetch(:linked_files, []).push('config/database.yml', 'config/secrets.yml', 'config/application.yml')
set :linked_dirs, fetch(:linked_dirs, []).push('log', 'tmp/pids', 'tmp/cache', 'tmp/sockets', 'vendor/bundle', 'public/system')
set :format, :pretty
set :log_level, :info
set :rbenv_ruby, '2.1.3'
namespace :deploy do
  after :restart, :clear_cache do
    on roles(:web), in: :groups, limit: 3, wait: 10 do
    end
  end
end

after 'deploy:publishing', 'deploy:restart'

namespace :deploy do
  task :restart do
    # invoke 'unicorn:reload'
    invoke 'unicorn:restart'
  end
end
Are you vendoring unicorn and having cap run a bundle install on deploy? If so, this could be an executable-path issue.
When you do a Capistrano deploy, cap creates a new release directory for your revision and moves the current symlink to point to the new release. If you haven't told the running unicorn to gracefully update the path to its executable, adding this line should fix it:
Unicorn::HttpServer::START_CTX[0] = ::File.join(ENV['GEM_HOME'].gsub(/releases\/[^\/]+/, "current"),'bin','unicorn')
You can find some more information here. I think the before_fork block you have looks good, but I would also add the sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU line from @travisluong's answer; that will incrementally kill off the old workers as the new ones spawn. Incidentally, I would not remove preload_app true, as it greatly improves worker spawn time.
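To make the placement concrete, the START_CTX line goes near the top of config/unicorn.rb, before the hooks. This is only a sketch: the GEM_HOME-based path rewrite assumes a bundler/rbenv layout where the unicorn binstub lives under the release directory, and the paths are borrowed from the question; adjust both to your deployment.

```ruby
# config/unicorn.rb (sketch -- paths and GEM_HOME layout are assumptions)
# Rewrite the master's re-exec path to go through the "current" symlink,
# so a USR2 restart picks up the newly deployed release's unicorn binary
# instead of the (soon to be deleted) old release directory.
Unicorn::HttpServer::START_CTX[0] =
  ::File.join(ENV['GEM_HOME'].gsub(/releases\/[^\/]+/, "current"), 'bin', 'unicorn')

root = "/home/deployer/apps/myawesomeapp/current"
working_directory root
preload_app true
# ... remaining settings and fork hooks as in the question ...
```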