
Node.js + Cluster :: Restarting Workers Without Downtime?

Tags:

node.js

For reasons I'll breeze over here, I want each worker started by cluster (in Node.js) to live for one hour and then restart itself.

The caveat is that I need to have zero downtime. Thus, simply executing a destroy() on each worker is not acceptable, as it takes down the cluster until the workers are restarted.

Here's my base code:

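var cluster = require('cluster');

if (cluster.isMaster) {
    // Master: fork two workers, then stop; only workers run the API below.
    for (var i = 0; i < 2; i++) {
        cluster.fork();
    }
    return;
}

// Worker: start the API. `settings` is defined elsewhere in the application.
require('./api').startup(settings, process.argv, function(error, api) {
    if (error) {
        console.log('API failed to start: ' + error);
    } else {
        console.log('API is running');
    }
});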

The api.js script implements express to start a pretty standard RESTful JSON API.

Zane Claes asked Oct 16 '12 19:10


1 Answer

The way I ended up doing this was to make sure I had at least 2 workers running, and then only restart one at a time.

This bit of code automatically forks a replacement for any worker that exits voluntarily ("commits suicide") via cluster.worker.destroy():

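cluster.on('exit', function(worker, code, signal) {
  // worker.suicide was deprecated in favor of worker.exitedAfterDisconnect
  // in Node.js 6 and later removed; checking both covers old and new versions.
  if (worker.exitedAfterDisconnect === true || worker.suicide === true) {
    console.log(new Date() + ' Worker committed suicide');
    cluster.fork();
  }
});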

From there, it is a simple matter of making each worker commit suicide via a setTimeout() (or whatever other condition you wish to employ). My approach was actually to have the master kill the workers:

function killWorker(worker)
{
    return function() {
        worker.destroy();  
    };
}

// This should be run on cluster.isMaster only
function killWorkers()
{
    var delay = 0;
    for (var id in cluster.workers) {
        var func = killWorker(cluster.workers[id]);
        if(delay==0)
            func();
        else
            setTimeout(func, delay);
        delay += 60000 * 5;// 5 minute delay, inserted to give time for each worker to re-spool itself
    }
}

As you can see, this inserts a 5 minute delay between restarting workers, thus giving each worker plenty of time to restart itself -- meaning that there should never be a case where all workers are down.
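
One gap in the approach above: destroy() kills a worker immediately, so any requests in flight on that worker are cut off, and the cluster runs one worker short until the replacement is up. A variation is to fork the replacement first, wait for its 'listening' event, and only then call disconnect() on the old worker, which closes its servers and lets open connections finish (long-lived keep-alive connections may delay the exit). Below is a minimal sketch of that idea, not a drop-in implementation. The names replaceWorker and rollingRestart are made up for illustration, and the cluster object is passed in as a parameter (clusterApi) only so the functions can be exercised with a stub. It assumes each worker actually calls listen(), as the express API does here. With this variation, the 'exit' handler shown earlier should not also fork on voluntary exits, or the worker count will grow on every cycle.

```javascript
// Hypothetical helper: replace one worker without a capacity gap.
// clusterApi is expected to behave like require('cluster') in the master.
function replaceWorker(oldWorker, clusterApi, done) {
  // Fork the replacement first, so the old worker keeps serving meanwhile.
  const replacement = clusterApi.fork();

  // 'listening' fires once the new worker has called listen() on a server.
  replacement.once('listening', function () {
    // disconnect() stops the old worker from accepting new connections
    // and lets it exit once its existing connections have closed.
    oldWorker.disconnect();
    if (done) done(replacement);
  });

  return replacement;
}

// Hypothetical helper: restart every current worker, one after another.
function rollingRestart(clusterApi, done) {
  // Snapshot the current workers so replacements are not restarted too.
  const pending = Object.keys(clusterApi.workers).map(function (id) {
    return clusterApi.workers[id];
  });

  (function next() {
    const worker = pending.shift();
    if (!worker) {
      if (done) done();
      return;
    }
    // Move on to the next worker only after this replacement is listening.
    replaceWorker(worker, clusterApi, next);
  })();
}
```

In the master, this could be scheduled with something like setInterval(function () { rollingRestart(cluster); }, 60 * 60 * 1000); to get the one-hour lifetime from the question.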

Zane Claes answered Nov 04 '22 16:11