I've inherited a Node.js/Express app that is a bit of a mess. It regularly and fairly randomly gets stuck, not responding to any request until it is restarted.
I suspect that something within the app is blocking: either getting stuck in a loop, or making a request to an external API without proper async techniques, never getting a response and never timing out, at which point the server simply stops responding but doesn't crash.
I would obviously like to find the culprit code and fix the problem; in the meantime, however, I would like a way to automatically restart the server when it stops responding.
To test out solutions locally (since I don't currently know the actual culprit), I have created the following Express route, which simulates the exact behavior I'm getting.
app.get('/block-block-block', function (req, res) {
  // spin forever, blocking the event loop
  for (;;) {}
});
The question I have is: given that the above route is hit (which immediately stops the server from responding to anything), is there a way to detect the blockage within Node itself and restart or shut down? And if not, what is a good solution for checking when the server is not responding and restarting it?
Most of the searching I have done leads me to tools like forever and PM2. These work great if your app crashes, but I don't see any functionality for restarting when an app is randomly blocking.
I figured out how to solve this using native Node functionality. Migg's answer was good and led me in the right direction, but it still doesn't show how to automatically restart when the event loop is completely blocked.
The trick is to use Node's native child_process module and its fork method to start the server from another Node instance, and have that instance ping the server for responses, restarting it when it gets stuck. This is similar to how forever and PM2 work. It's hard to believe there isn't a simple way to do this with either of those libraries, but this is how you can do it natively.
I have commented this code heavily to point out what everything is doing. Also note that I am using ES2015 arrow functions; go read about them if you're not familiar.
var fork = require('child_process').fork;
var server, heartbeat;

function startServer() {
  console.log('Starting server');
  server = fork('server');

  // when the server goes down, restart it
  server.on('close', (code) => {
    startServer();
  });

  // when the server sends a heartbeat message, save it
  server.on('message', (message) => {
    heartbeat = message ? message.heartbeat : null;
  });

  // ask the server for a heartbeat
  server.send({request: 'heartbeat'});

  // wait 5 seconds and check whether the server responded
  setTimeout(checkHeartbeat, 5000);
}

function checkHeartbeat() {
  if (heartbeat) {
    console.log('Server is alive');
    // clear the heartbeat and request a new one
    heartbeat = null;
    server.send({request: 'heartbeat'});
    // schedule another heartbeat check
    setTimeout(checkHeartbeat, 5000);
  } else {
    console.log('Server looks stuck...killing');
    // killing the child fires its 'close' event, which restarts it
    server.kill();
  }
}

startServer();
Be sure to swap out 'server' (which resolves to server.js) for whatever Node app you want to run.
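Assuming you save the watchdog above as monitor.js next to server.js (the file name is just an example, not something from the original answer), you start everything through the monitor rather than the server itself:
node monitor.js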
Now, in your server, add the following to respond to the heartbeat request.
// listen and respond to heartbeat requests from the parent
process.on('message', (message) => {
  if (message && message.request === 'heartbeat') {
    process.send({heartbeat: 'thump'});
  }
});
Finally, add a timeout to test that it works (not for production!):
// block the event loop after 30 seconds
setTimeout(() => {
  for (;;) {}
}, 30000);
First of all, you should try to find the problems in the code by reviewing it.
For the running app you should use pm2. It has a setting to restart the app when it consumes too much memory. Directly from the docs:
pm2 start big-array.js --max-memory-restart 20M
Or using an ecosystem.json:
{
  "max_memory_restart": "20M"
}
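In a full ecosystem file that key sits alongside the app definition. A minimal sketch, where the app name and script path are placeholders for your own setup:
{
  "apps": [{
    "name": "server",
    "script": "server.js",
    "max_memory_restart": "20M"
  }]
}
You would then start it with pm2 start ecosystem.json.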
There are also several great articles about debugging memory leaks in Node.js to be found online, and there is even a module that reports leaks, which we used in the early days. The subject is too big to cover in full here.
You can instrument your app to report the responsiveness of the event loop, so that if some code blocks the loop for too long you can programmatically terminate the process. You will have to look at process.nextTick.
You can introduce a measurement that, for example, calls process.nextTick every X seconds, and if it then takes more than some defined time, call process.exit(1) to terminate the process and let pm2 restart it.
The upside of this is that your app runs most of the time. The downside is that all users with open connections get no answer when process.exit is called.
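A minimal sketch of such a measurement; it uses a timer's drift rather than process.nextTick itself to approximate event-loop lag, and the interval and threshold values are arbitrary assumptions, not taken from the answer. Note that a loop that is blocked forever (as in the question's for(;;) route) never gets back to this check at all, which is why an external watchdog is still useful.
// exit the process if the event loop falls too far behind, so pm2 can restart it
var CHECK_INTERVAL_MS = 5000;  // how often to measure (assumed value)
var MAX_LAG_MS = 1000;         // tolerated delay before giving up (assumed value)

var expected = Date.now() + CHECK_INTERVAL_MS;

setInterval(function () {
  // how much later than scheduled did this callback actually run?
  var lag = Date.now() - expected;
  expected = Date.now() + CHECK_INTERVAL_MS;

  if (lag > MAX_LAG_MS) {
    console.error('Event loop lagged by ' + lag + 'ms, exiting for restart');
    process.exit(1);
  }
}, CHECK_INTERVAL_MS);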
To find memory leaks and other problems in running code, you should dive into https://www.joyent.com/developers/node/debug. There is a whole section about MDB that will help you find the problems, though it takes some time to get used to. All of this is too much information to cover here, hence the link.
Best of luck with your app!
I have run into this problem once or twice, and the answer has always been to hand-roll a standalone monitoring service that sends requests to an endpoint at regular intervals. After so many failed or timed-out requests, the service restarts the server.
It's not without drawbacks, however. The most obvious is that your application has to fail or hit some threshold before being restarted, which means it could be down in production for minutes or even hours, depending on your thresholds. The alternative, though, is to wait for consumers of the application to start complaining, which probably sucks worse since they are most likely your customers.
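For illustration, a rough sketch of such a monitor in plain Node; the health endpoint URL, the timeout, the failure limit, the polling interval, and the pm2 restart command are all placeholder assumptions you would adapt to your own setup.
var http = require('http');
var exec = require('child_process').exec;

var TARGET = 'http://localhost:3000/health'; // assumed health endpoint
var TIMEOUT_MS = 5000;                       // per-request timeout (assumed)
var FAILURE_LIMIT = 3;                       // restart after this many misses (assumed)

var failures = 0;

function ping() {
  var answered = false;

  var req = http.get(TARGET, function (res) {
    res.resume();      // drain the body; any response counts as alive
    answered = true;
    failures = 0;
  });

  // give up on requests that hang (the blocked-event-loop case)
  req.setTimeout(TIMEOUT_MS, function () {
    req.destroy();
  });

  // swallow connection errors; failures are counted below via 'close'
  req.on('error', function () {});

  req.on('close', function () {
    if (answered) return;
    failures++;
    if (failures >= FAILURE_LIMIT) {
      console.log('Server unresponsive, restarting');
      exec('pm2 restart server'); // or however you restart your app
      failures = 0;
    }
  });
}

setInterval(ping, 10000); // poll every 10 seconds (assumed interval)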