I've got a long running node.js process that sometimes jumps to 100% CPU and stops responding to requests. The most recent time it did this I attached strace
to the process, and here's what I saw:
Process 19523 attached - interrupt to quit
gettimeofday({1394558574, 234192}, NULL) = 0
gettimeofday({1394558574, 235381}, NULL) = 0
gettimeofday({1394558574, 306460}, NULL) = 0
mmap(0x3edab823a000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3edab823a000
munmap(0x3edab823a000, 811008) = 0
munmap(0x3edab8400000, 237568) = 0
mmap(0x3edab8300000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3edab8300000
gettimeofday({1394558574, 316435}, NULL) = 0
gettimeofday({1394558574, 316576}, NULL) = 0
gettimeofday({1394558574, 316677}, NULL) = 0
gettimeofday({1394558574, 316775}, NULL) = 0
gettimeofday({1394558574, 316879}, NULL) = 0
gettimeofday({1394558574, 316978}, NULL) = 0
gettimeofday({1394558574, 317439}, NULL) = 0
gettimeofday({1394558574, 317537}, NULL) = 0
gettimeofday({1394558574, 318139}, NULL) = 0
gettimeofday({1394558574, 318234}, NULL) = 0
gettimeofday({1394558574, 318343}, NULL) = 0
gettimeofday({1394558574, 318437}, NULL) = 0
gettimeofday({1394558574, 318530}, NULL) = 0
gettimeofday({1394558574, 318624}, NULL) = 0
gettimeofday({1394558574, 319135}, NULL) = 0
gettimeofday({1394558574, 319648}, NULL) = 0
gettimeofday({1394558574, 319769}, NULL) = 0
gettimeofday({1394558574, 319975}, NULL) = 0
futex(0x7f5b380008c8, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1394558574, 322266}, NULL) = 0
gettimeofday({1394558574, 322426}, NULL) = 0
gettimeofday({1394558574, 322520}, NULL) = 0
gettimeofday({1394558574, 322759}, NULL) = 0
gettimeofday({1394558574, 322853}, NULL) = 0
gettimeofday({1394558574, 322995}, NULL) = 0
futex(0x7f5b380008c8, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1394558574, 417614}, NULL) = 0
gettimeofday({1394558575, 386566}, NULL) = 0
gettimeofday({1394558575, 387704}, NULL) = 0
gettimeofday({1394558575, 463410}, NULL) = 0
mmap(0x24cfd260f000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x24cfd260f000
munmap(0x24cfd260f000, 987136) = 0
munmap(0x24cfd2800000, 61440) = 0
mmap(0x24cfd2700000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x24cfd2700000
That's a lot of gettimeofday
calls, and not much else! What might cause node.js to get stuck like this?
UPDATE: I upgraded from an older version of node.js to 10.29 (I think) and this went away. I've just upgraded to 10.33 and the problem has come back. I've made a little more progress on debugging it this time out. First strace:
$ sudo strace -p 11812 -c
Process 11812 attached - interrupt to quit
Process 11812 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.000192 0 2780 gettimeofday
0.00 0.000000 0 1390 getrusage
0.00 0.000000 0 31 futex
0.00 0.000000 0 1390 clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00 0.000192 5591 total
From Node.js: How to attach to a running process and to debug the server with a console? I found out about attaching the node debugger to a running process, and here's where I got with that:
$ sudo kill -s SIGUSR1 11812
$ sudo node debug -p 11812
connecting... ok
break in timers.js:79
77 }
78
79 function listOnTimeout() {
80 var msecs = this.msecs;
81 var list = this;
debug> bt
#0 timers.js:79:23
So it looks like it is definitely related to a timer, but I don't know how to move further up the stack to find out where in my code the problem is getting triggered.
My guess is that someone manually implemented a blocking "setTimeout". This could happen if someone doesn't want to release control of the main JS thread in an effort to prevent a potential race condition.
For obvious reasons this is a bad practice for production code, but I've used it on occasion in debugging to force asynchronous processes to execute in a specific order.
You could look for something silly like this:
var mockAsyncFunction = function (cb) {
setTimeout(function () {
cb(null, 'dummy_result')
}, 5000);
};
var myResult = null;
mockAsyncFunction(function (err, result) {
myResult = result;
});
var timeOut = 10000; // timeout in 10 sec.
var timeStart = new Date().getTime();
while (1) {
if (new Date().getTime() - timeStart > 10000) {
break;
}
}
console.log('DONE');
or something more nefarious with nextTick recursion like:
var timeStart = new Date().getTime();
var recurseUntilDone = function () {
if (new Date().getTime() - timeStart < 10000) {
process.nextTick(recurseUntilDone);
} else {
console.log('Done recursing');
}
};
recurseUntilDone();
Use node-inspector to be able to pause your code when the CPU is 100% - my bet is also on some badly implemented loop checking state until a certain time has passed, but they can be hard to find.
Attach the debugger with --debug when starting node (ie. node index.js --debug
) and in a separate window run node-inspector
. Use Chrome to connect to your debug session (the url is outputted from node-inspector command) and wait until the problem occurs and pause the execution and you should be able to find it.
We saw this too, in both production and development. Also looking for an answer. We started investigating the node run-time, but the problem is so infrequent it's not getting any priority. Because it's completely cpu bound, no system calls, it's hard capture with an strace.
It won't be timed loop checking state, because a new Date().getTime() loop does not make any calls to gettimeofday (in node v0.10.29; it just does a long series of nanosleeps in one pid and just futex calls in the other. Clever, actually). Same for Date.now().
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With