Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Node.js 100% CPU - gettimeofday call

I've got a long running node.js process that sometimes jumps to 100% CPU and stops responding to requests. The most recent time it did this I attached strace to the process, and here's what I saw:

Process 19523 attached - interrupt to quit
gettimeofday({1394558574, 234192}, NULL) = 0
gettimeofday({1394558574, 235381}, NULL) = 0
gettimeofday({1394558574, 306460}, NULL) = 0
mmap(0x3edab823a000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3edab823a000
munmap(0x3edab823a000, 811008)          = 0
munmap(0x3edab8400000, 237568)          = 0
mmap(0x3edab8300000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3edab8300000
gettimeofday({1394558574, 316435}, NULL) = 0
gettimeofday({1394558574, 316576}, NULL) = 0
gettimeofday({1394558574, 316677}, NULL) = 0
gettimeofday({1394558574, 316775}, NULL) = 0
gettimeofday({1394558574, 316879}, NULL) = 0
gettimeofday({1394558574, 316978}, NULL) = 0
gettimeofday({1394558574, 317439}, NULL) = 0
gettimeofday({1394558574, 317537}, NULL) = 0
gettimeofday({1394558574, 318139}, NULL) = 0
gettimeofday({1394558574, 318234}, NULL) = 0
gettimeofday({1394558574, 318343}, NULL) = 0
gettimeofday({1394558574, 318437}, NULL) = 0
gettimeofday({1394558574, 318530}, NULL) = 0
gettimeofday({1394558574, 318624}, NULL) = 0
gettimeofday({1394558574, 319135}, NULL) = 0
gettimeofday({1394558574, 319648}, NULL) = 0
gettimeofday({1394558574, 319769}, NULL) = 0
gettimeofday({1394558574, 319975}, NULL) = 0
futex(0x7f5b380008c8, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1394558574, 322266}, NULL) = 0
gettimeofday({1394558574, 322426}, NULL) = 0
gettimeofday({1394558574, 322520}, NULL) = 0
gettimeofday({1394558574, 322759}, NULL) = 0
gettimeofday({1394558574, 322853}, NULL) = 0
gettimeofday({1394558574, 322995}, NULL) = 0
futex(0x7f5b380008c8, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1394558574, 417614}, NULL) = 0
gettimeofday({1394558575, 386566}, NULL) = 0
gettimeofday({1394558575, 387704}, NULL) = 0
gettimeofday({1394558575, 463410}, NULL) = 0
mmap(0x24cfd260f000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x24cfd260f000
munmap(0x24cfd260f000, 987136)          = 0
munmap(0x24cfd2800000, 61440)           = 0
mmap(0x24cfd2700000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x24cfd2700000

That's a lot of gettimeofday calls, and not much else! What might cause node.js to get stuck like this?

UPDATE: I upgraded from an older version of node.js to 10.29 (I think) and this went away. I've just upgraded to 10.33 and the problem has come back. I've made a little more progress on debugging it this time out. First strace:

$ sudo strace -p 11812 -c
Process 11812 attached - interrupt to quit
Process 11812 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000192           0      2780           gettimeofday
  0.00    0.000000           0      1390           getrusage
  0.00    0.000000           0        31           futex
  0.00    0.000000           0      1390           clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00    0.000192                  5591           total

From Node.js: How to attach to a running process and to debug the server with a console? I found out about attaching the node debugger to a running process, and here's where I got with that:

$ sudo kill -s SIGUSR1 11812
$ sudo node debug -p 11812
connecting... ok
break in timers.js:79
  77 }
  78
  79 function listOnTimeout() {
  80   var msecs = this.msecs;
  81   var list = this;
debug> bt
#0 timers.js:79:23

So it looks like it is definitely related to a timer, but I don't know how to move further up the stack to find out where in my code the problem is getting triggered.

like image 307
Ben Dowling Avatar asked Mar 12 '14 06:03

Ben Dowling


3 Answers

My guess is that someone manually implemented a blocking "setTimeout". This could happen if someone doesn't want to release control of the main JS thread in an effort to prevent a potential race condition.

For obvious reasons this is a bad practice for production code, but I've used it on occasion in debugging to force asynchronous processes to execute in a specific order.

You could look for something silly like this:

var mockAsyncFunction = function (cb) {
    setTimeout(function () {
        cb(null, 'dummy_result')
    }, 5000);
};

var myResult = null;
mockAsyncFunction(function (err, result) {
    myResult = result;
});
var timeOut = 10000; // timeout in 10 sec.
var timeStart = new Date().getTime();


while (1) {
    if (new Date().getTime() - timeStart > 10000) {
        break;
    }
}
console.log('DONE');

or something more nefarious with nextTick recursion like:

var timeStart = new Date().getTime();

var recurseUntilDone = function () {
    if (new Date().getTime() - timeStart < 10000) {
        process.nextTick(recurseUntilDone);
    } else {
        console.log('Done recursing');
    }
};

recurseUntilDone();
like image 164
mjhm Avatar answered Nov 05 '22 07:11

mjhm


Use node-inspector to be able to pause your code when the CPU is 100% - my bet is also on some badly implemented loop checking state until a certain time has passed, but they can be hard to find.

Attach the debugger with --debug when starting node (ie. node index.js --debug) and in a separate window run node-inspector. Use Chrome to connect to your debug session (the url is outputted from node-inspector command) and wait until the problem occurs and pause the execution and you should be able to find it.

like image 27
Christian Landgren Avatar answered Nov 05 '22 05:11

Christian Landgren


We saw this too, in both production and development. Also looking for an answer. We started investigating the node run-time, but the problem is so infrequent it's not getting any priority. Because it's completely cpu bound, no system calls, it's hard capture with an strace.

It won't be timed loop checking state, because a new Date().getTime() loop does not make any calls to gettimeofday (in node v0.10.29; it just does a long series of nanosleeps in one pid and just futex calls in the other. Clever, actually). Same for Date.now().

like image 2
Andras Avatar answered Nov 05 '22 05:11

Andras