How to debug/analyze extremely long GC pauses in Node.js/V8

I am trying to analyze a memory/GC problem in a relatively complex Node.js server app. Even under very moderate load, it is becoming unresponsive for noticeable periods, and these pauses grow longer over time. Running with the --trace-gc parameter shows that extremely long garbage collection times are the likely cause:

[4805]      537 ms: Mark-sweep 17.6 (46.4) -> 10.3 (47.4) MB, 20 ms [allocation failure] [GC in old space requested].
[4805]     1338 ms: Mark-sweep 31.3 (58.4) -> 19.2 (57.2) MB, 40 ms [allocation failure] [promotion limit reached].
[4805]     2662 ms: Mark-sweep 58.0 (79.2) -> 43.9 (85.2) MB, 109 ms [Runtime::PerformGC] [promotion limit reached].
[4805]     4014 ms: Mark-sweep 90.1 (111.5) -> 70.6 (113.9) MB, 114 ms [allocation failure] [promotion limit reached].
[4805]     7283 ms: Mark-sweep 129.7 (153.9) -> 112.0 (158.9) MB, 511 ms [allocation failure] [promotion limit reached].
[4805]    10979 ms: Mark-sweep 184.6 (210.9) -> 160.3 (212.9) MB, 422 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  1146869 ms: Mark-sweep 243.8 (271.4) -> 191.6 (267.9) MB, 1856 ms [allocation failure] [promotion limit reached].
[4805]  1731440 ms: Mark-sweep 282.1 (307.4) -> 197.5 (298.9) MB, 1 / 11230 ms [allocation failure] [promotion limit reached].
[4805]  2024385 ms: Mark-sweep 291.0 (320.8) -> 197.3 (306.9) MB, 9076 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  2623396 ms: Mark-sweep 290.9 (317.1) -> 196.9 (311.9) MB, 1 / 15401 ms [allocation failure] [promotion limit reached].
[4805]  3223769 ms: Mark-sweep 291.4 (323.6) -> 187.8 (318.9) MB, 1 / 13385 ms [allocation failure] [promotion limit reached].
[4805]  4225777 ms: Mark-sweep 280.1 (324.2) -> 190.6 (315.9) MB, 1 / 13266 ms [allocation failure] [promotion limit reached].
[4805]  4705442 ms: Mark-sweep 286.2 (321.4) -> 195.2 (314.9) MB, 1 / 17256 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  5225595 ms: Mark-sweep 288.3 (324.0) -> 201.7 (316.9) MB, 1 / 22266 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  6127372 ms: Mark-sweep 296.5 (324.6) -> 200.5 (316.9) MB, 1 / 28325 ms [allocation failure] [promotion limit reached].
[4805]  6523938 ms: Mark-sweep 297.8 (328.9) -> 198.8 (323.9) MB, 1 / 27213 ms [allocation failure] [promotion limit reached].
[4805]  7355394 ms: Mark-sweep 292.1 (330.7) -> 223.9 (322.9) MB, 60202 ms [allocation failure] [promotion limit reached].

The full (--trace-gc-verbose) output can be found here.

These logs are the result of running the server with the following parameters:

--expose-gc --trace-gc --trace-gc-verbose --trace-gc-ignore-scavenger --max-old-space-size=1000 

The longer it runs, the longer the pauses get (often several minutes), until it eventually locks up completely after a couple of hours. Available memory never runs out, and the RSS does not even come close to the 1000 MB old space limit, so it does not seem to be a leak. It looks to me like there might be something rather unusual within the code, making it very "difficult" for the GC to do its job in acceptable time frames.
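
(As a way to sanity-check the RSS and heap numbers against the old-space limit, something like the following can be logged alongside the GC trace; a minimal sketch, and the 10-second interval is arbitrary:)

// Periodically log RSS and V8 heap usage so the numbers can be lined up
// against the --trace-gc timestamps.
setInterval(function () {
    var mem = process.memoryUsage();
    console.log('[mem] rss=' + Math.round(mem.rss / 1048576) + ' MB' +
                ', heapUsed=' + Math.round(mem.heapUsed / 1048576) + ' MB' +
                ', heapTotal=' + Math.round(mem.heapTotal / 1048576) + ' MB');
}, 10000);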

My question is: how do I go about analyzing this problem further and narrowing down the possible causes? Are there any recommended tools for issues like this one? I'm essentially looking for a more efficient approach than naively turning parts of the code off and on, which is extremely cumbersome and time-consuming.

As an aside, I would greatly appreciate links to any documentation that explains the terms/messages used in the GC debug output (such as "promotion limit reached"), and the numbers listed there. I have a very basic understanding of how the V8 GC works (this helped a lot), but most of that output is still beyond me.

In case it matters: This is running on Node.js v0.10.33 on Ubuntu 14.04 Server.

EDIT: A while ago we moved on to io.js, where this problem simply does not occur anymore (presumably due to the much more recent V8 version). I never did find the cause for this issue with Node v0.10 though, let alone a fix.

asked Jan 13 '15 by d0gb3r7


1 Answer

Are you able to reproduce the issue on a single node? I think if I were in that situation, I would probably do a mix of the following:

  • Write a load generator that lets me reproduce the problem against a local instance
  • If that's not possible, put an instance in production that receives a subset of the traffic, and modify it to do the steps below
  • Add node-heapdump to your source and have it write a snapshot to a JSON (.heapsnapshot) file every N minutes (see the sketch after this list)
  • If you're running locally, you might also be able to take advantage of memwatch
  • Wait for the slow GC to start
  • Grab a number of heap dumps around the time when you know the slow GC started to occur
  • Load them into Chrome DevTools and analyze them using the three-snapshot technique (I suppose you could call it the N-snapshot technique in our case)
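
A minimal sketch of the snapshot-on-an-interval idea (the interval, file naming, and the optional memwatch lines are assumptions, not something from the original setup). node-heapdump writes .heapsnapshot files, which are JSON and load directly into Chrome DevTools:

// Sketch: take a heap snapshot every N minutes so they can be diffed later.
// Requires: npm install heapdump (and optionally memwatch)
var heapdump = require('heapdump');

var INTERVAL_MINUTES = 10; // assumption: pick something that brackets the slow-GC window

setInterval(function () {
    // Writes a V8 heap snapshot (JSON) into the working directory,
    // named with a timestamp so the dumps sort chronologically.
    heapdump.writeSnapshot(process.cwd() + '/' + Date.now() + '.heapsnapshot');
}, INTERVAL_MINUTES * 60 * 1000);

// Optionally, memwatch can report post-GC heap stats and suspected leaks:
// var memwatch = require('memwatch');
// memwatch.on('stats', function (stats) { console.log('[gc stats]', stats); });
// memwatch.on('leak', function (info) { console.log('[leak suspected]', info); });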

Basically, you will load the heap snapshots and start looking through them to understand what type of object is stacking up and what is holding on to it; from that you will get an understanding of why GC is taking so long.

Available memory never runs out, and the RSS does not even come close to the 1000 MB old space limit, so it does not seem to be a leak. It looks to me like there might be something rather unusual within the code, making it very "difficult" for the GC to do its job in acceptable time frames.

Here you may be looking for long and circular retainer trees. But even if that is the case, you should be able to identify what the root of that tree is, what is in it, and experiment with ways to reduce or remove it.

I also agree with @dandavis and suspect closures.
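
For illustration only (not taken from the actual code), this is the kind of closure-based retention pattern that can quietly build a huge, deeply linked object graph for the mark phase to walk:

// Contrived sketch of accidental retention via closures: every handler closes
// over the previous one, so the whole chain of contexts stays reachable and the
// old-space graph the marker has to traverse keeps growing.
var lastHandler = null;

function makeHandler(payload) {
    var previous = lastHandler; // the closure keeps the entire chain alive
    lastHandler = function handle() {
        // 'previous' and 'payload' are retained for as long as
        // 'lastHandler' (or anything referencing it) is reachable.
        return previous ? previous() + payload.length : payload.length;
    };
    return lastHandler;
}

// Each call links another context into the chain; none of it can be collected.
for (var i = 0; i < 100000; i++) {
    makeHandler(new Array(100).join('x'));
}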

answered Nov 07 '22 by j03m