How to debug/analyze extremely long GC pauses in Node.js/V8

I am trying to analyze a memory/GC problem in a relatively complex Node.js server app. Even under very moderate load, it is becoming unresponsive for noticeable periods, and these pauses grow longer over time. Running with the --trace-gc parameter shows that extremely long garbage collection times are the likely cause:

[4805]      537 ms: Mark-sweep 17.6 (46.4) -> 10.3 (47.4) MB, 20 ms [allocation failure] [GC in old space requested].
[4805]     1338 ms: Mark-sweep 31.3 (58.4) -> 19.2 (57.2) MB, 40 ms [allocation failure] [promotion limit reached].
[4805]     2662 ms: Mark-sweep 58.0 (79.2) -> 43.9 (85.2) MB, 109 ms [Runtime::PerformGC] [promotion limit reached].
[4805]     4014 ms: Mark-sweep 90.1 (111.5) -> 70.6 (113.9) MB, 114 ms [allocation failure] [promotion limit reached].
[4805]     7283 ms: Mark-sweep 129.7 (153.9) -> 112.0 (158.9) MB, 511 ms [allocation failure] [promotion limit reached].
[4805]    10979 ms: Mark-sweep 184.6 (210.9) -> 160.3 (212.9) MB, 422 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  1146869 ms: Mark-sweep 243.8 (271.4) -> 191.6 (267.9) MB, 1856 ms [allocation failure] [promotion limit reached].
[4805]  1731440 ms: Mark-sweep 282.1 (307.4) -> 197.5 (298.9) MB, 1 / 11230 ms [allocation failure] [promotion limit reached].
[4805]  2024385 ms: Mark-sweep 291.0 (320.8) -> 197.3 (306.9) MB, 9076 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  2623396 ms: Mark-sweep 290.9 (317.1) -> 196.9 (311.9) MB, 1 / 15401 ms [allocation failure] [promotion limit reached].
[4805]  3223769 ms: Mark-sweep 291.4 (323.6) -> 187.8 (318.9) MB, 1 / 13385 ms [allocation failure] [promotion limit reached].
[4805]  4225777 ms: Mark-sweep 280.1 (324.2) -> 190.6 (315.9) MB, 1 / 13266 ms [allocation failure] [promotion limit reached].
[4805]  4705442 ms: Mark-sweep 286.2 (321.4) -> 195.2 (314.9) MB, 1 / 17256 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  5225595 ms: Mark-sweep 288.3 (324.0) -> 201.7 (316.9) MB, 1 / 22266 ms [Runtime::PerformGC] [promotion limit reached].
[4805]  6127372 ms: Mark-sweep 296.5 (324.6) -> 200.5 (316.9) MB, 1 / 28325 ms [allocation failure] [promotion limit reached].
[4805]  6523938 ms: Mark-sweep 297.8 (328.9) -> 198.8 (323.9) MB, 1 / 27213 ms [allocation failure] [promotion limit reached].
[4805]  7355394 ms: Mark-sweep 292.1 (330.7) -> 223.9 (322.9) MB, 60202 ms [allocation failure] [promotion limit reached].

The full (--trace-gc-verbose) output can be found here.

These logs are the result of running the server with the following parameters:

--expose-gc --trace-gc --trace-gc-verbose --trace-gc-ignore-scavenger --max-old-space-size=1000 

The longer it runs, the longer the pauses get (often several minutes), until it eventually locks up completely after a couple of hours. Available memory never runs out, and the RSS does not even come close to the 1000 MB old space limit, so it does not seem to be a leak. It looks to me like there might be something rather unusual within the code, making it very "difficult" for the GC to do its job in acceptable time frames.
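
(As a way to sanity-check the RSS and heap numbers against the old-space limit, something like the following can be logged alongside the GC trace; a minimal sketch, and the 10-second interval is arbitrary:)

// Periodically log RSS and V8 heap usage so the numbers can be lined up
// against the --trace-gc timestamps.
setInterval(function () {
    var mem = process.memoryUsage();
    console.log('[mem] rss=' + Math.round(mem.rss / 1048576) + ' MB' +
                ', heapUsed=' + Math.round(mem.heapUsed / 1048576) + ' MB' +
                ', heapTotal=' + Math.round(mem.heapTotal / 1048576) + ' MB');
}, 10000);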

My question is: how do I go about analyzing this problem further and narrowing down the possible causes? Are there any recommended tools for issues like this one? I'm essentially looking for a more efficient approach than naively turning parts of the code off and on, which is extremely cumbersome and time-consuming.

As an aside, I would greatly appreciate links to any documentation that explains the terms/messages used in the GC debug output (such as "promotion limit reached"), and the numbers listed there. I have a very basic understanding of how the V8 GC works (this helped a lot), but most of that output is still beyond me.

In case it matters: This is running on Node.js v0.10.33 on Ubuntu 14.04 Server.

EDIT: A while ago we moved on to io.js, where this problem simply does not occur anymore (presumably due to the much more recent V8 version). I never did find the cause for this issue with Node v0.10 though, let alone a fix.

asked Jan 13 '15 by d0gb3r7


1 Answer

Are you able to reproduce the issue on a single node? I think if I were in that situation, I would probably do a mix of the following:

  • Write a load generator that lets me reproduce the problem against a local instance
  • If that's not possible, put an instance in production that receives a subset of the traffic, and modify it to do the steps below
  • Add node-heapdump to your source and have it write a snapshot to a JSON (.heapsnapshot) file every N minutes (see the sketch after this list)
  • If you're running locally, you might also be able to take advantage of memwatch
  • Wait for the slow GC to start
  • Grab a number of heap dumps around the time when you know the slow GC started to occur
  • Load them into Chrome DevTools and analyze them using the three-snapshot technique (I suppose you could call it the N-snapshot technique in our case)
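
A minimal sketch of the snapshot-on-an-interval idea (the interval, file naming, and the optional memwatch lines are assumptions, not something from the original setup). node-heapdump writes .heapsnapshot files, which are JSON and load directly into Chrome DevTools:

// Sketch: take a heap snapshot every N minutes so they can be diffed later.
// Requires: npm install heapdump (and optionally memwatch)
var heapdump = require('heapdump');

var INTERVAL_MINUTES = 10; // assumption: pick something that brackets the slow-GC window

setInterval(function () {
    // Writes a V8 heap snapshot (JSON) into the working directory,
    // named with a timestamp so the dumps sort chronologically.
    heapdump.writeSnapshot(process.cwd() + '/' + Date.now() + '.heapsnapshot');
}, INTERVAL_MINUTES * 60 * 1000);

// Optionally, memwatch can report post-GC heap stats and suspected leaks:
// var memwatch = require('memwatch');
// memwatch.on('stats', function (stats) { console.log('[gc stats]', stats); });
// memwatch.on('leak', function (info) { console.log('[leak suspected]', info); });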

Basically, you will load the heap snapshots and start looking through them to understand what type of object is stacking up and what is holding on to it; from that you will get an understanding of why GC is taking so long.

Available memory never runs out, and the RSS does not even come close to the 1000 MB old space limit, so it does not seem to be a leak. It looks to me like there might be something rather unusual within the code, making it very "difficult" for the GC to do its job in acceptable time frames.

Here you may be looking for long and circular retainer trees. But even if that is the case, you should be able to identify what the root of that tree is, what is in it, and experiment with ways to reduce or remove it.

I also agree with @dandavis and suspect closures.
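
For illustration only (not taken from the actual code), this is the kind of closure-based retention pattern that can quietly build a huge, deeply linked object graph for the mark phase to walk:

// Contrived sketch of accidental retention via closures: every handler closes
// over the previous one, so the whole chain of contexts stays reachable and the
// old-space graph the marker has to traverse keeps growing.
var lastHandler = null;

function makeHandler(payload) {
    var previous = lastHandler; // the closure keeps the entire chain alive
    lastHandler = function handle() {
        // 'previous' and 'payload' are retained for as long as
        // 'lastHandler' (or anything referencing it) is reachable.
        return previous ? previous() + payload.length : payload.length;
    };
    return lastHandler;
}

// Each call links another context into the chain; none of it can be collected.
for (var i = 0; i < 100000; i++) {
    makeHandler(new Array(100).join('x'));
}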

answered Nov 07 '22 by j03m