
Why does restricting GC to 1 thread increase performance?

I wrote some simple Java code that artificially uses a lot of RAM, and I get the following run times with these flags:

1029.59 seconds .... -Xmx8g -Xms256m
696.44 seconds ..... -XX:ParallelGCThreads=1  -Xmx8g -Xms256m
247.27 seconds ..... -XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC  -Xmx8g -Xms256m

Now, I understand why -XX:+UseConcMarkSweepGC increases performance, but why do I get a speedup when I restrict the GC to a single thread? Is this an artifact of my poorly written Java code, or is it something that would apply to properly optimized Java as well?

Here is my code:

import java.io.*;

class xdriver {
  static int N = 100;
  static double pi = 3.141592653589793;
  static double one = 1.0;
  static double two = 2.0;

  public static void main(String[] args) {
    //System.out.println("Program has started successfully\n");

    if( args.length == 1) {
      // assume that args[0] is an integer
      N = Integer.parseInt(args[0]);
    }   

    // maybe we can get user input later on this ...
    int nr = N;
    int nt = N;
    int np = 2*N;

    double dr = 1.0/(double)(nr-1);
    double dt = pi/(double)(nt-1);
    double dp = (two*pi)/(double)(np-1);

    System.out.format("nn --> %d\n", nr*nt*np);

    if(nr*nt*np < 0) {
      System.out.format("ERROR: nr*nt*np = %d(long) which is %d(int)\n", (long)( (long)nr*(long)nt*(long)np), nr*nt*np);
      System.exit(1);
    }   

    // inserted to artificially blow up RAM
    double[][] dels = new double [nr*nt*np][3];

    double[] rs = new double[nr];
    double[] ts = new double[nt];
    double[] ps = new double[np];

    for(int ir = 0; ir < nr; ir++) {
      rs[ir] = dr*(double)(ir);
    }   
    for(int it = 0; it < nt; it++) {
      ts[it] = dt*(double)(it);
    }   
    for(int ip = 0; ip < np; ip++) {
      ps[ip] = dp*(double)(ip);
    }   

    double C = (4.0/3.0)*pi;
    C = one/C;

    double fint = 0.0;
    int ii = 0;
    for(int ir = 0; ir < nr; ir++) {
      double r = rs[ir];
      double r2dr = r*r*dr;
      for(int it = 0; it < nt; it++) {
        double t = ts[it];
        double sint = Math.sin(t);
        for(int ip = 0; ip < np; ip++) {
          fint += C*r2dr*sint*dt*dp;

          dels[ii][0] = dr; 
          dels[ii][1] = dt; 
          dels[ii][2] = dp; 
        }   
      }   
    }   

    System.out.format("N ........ %d\n", N);
    System.out.format("fint ..... %15.10f\n", fint);
    System.out.format("err ...... %15.10f\n", Math.abs(1.0-fint));
  }
}
asked May 01 '15 16:05 by drjrm3



1 Answer

I am not an expert on garbage collectors, so this is probably not the answer you'd like to get, but maybe my findings on your issue are interesting nevertheless.

First of all, I've changed your code into a JUnit test case. Then I've added the JUnitBenchmarks extension from Carrot Search Labs. It runs JUnit test cases multiple times, measures the runtime, and outputs some performance statistics. Most important is the fact that JUnitBenchmarks does 'warmup', i.e. it runs the code several times before actually doing any measurement.

The final code I've run:

import com.carrotsearch.junitbenchmarks.AbstractBenchmark;
import com.carrotsearch.junitbenchmarks.BenchmarkOptions;
import com.carrotsearch.junitbenchmarks.annotation.BenchmarkHistoryChart;
import com.carrotsearch.junitbenchmarks.annotation.LabelType;

@BenchmarkOptions(benchmarkRounds = 10, warmupRounds = 5)
@BenchmarkHistoryChart(labelWith = LabelType.CUSTOM_KEY, maxRuns = 20)
public class XDriverTest extends AbstractBenchmark {
    static int N = 200;
    static double pi = 3.141592653589793;
    static double one = 1.0;
    static double two = 2.0;

    @org.junit.Test
    public void test() {
        // System.out.println("Program has started successfully\n");
        // maybe we can get user input later on this ...
        int nr = N;
        int nt = N;
        int np = 2 * N;

        double dr = 1.0 / (double) (nr - 1);
        double dt = pi / (double) (nt - 1);
        double dp = (two * pi) / (double) (np - 1);

        System.out.format("nn --> %d\n", nr * nt * np);

        if (nr * nt * np < 0) {
            System.out.format("ERROR: nr*nt*np = %d(long) which is %d(int)\n",
                    (long) ((long) nr * (long) nt * (long) np), nr * nt * np);
            System.exit(1);
        }

        // inserted to artificially blow up RAM
        double[][] dels = new double[nr * nt * np][3];

        double[] rs = new double[nr];
        double[] ts = new double[nt];
        double[] ps = new double[np];

        for (int ir = 0; ir < nr; ir++) {
            rs[ir] = dr * (double) (ir);
        }
        for (int it = 0; it < nt; it++) {
            ts[it] = dt * (double) (it);
        }
        for (int ip = 0; ip < np; ip++) {
            ps[ip] = dp * (double) (ip);
        }

        double C = (4.0 / 3.0) * pi;
        C = one / C;

        double fint = 0.0;
        int ii = 0;
        for (int ir = 0; ir < nr; ir++) {
            double r = rs[ir];
            double r2dr = r * r * dr;
            for (int it = 0; it < nt; it++) {
                double t = ts[it];
                double sint = Math.sin(t);
                for (int ip = 0; ip < np; ip++) {
                    fint += C * r2dr * sint * dt * dp;

                    dels[ii][0] = dr;
                    dels[ii][1] = dt;
                    dels[ii][2] = dp;
                }
            }
        }

        System.out.format("N ........ %d\n", N);
        System.out.format("fint ..... %15.10f\n", fint);
        System.out.format("err ...... %15.10f\n", Math.abs(1.0 - fint));
    }
}

As you can see from the benchmark options @BenchmarkOptions(benchmarkRounds = 10, warmupRounds = 5), warmup is done by running the test method 5 times; afterwards, the actual benchmark is run 10 times.
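To make the warmup idea concrete, here is a minimal hand-rolled sketch of the same scheme: a few unmeasured rounds, then an average over the measured rounds. It is not how JUnitBenchmarks is implemented; the class name WarmupSketch, the method averageSeconds, and the small workload are made up for illustration.

```java
class WarmupSketch {
    /**
     * Runs the task warmupRounds times without measuring, then returns the
     * average wall-clock time in seconds over benchmarkRounds measured runs.
     */
    static double averageSeconds(Runnable task, int warmupRounds, int benchmarkRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            task.run(); // timing discarded: gives the JIT a chance to compile hot code
        }
        long totalNanos = 0L;
        for (int i = 0; i < benchmarkRounds; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e9 / benchmarkRounds;
    }

    public static void main(String[] args) {
        // Hypothetical workload: many small arrays, similar in spirit to 'dels'.
        Runnable task = () -> {
            double[][] dels = new double[200_000][3];
            dels[0][0] = 1.0;
        };
        double avg = averageSeconds(task, 5, 10);
        System.out.format("average per round: %.4f s%n", avg);
    }
}
```

The values 5 and 10 mirror warmupRounds and benchmarkRounds from the annotation above. A real harness also reports deviation and GC activity, which this sketch leaves out.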

Then I ran the program above with several different GC options (each with general heap settings of -Xmx1g -Xms256m):

  • default (no special options)
  • -XX:ParallelGCThreads=1 -Xmx1g -Xms256m
  • -XX:ParallelGCThreads=2 -Xmx1g -Xms256m
  • -XX:ParallelGCThreads=4 -Xmx1g -Xms256m
  • -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
  • -XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
  • -XX:ParallelGCThreads=2 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
  • -XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m

In order to get a summary with chart as HTML page, the following VM arguments have been passed in addition to the GC settings mentioned above:

-Djub.consumers=CONSOLE,H2 -Djub.db.file=.benchmarks
-Djub.customkey=[CUSTOM_KEY]

(Here [CUSTOM_KEY] must be a string that uniquely identifies each benchmark run, e.g. defaultGC or ParallelGCThreads=1. It is used as the label on the axis of the chart.)

The following table summarizes the results:

Run Custom key          Timestamp                   test
1   defaultGC           2015-05-01 19:43:53.796     10.721
2   ParallelGCThreads=1 2015-05-01 19:51:07.79       8.770
3   ParallelGCThreads=2 2015-05-01 19:56:44.985      8.737
4   ParallelGCThreads=4 2015-05-01 20:01:30.071     10.415
5   UseConcMarkSweepGC  2015-05-01 20:03:54.474      2.683
6   UseCCMS,Threads=1   2015-05-01 20:10:48.504      3.856
7   UseCCMS,Threads=2   2015-05-01 20:12:58.624      3.861
8   UseCCMS,Threads=4   2015-05-01 20:13:58.94       2.701

System info: CPU: Intel Core 2 Quad Q9400, 2.66 GHz, RAM: 4.00 GB, OS: Windows 8.1 x64, JVM: 1.8.0_05-b13.

(Note that the individual benchmark runs output more detailed information such as the standard deviation, the number of GC calls, and the GC time; unfortunately this information is not available in the summary.)
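If you want the GC call count and GC time without the benchmark library, the standard java.lang.management API exposes both. The following is only a sketch under assumptions: the class name GcStatsSketch, its helper methods, and the allocation loop are invented for illustration, and the numbers it prints depend on the JVM and the flags in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStatsSketch {
    /** Total number of collections over all collectors; undefined (-1) values are skipped. */
    static long totalGcCount() {
        long sum = 0L;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                sum += count;
            }
        }
        return sum;
    }

    /** Approximate accumulated collection time in milliseconds over all collectors. */
    static long totalGcMillis() {
        long sum = 0L;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                sum += millis;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long countBefore = totalGcCount();
        long millisBefore = totalGcMillis();

        // Hypothetical workload: repeatedly allocate many small arrays.
        for (int round = 0; round < 20; round++) {
            double[][] dels = new double[100_000][3];
            dels[round][0] = round;
        }

        System.out.format("GC calls: %d, GC time: %d ms%n",
                totalGcCount() - countBefore, totalGcMillis() - millisBefore);
    }
}
```

Taking the difference of the counters before and after the workload isolates the collections that the workload itself caused.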

Interpretation

As you can see, there is a huge performance gain when -XX:+UseConcMarkSweepGC is enabled (about 2.7 versus 10.7 for the default GC in the table above). The number of threads does not influence performance nearly as much, and whether more threads help depends on the general GC strategy. The default GC was fastest with one or two threads (8.770 and 8.737), but performance got worse with four threads (10.415).

In contrast, the ConcurrentMarkSweep GC performs better with four threads than with one or two.

So in general, we can't say that more GC threads make performance worse.

Note that I don't know how many GC threads are used when the default GC or the ConcurrentMarkSweep GC is run without specifying the number of threads.
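One way to find out, on a HotSpot JVM at least, is to ask the running VM for the effective value of ParallelGCThreads. The sketch below assumes HotSpot (it uses com.sun.management.HotSpotDiagnosticMXBean, which other JVMs may not provide) and falls back to "unknown" otherwise; the class name GcThreadsSketch and its methods are made up for illustration. Running java -XX:+PrintFlagsFinal -version with the same GC flags should show the same value from the command line.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcThreadsSketch {
    /** Effective ParallelGCThreads value as a string, or "unknown" if it cannot be read. */
    static String parallelGcThreads() {
        try {
            HotSpotDiagnosticMXBean hotspot =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return hotspot.getVMOption("ParallelGCThreads").getValue();
        } catch (RuntimeException | LinkageError e) {
            // Not a HotSpot JVM, or the option does not exist there.
            return "unknown";
        }
    }

    /** Names of the collectors the JVM is actually running with. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Collectors: " + collectorNames());
        System.out.println("ParallelGCThreads: " + parallelGcThreads());
        System.out.println("Available processors: " + Runtime.getRuntime().availableProcessors());
    }
}
```

Printing this once per benchmark configuration would show which collector and thread count the "default" run actually used.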

answered Nov 15 '22 17:11 by isnot2bad