I have some simple Java code that I wrote to artificially use a lot of RAM, and I get the following timings when I use these flags:
1029.59 seconds .... -Xmx8g -Xms256m
696.44 seconds ..... -XX:ParallelGCThreads=1 -Xmx8g -Xms256m
247.27 seconds ..... -XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC -Xmx8g -Xms256m
Now, I understand why -XX:+UseConcMarkSweepGC increases performance, but why do I get a speedup when I restrict the JVM to a single GC thread? Is this an artifact of my poorly written Java code, or is this something that would also apply to properly optimized Java?
Here is my code:
import java.io.*;

class xdriver {
    static int N = 100;
    static double pi = 3.141592653589793;
    static double one = 1.0;
    static double two = 2.0;

    public static void main(String[] args) {
        //System.out.println("Program has started successfully\n");

        if(args.length == 1) {
            // assume that args[0] is an integer
            N = Integer.parseInt(args[0]);
        }

        // maybe we can get user input later on this ...
        int nr = N;
        int nt = N;
        int np = 2*N;
        double dr = 1.0/(double)(nr-1);
        double dt = pi/(double)(nt-1);
        double dp = (two*pi)/(double)(np-1);

        System.out.format("nn --> %d\n", nr*nt*np);

        if(nr*nt*np < 0) {
            System.out.format("ERROR: nr*nt*np = %d(long) which is %d(int)\n",
                (long)((long)nr*(long)nt*(long)np), nr*nt*np);
            System.exit(1);
        }

        // inserted to artificially blow up RAM
        double[][] dels = new double[nr*nt*np][3];

        double[] rs = new double[nr];
        double[] ts = new double[nt];
        double[] ps = new double[np];
        for(int ir = 0; ir < nr; ir++) {
            rs[ir] = dr*(double)(ir);
        }
        for(int it = 0; it < nt; it++) {
            ts[it] = dt*(double)(it);
        }
        for(int ip = 0; ip < np; ip++) {
            ps[ip] = dp*(double)(ip);
        }

        double C = (4.0/3.0)*pi;
        C = one/C;

        double fint = 0.0;
        int ii = 0;
        for(int ir = 0; ir < nr; ir++) {
            double r = rs[ir];
            double r2dr = r*r*dr;
            for(int it = 0; it < nt; it++) {
                double t = ts[it];
                double sint = Math.sin(t);
                for(int ip = 0; ip < np; ip++) {
                    fint += C*r2dr*sint*dt*dp;
                    dels[ii][0] = dr;
                    dels[ii][1] = dt;
                    dels[ii][2] = dp;
                    ii++; // advance to the next cell of dels
                }
            }
        }

        System.out.format("N ........ %d\n", N);
        System.out.format("fint ..... %15.10f\n", fint);
        System.out.format("err ...... %15.10f\n", Math.abs(1.0-fint));
    }
}
Other than some additional disk I/O activity for writing the log files, enabling garbage collection logging does not significantly affect server performance.
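For JDK 8 (the JVM version used in the benchmark below), GC logging can be enabled with flags such as the following; the log file name, class name, and argument are illustrative placeholders, while the heap sizes match the question:

```shell
# JDK 8-style GC logging flags; gc.log and the program arguments are placeholders
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log -Xmx8g -Xms256m xdriver 100
```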
They all block the running threads occasionally when they need to do a global GC. The concurrent GC does try to do most of its work concurrently, though.
This can be achieved by lowering the -XX:InitiatingHeapOccupancyPercent value. The default is 45, meaning the G1 GC marking cycle begins only when heap occupancy reaches 45%. Lowering the value triggers the marking cycle earlier, so a Full GC can be avoided.
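As a sketch, assuming G1 is the collector in use, the threshold could be lowered like this (the class name and argument are placeholders):

```shell
# Start G1's concurrent marking cycle at 30% heap occupancy instead of the default 45%
java -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=30 -Xmx8g -Xms256m xdriver 100
```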
The reason that a Full GC occurs is because the application allocates too many objects that can't be reclaimed quickly enough. Often concurrent marking has not been able to complete in time to start a space-reclamation phase.
I am not an expert on garbage collectors, so this is probably not the answer you'd like to get, but maybe my findings on your issue are interesting nevertheless.
First of all, I've changed your code into a JUnit test case. Then I added the JUnitBenchmarks extension from Carrot Search Labs. It runs JUnit test cases multiple times, measures runtime, and outputs performance statistics. Most importantly, JUnitBenchmarks does a 'warmup', i.e. it runs the code several times before actually taking measurements.
The final code I've run:
import com.carrotsearch.junitbenchmarks.AbstractBenchmark;
import com.carrotsearch.junitbenchmarks.BenchmarkOptions;
import com.carrotsearch.junitbenchmarks.annotation.BenchmarkHistoryChart;
import com.carrotsearch.junitbenchmarks.annotation.LabelType;

@BenchmarkOptions(benchmarkRounds = 10, warmupRounds = 5)
@BenchmarkHistoryChart(labelWith = LabelType.CUSTOM_KEY, maxRuns = 20)
public class XDriverTest extends AbstractBenchmark {
    static int N = 200;
    static double pi = 3.141592653589793;
    static double one = 1.0;
    static double two = 2.0;

    @org.junit.Test
    public void test() {
        // System.out.println("Program has started successfully\n");

        // maybe we can get user input later on this ...
        int nr = N;
        int nt = N;
        int np = 2 * N;
        double dr = 1.0 / (double) (nr - 1);
        double dt = pi / (double) (nt - 1);
        double dp = (two * pi) / (double) (np - 1);

        System.out.format("nn --> %d\n", nr * nt * np);

        if (nr * nt * np < 0) {
            System.out.format("ERROR: nr*nt*np = %d(long) which is %d(int)\n",
                (long) ((long) nr * (long) nt * (long) np), nr * nt * np);
            System.exit(1);
        }

        // inserted to artificially blow up RAM
        double[][] dels = new double[nr * nt * np][3];

        double[] rs = new double[nr];
        double[] ts = new double[nt];
        double[] ps = new double[np];
        for (int ir = 0; ir < nr; ir++) {
            rs[ir] = dr * (double) (ir);
        }
        for (int it = 0; it < nt; it++) {
            ts[it] = dt * (double) (it);
        }
        for (int ip = 0; ip < np; ip++) {
            ps[ip] = dp * (double) (ip);
        }

        double C = (4.0 / 3.0) * pi;
        C = one / C;

        double fint = 0.0;
        int ii = 0;
        for (int ir = 0; ir < nr; ir++) {
            double r = rs[ir];
            double r2dr = r * r * dr;
            for (int it = 0; it < nt; it++) {
                double t = ts[it];
                double sint = Math.sin(t);
                for (int ip = 0; ip < np; ip++) {
                    fint += C * r2dr * sint * dt * dp;
                    dels[ii][0] = dr;
                    dels[ii][1] = dt;
                    dels[ii][2] = dp;
                    ii++; // advance to the next cell of dels
                }
            }
        }

        System.out.format("N ........ %d\n", N);
        System.out.format("fint ..... %15.10f\n", fint);
        System.out.format("err ...... %15.10f\n", Math.abs(1.0 - fint));
    }
}
As you can see from the benchmark options @BenchmarkOptions(benchmarkRounds = 10, warmupRounds = 5), warmup is done by running the test method 5 times; afterwards the actual benchmark is run 10 times.
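The same warmup-then-measure cycle can be reproduced without the JUnitBenchmarks dependency using a plain System.nanoTime() harness; the class name, round counts, and stand-in workload below are illustrative, not part of the original benchmark:

```java
// Minimal warmup-then-measure harness (a sketch, not the JUnitBenchmarks API)
public class WarmupBenchmark {
    // Stand-in allocation-heavy workload; replace with the code under test
    static double work() {
        double acc = 0.0;
        double[][] block = new double[100_000][3];
        for (int i = 0; i < block.length; i++) {
            block[i][0] = i;
            acc += Math.sin(i) * 1e-6;
        }
        return acc;
    }

    public static void main(String[] args) {
        int warmupRounds = 5, benchmarkRounds = 10;
        // Warmup: let the JIT compile the hot path before timing anything
        for (int i = 0; i < warmupRounds; i++) {
            work();
        }
        // Measurement: time each round and report the mean
        double totalSeconds = 0.0;
        for (int i = 0; i < benchmarkRounds; i++) {
            long start = System.nanoTime();
            work();
            totalSeconds += (System.nanoTime() - start) / 1e9;
        }
        System.out.format("mean round time: %.6f s%n", totalSeconds / benchmarkRounds);
    }
}
```

Note that such a hand-rolled harness lacks the statistics (standard deviation, GC counts) that JUnitBenchmarks reports, so it is only a rough substitute.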
Then I ran the program above with several different GC options (each with the general heap settings -Xmx1g -Xms256m):
-XX:ParallelGCThreads=1 -Xmx1g -Xms256m
-XX:ParallelGCThreads=2 -Xmx1g -Xms256m
-XX:ParallelGCThreads=4 -Xmx1g -Xms256m
-XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
-XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
-XX:ParallelGCThreads=2 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
-XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
In order to get a summary with a chart as an HTML page, the following VM arguments were passed in addition to the GC settings mentioned above:
-Djub.consumers=CONSOLE,H2 -Djub.db.file=.benchmarks
-Djub.customkey=[CUSTOM_KEY]
(where [CUSTOM_KEY] must be a string that uniquely identifies each benchmark run, e.g. defaultGC or ParallelGCThreads=1; it is used as a label on the axis of the chart).
The following table summarizes the results:

Run  Custom key           Timestamp                test [s]
1    defaultGC            2015-05-01 19:43:53.796    10.721
2    ParallelGCThreads=1  2015-05-01 19:51:07.79      8.770
3    ParallelGCThreads=2  2015-05-01 19:56:44.985     8.737
4    ParallelGCThreads=4  2015-05-01 20:01:30.071    10.415
5    UseConcMarkSweepGC   2015-05-01 20:03:54.474     2.683
6    UseCCMS,Threads=1    2015-05-01 20:10:48.504     3.856
7    UseCCMS,Threads=2    2015-05-01 20:12:58.624     3.861
8    UseCCMS,Threads=4    2015-05-01 20:13:58.94      2.701
System info: CPU: Intel Core 2 Quad Q9400, 2.66 GHz, RAM: 4.00 GB, OS: Windows 8.1 x64, JVM: 1.8.0_05-b13.
(Note that the individual benchmark runs output more detailed information such as standard deviation, GC calls, and GC time; unfortunately this information is not available in the summary).
Interpretation
As you can see, there is a huge performance gain when -XX:+UseConcMarkSweepGC is enabled. The number of threads does not influence performance that much, and whether more threads are advantageous depends on the general GC strategy. The default GC seems to profit from one or two threads, but performance gets worse when four threads are used.
In contrast, the ConcurrentMarkSweep GC with four threads performs better than with one or two threads.
So in general, we can't say that more GC threads make performance worse.
Note that I don't know how many GC threads are used when the default GC or the ConcurrentMarkSweep GC runs without an explicit thread count.
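The thread counts the JVM actually computes can be inspected by dumping the final flag values; this is a standard HotSpot option, and the grep pattern below is just one way to filter the output:

```shell
# Print the JVM's computed defaults and filter for the GC thread counts
java -XX:+PrintFlagsFinal -version | grep -i "ParallelGCThreads\|ConcGCThreads"
```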