I have some simple Java code that I wrote to artificially use a lot of RAM, and I get the following timings when I use these flags:
1029.59 seconds .... -Xmx8g -Xms256m
696.44 seconds ..... -XX:ParallelGCThreads=1 -Xmx8g -Xms256m
247.27 seconds ..... -XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC -Xmx8g -Xms256m
Now, I understand why -XX:+UseConcMarkSweepGC increases performance, but why do I get a speedup when I restrict the JVM to a single GC thread? Is this an artifact of my poorly written Java code, or is this something that would also apply to properly optimized Java?
Here is my code:
import java.io.*;

class xdriver {
    static int N = 100;
    static double pi = 3.141592653589793;
    static double one = 1.0;
    static double two = 2.0;

    public static void main(String[] args) {
        //System.out.println("Program has started successfully\n");

        if(args.length == 1) {
            // assume that args[0] is an integer
            N = Integer.parseInt(args[0]);
        }

        // maybe we can get user input later on this ...
        int nr = N;
        int nt = N;
        int np = 2*N;
        double dr = 1.0/(double)(nr-1);
        double dt = pi/(double)(nt-1);
        double dp = (two*pi)/(double)(np-1);

        System.out.format("nn --> %d\n", nr*nt*np);

        if(nr*nt*np < 0) {
            System.out.format("ERROR: nr*nt*np = %d(long) which is %d(int)\n",
                (long)((long)nr*(long)nt*(long)np), nr*nt*np);
            System.exit(1);
        }

        // inserted to artificially blow up RAM
        double[][] dels = new double[nr*nt*np][3];

        double[] rs = new double[nr];
        double[] ts = new double[nt];
        double[] ps = new double[np];
        for(int ir = 0; ir < nr; ir++) {
            rs[ir] = dr*(double)(ir);
        }
        for(int it = 0; it < nt; it++) {
            ts[it] = dt*(double)(it);
        }
        for(int ip = 0; ip < np; ip++) {
            ps[ip] = dp*(double)(ip);
        }

        double C = (4.0/3.0)*pi;
        C = one/C;

        double fint = 0.0;
        int ii = 0;
        for(int ir = 0; ir < nr; ir++) {
            double r = rs[ir];
            double r2dr = r*r*dr;
            for(int it = 0; it < nt; it++) {
                double t = ts[it];
                double sint = Math.sin(t);
                for(int ip = 0; ip < np; ip++) {
                    fint += C*r2dr*sint*dt*dp;
                    dels[ii][0] = dr;
                    dels[ii][1] = dt;
                    dels[ii][2] = dp;
                    ii++; // advance to the next cell of dels
                }
            }
        }

        System.out.format("N ........ %d\n", N);
        System.out.format("fint ..... %15.10f\n", fint);
        System.out.format("err ...... %15.10f\n", Math.abs(1.0-fint));
    }
}
Other than some additional disk I/O activity for writing the log files, enabling garbage collection logging does not significantly affect server performance.
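For JDK 8 (the JVM version used in the benchmark below), GC logging can be enabled with flags such as the following; the log file name, class name, and argument are illustrative placeholders, while the heap sizes match the question:

```shell
# JDK 8-style GC logging flags; gc.log and the program arguments are placeholders
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log -Xmx8g -Xms256m xdriver 100
```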
They all block the running threads occasionally when they need to do a global GC. The concurrent GC does try to do most of its work concurrently, though.
This can be achieved by lowering the -XX:InitiatingHeapOccupancyPercent value. The default is 45, meaning the G1 GC marking cycle begins only when heap occupancy reaches 45%. Lowering the value triggers the marking cycle earlier, so a Full GC can be avoided.
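As a sketch, assuming G1 is the collector in use, the threshold could be lowered like this (the class name and argument are placeholders):

```shell
# Start G1's concurrent marking cycle at 30% heap occupancy instead of the default 45%
java -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=30 -Xmx8g -Xms256m xdriver 100
```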
The reason that a Full GC occurs is because the application allocates too many objects that can't be reclaimed quickly enough. Often concurrent marking has not been able to complete in time to start a space-reclamation phase.
I am not an expert on garbage collectors, so this is probably not the answer you'd like to get, but maybe my findings on your issue are interesting nevertheless.
First of all, I've changed your code into a JUnit test case. Then I added the JUnitBenchmarks extension from Carrot Search Labs. It runs JUnit test cases multiple times, measures runtime, and outputs performance statistics. Most importantly, JUnitBenchmarks does a 'warmup', i.e. it runs the code several times before actually taking measurements.
The final code I've run:
import com.carrotsearch.junitbenchmarks.AbstractBenchmark;
import com.carrotsearch.junitbenchmarks.BenchmarkOptions;
import com.carrotsearch.junitbenchmarks.annotation.BenchmarkHistoryChart;
import com.carrotsearch.junitbenchmarks.annotation.LabelType;

@BenchmarkOptions(benchmarkRounds = 10, warmupRounds = 5)
@BenchmarkHistoryChart(labelWith = LabelType.CUSTOM_KEY, maxRuns = 20)
public class XDriverTest extends AbstractBenchmark {
    static int N = 200;
    static double pi = 3.141592653589793;
    static double one = 1.0;
    static double two = 2.0;

    @org.junit.Test
    public void test() {
        // System.out.println("Program has started successfully\n");

        // maybe we can get user input later on this ...
        int nr = N;
        int nt = N;
        int np = 2 * N;
        double dr = 1.0 / (double) (nr - 1);
        double dt = pi / (double) (nt - 1);
        double dp = (two * pi) / (double) (np - 1);

        System.out.format("nn --> %d\n", nr * nt * np);

        if (nr * nt * np < 0) {
            System.out.format("ERROR: nr*nt*np = %d(long) which is %d(int)\n",
                (long) ((long) nr * (long) nt * (long) np), nr * nt * np);
            System.exit(1);
        }

        // inserted to artificially blow up RAM
        double[][] dels = new double[nr * nt * np][3];

        double[] rs = new double[nr];
        double[] ts = new double[nt];
        double[] ps = new double[np];
        for (int ir = 0; ir < nr; ir++) {
            rs[ir] = dr * (double) (ir);
        }
        for (int it = 0; it < nt; it++) {
            ts[it] = dt * (double) (it);
        }
        for (int ip = 0; ip < np; ip++) {
            ps[ip] = dp * (double) (ip);
        }

        double C = (4.0 / 3.0) * pi;
        C = one / C;

        double fint = 0.0;
        int ii = 0;
        for (int ir = 0; ir < nr; ir++) {
            double r = rs[ir];
            double r2dr = r * r * dr;
            for (int it = 0; it < nt; it++) {
                double t = ts[it];
                double sint = Math.sin(t);
                for (int ip = 0; ip < np; ip++) {
                    fint += C * r2dr * sint * dt * dp;
                    dels[ii][0] = dr;
                    dels[ii][1] = dt;
                    dels[ii][2] = dp;
                    ii++; // advance to the next cell of dels
                }
            }
        }

        System.out.format("N ........ %d\n", N);
        System.out.format("fint ..... %15.10f\n", fint);
        System.out.format("err ...... %15.10f\n", Math.abs(1.0 - fint));
    }
}
As you can see from the benchmark options @BenchmarkOptions(benchmarkRounds = 10, warmupRounds = 5), warmup is done by running the test method 5 times; afterwards the actual benchmark is run 10 times.
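The same warmup-then-measure cycle can be reproduced without the JUnitBenchmarks dependency using a plain System.nanoTime() harness; the class name, round counts, and stand-in workload below are illustrative, not part of the original benchmark:

```java
// Minimal warmup-then-measure harness (a sketch, not the JUnitBenchmarks API)
public class WarmupBenchmark {
    // Stand-in allocation-heavy workload; replace with the code under test
    static double work() {
        double acc = 0.0;
        double[][] block = new double[100_000][3];
        for (int i = 0; i < block.length; i++) {
            block[i][0] = i;
            acc += Math.sin(i) * 1e-6;
        }
        return acc;
    }

    public static void main(String[] args) {
        int warmupRounds = 5, benchmarkRounds = 10;
        // Warmup: let the JIT compile the hot path before timing anything
        for (int i = 0; i < warmupRounds; i++) {
            work();
        }
        // Measurement: time each round and report the mean
        double totalSeconds = 0.0;
        for (int i = 0; i < benchmarkRounds; i++) {
            long start = System.nanoTime();
            work();
            totalSeconds += (System.nanoTime() - start) / 1e9;
        }
        System.out.format("mean round time: %.6f s%n", totalSeconds / benchmarkRounds);
    }
}
```

Note that such a hand-rolled harness lacks the statistics (standard deviation, GC counts) that JUnitBenchmarks reports, so it is only a rough substitute.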
Then I ran the program above with several different GC options (each with the general heap settings -Xmx1g -Xms256m):
-XX:ParallelGCThreads=1 -Xmx1g -Xms256m
-XX:ParallelGCThreads=2 -Xmx1g -Xms256m
-XX:ParallelGCThreads=4 -Xmx1g -Xms256m
-XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
-XX:ParallelGCThreads=1 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
-XX:ParallelGCThreads=2 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
-XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC -Xmx1g -Xms256m
In order to get a summary with a chart as an HTML page, the following VM arguments were passed in addition to the GC settings mentioned above:
-Djub.consumers=CONSOLE,H2 -Djub.db.file=.benchmarks
-Djub.customkey=[CUSTOM_KEY]
(where [CUSTOM_KEY] must be a string that uniquely identifies each benchmark run, e.g. defaultGC or ParallelGCThreads=1; it is used as a label on the axis of the chart).
The following table summarizes the results:

Run  Custom key           Timestamp                test [s]
1    defaultGC            2015-05-01 19:43:53.796    10.721
2    ParallelGCThreads=1  2015-05-01 19:51:07.79      8.770
3    ParallelGCThreads=2  2015-05-01 19:56:44.985     8.737
4    ParallelGCThreads=4  2015-05-01 20:01:30.071    10.415
5    UseConcMarkSweepGC   2015-05-01 20:03:54.474     2.683
6    UseCCMS,Threads=1    2015-05-01 20:10:48.504     3.856
7    UseCCMS,Threads=2    2015-05-01 20:12:58.624     3.861
8    UseCCMS,Threads=4    2015-05-01 20:13:58.94      2.701
System info: CPU: Intel Core 2 Quad Q9400, 2.66 GHz, RAM: 4.00 GB, OS: Windows 8.1 x64, JVM: 1.8.0_05-b13.
(Note that the individual benchmark runs output more detailed information such as standard deviation, GC calls, and GC time; unfortunately this information is not available in the summary).
Interpretation
As you can see, there is a huge performance gain when -XX:+UseConcMarkSweepGC is enabled. The number of threads does not influence performance that much, and whether more threads are advantageous depends on the general GC strategy. The default GC seems to profit from one or two threads, but performance gets worse when four threads are used.
In contrast, the ConcurrentMarkSweep GC with four threads performs better than with one or two threads.
So in general, we can't say that more GC threads make performance worse.
Note that I don't know how many GC threads are used when the default GC or the ConcurrentMarkSweep GC runs without an explicit thread count.
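The thread counts the JVM actually computes can be inspected by dumping the final flag values; this is a standard HotSpot option, and the grep pattern below is just one way to filter the output:

```shell
# Print the JVM's computed defaults and filter for the GC thread counts
java -XX:+PrintFlagsFinal -version | grep -i "ParallelGCThreads\|ConcGCThreads"
```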