Observe the following definition of a thread subclass (the entire runnable Java source file is included at the end of the question for your convenience):
final class Worker extends Thread {
    Foo[] array = new Foo[1024];
    int sz;

    public Worker(int _sz) {
        sz = _sz;
    }

    public void run() {
        //Foo[] arr = new Foo[1024];
        Foo[] arr = array;
        loop(arr);
    }

    public void loop(Foo[] arr) {
        int i = 0;
        int pos = 512;
        Foo v = new Foo();
        while (i < sz) {
            if (i % 2 == 0) {
                arr[pos] = v;
                pos += 1;
            } else {
                pos -= 1;
                v = arr[pos];
            }
            i++;
        }
    }
}
Explanation: The program starts -Dpar such threads and sets the sz of each thread to -Dsize / -Dpar, where -Dsize and -Dpar are set through the command line when running the program. Each thread object has a field array which is initialized with a fresh 1024-element array. The reasoning is that we want to divide an equal amount of work between a different number of threads - we expect the program to scale.
Each thread is then started and the time needed for all the threads to complete is measured. We do multiple measurements to counter any JIT-related effects, as shown below. Each thread runs a loop. Within the loop, the thread writes an element at position 512 of the array in even iterations, and reads that same element at position 512 in odd iterations. Only local variables are modified otherwise.
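To make the access pattern concrete, here is a small standalone trace (an illustrative sketch, not part of the benchmark): pos starts at 512, every even iteration writes arr[512] and bumps pos to 513, and the following odd iteration moves pos back to 512 and reads arr[512], so every access hits the same slot.
class AccessPatternTrace {
    public static void main(String[] args) {
        int pos = 512;
        for (int i = 0; i < 6; i++) {
            if (i % 2 == 0) {
                System.out.println("i = " + i + ": write arr[" + pos + "]");
                pos += 1;
            } else {
                pos -= 1;
                System.out.println("i = " + i + ": read  arr[" + pos + "]");
            }
        }
    }
}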
The full program is included below.
Analysis:
Tested with -verbose:gc - there is no garbage collection occurring during the run of this program.
Run command:
java -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=1 org.scalapool.bench.MultiStackJavaExperiment 7
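The 2-, 4- and 8-thread measurements below are obtained by changing only -Dpar in the same command, for example:
java -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=2 org.scalapool.bench.MultiStackJavaExperiment 7
java -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=4 org.scalapool.bench.MultiStackJavaExperiment 7
java -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=8 org.scalapool.bench.MultiStackJavaExperiment 7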
CASE 1: Running times for 1, 2, 4 and 8 threads, in that order (7 repetitions):
>>> All running times: [2149, 2227, 1974, 1948, 1803, 2283, 1878]
>>> All running times: [1140, 1124, 2022, 1141, 2028, 2004, 2136]
>>> All running times: [867, 1022, 1457, 1342, 1436, 966, 1531]
>>> All running times: [915, 864, 1245, 1243, 948, 790, 1007]
My thought was that the nonlinear scaling is due to memory contention. Incidentally, earlier repetitions actually do better - this might be due to the fact that in different repetitions the arrays are allocated in different memory areas.
CASE 2: Next, I comment out the Foo[] arr = array line in the run method of the thread and allocate a new array in the run method itself: Foo[] arr = new Foo[1024]. Measurements:
>>> All running times: [2053, 1966, 2089, 1937, 2046, 1909, 2011]
>>> All running times: [1048, 1178, 1100, 1194, 1367, 1271, 1207]
>>> All running times: [578, 508, 589, 571, 617, 643, 645]
>>> All running times: [330, 299, 300, 322, 331, 324, 575]
This time, everything scales pretty much as expected. I wouldn't have imagined that the location where the array was allocated plays any role whatsoever, but obviously it does somehow. My thought was that the arrays were previously allocated so close to each other that some memory contention started happening.
CASE 3: To verify this assumption, I've uncommented the line Foo[] arr = array again, but this time initialized the array field to new Foo[32000] to ensure that the locations in memory being written to are sufficiently far from each other. So, here we're using the array allocated during the creation of the thread object again; the only difference from CASE 1 is that the array is bigger.
>>> All running times: [2113, 1983, 2430, 2485, 2333, 2359, 2463]
>>> All running times: [1172, 1106, 1163, 1181, 1142, 1169, 1188]
>>> All running times: [578, 677, 614, 604, 583, 637, 597]
>>> All running times: [343, 327, 320, 330, 353, 320, 320]
So, memory contention seems to be the cause of this.
The platform information:
Ubuntu Server 10.04.3 LTS
8 core Intel(R) Xeon(R) CPU X5355 @2.66GHz
~20GB ram
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
Question: This is obviously a memory-contention issue. But why is this happening?
Is escape analysis kicking in? If so, does it mean that the entire array is allocated on the stack when it is created in the run method in CASE 2? What are the exact conditions for this runtime optimization? Surely the array is not allocated on the stack for 1 million elements?
Even if the array were being allocated on the stack instead of on the heap, two array accesses by different threads should be separated by at least 512 * 4 bytes = 2 KB even in CASE 1, wherever the arrays are! That's definitely larger than any L1 cache line. If these effects are due to false sharing, how can writes to several totally independent cache lines affect performance this much? (One assumption here is that each array occupies a contiguous block of memory on the JVM, allocated when the array is created. I'm not sure this is valid. Another assumption is that array writes don't go all the way to memory, but to the L1 cache instead, as the Intel Xeon has a ccNUMA architecture - correct me if I'm wrong.)
Is it possible that each thread has its own local heap part where it independently allocates new objects, and this is the cause for lower contention when the array is allocated in the thread? If so, how is that area of heap garbage collected if references get shared?
Why has increasing the array size to ~32000 elements improved scalability (decreased memory contention)? What exactly in the memory hierarchy is the cause of this?
Please be precise and support your claims with references.
Thank you!
The entire runnable Java program:
import java.util.ArrayList;

class MultiStackJavaExperiment {

    final class Foo {
        int x = 0;
    }

    final class Worker extends Thread {
        Foo[] array = new Foo[1024];
        int sz;

        public Worker(int _sz) {
            sz = _sz;
        }

        public void run() {
            Foo[] arr = new Foo[1024];
            //Foo[] arr = array;
            loop(arr);
        }

        public void loop(Foo[] arr) {
            int i = 0;
            int pos = 512;
            Foo v = new Foo();
            while (i < sz) {
                if (i % 2 == 0) {
                    arr[pos] = v;
                    pos += 1;
                } else {
                    pos -= 1;
                    v = arr[pos];
                }
                i++;
            }
        }
    }

    public static void main(String[] args) {
        (new MultiStackJavaExperiment()).mainMethod(args);
    }

    int size = Integer.parseInt(System.getProperty("size"));
    int par = Integer.parseInt(System.getProperty("par"));

    public void mainMethod(String[] args) {
        int times = 0;
        if (args.length == 0) times = 1;
        else times = Integer.parseInt(args[0]);
        ArrayList<Long> measurements = new ArrayList<Long>();

        for (int i = 0; i < times; i++) {
            long start = System.currentTimeMillis();
            run();
            long end = System.currentTimeMillis();

            long time = (end - start);
            System.out.println(i + ") Running time: " + time + " ms");
            measurements.add(time);
        }

        System.out.println(">>>");
        System.out.println(">>> All running times: " + measurements);
        System.out.println(">>>");
    }

    public void run() {
        int sz = size / par;
        ArrayList<Thread> threads = new ArrayList<Thread>();

        for (int i = 0; i < par; i++) {
            threads.add(new Worker(sz));
            threads.get(i).start();
        }
        for (int i = 0; i < par; i++) {
            try {
                threads.get(i).join();
            } catch (Exception e) {}
        }
    }
}
In Java, memory for arrays is allocated on the heap. Reference types are stored in the heap area, and since arrays are reference types (they are created using the "new" keyword), they are stored in the heap area as well.
Java memory structure: the JVM defines various run-time data areas which are used during execution of a program. Some of these areas are created by the JVM, whereas others are created by the threads used in the program. The memory areas created by the JVM are destroyed only when the JVM exits.
JVMs allocate memory from the operating system on an as-needed basis. Generally, when the JVM starts, it allocates the configured minimum heap size (-Xms) to the running application. As the application requires more memory, the JVM allocates additional blocks until the maximum heap size (-Xmx) has been reached.
Solution
Run the JVM with the -XX:+UseCondCardMark flag, available only in JDK7. This solves the problem.
Explanation
Essentially, most managed-heap environments use card tables to mark the areas of memory into which writes have occurred. Such memory areas are marked as dirty in the card table once a write occurs. This information is needed for garbage collection - references in non-dirty memory areas don't have to be scanned. A card is a contiguous block of memory, typically 512 bytes. A card table typically has 1 byte for each card - if this byte is set, the card is dirty. This means that a card table of 64 bytes covers 64 * 512 = 32768 bytes of memory. And typically, the cache-line size today is 64 bytes.
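To see why writes that are kilobytes apart can still collide, here is an illustrative sketch of the card-index arithmetic (the addresses are made up; the 512-byte cards, 1 byte per card and 64-byte cache lines are the typical figures quoted above):
class CardTableMath {
    static final int CARD_SIZE = 512;   // bytes of heap covered by one card
    static final int CACHE_LINE = 64;   // bytes per cache line

    public static void main(String[] args) {
        // Two hypothetical heap addresses 2 KB apart, e.g. slots of two
        // different threads' arrays that were allocated close together.
        long addrA = 0x10000L;
        long addrB = addrA + 2048;

        long cardA = addrA / CARD_SIZE;  // index of the card byte for addrA
        long cardB = addrB / CARD_SIZE;  // index of the card byte for addrB

        // The card bytes are only 4 bytes apart, so they share one 64-byte
        // cache line of the card table.
        System.out.println("card bytes apart: " + (cardB - cardA));
        System.out.println("same card-table cache line: "
                + (cardA / CACHE_LINE == cardB / CACHE_LINE));
    }
}
In other words, any two writes whose target addresses fall within the same 64 * 512 = 32 KB stretch of heap dirty bytes that sit on the same cache line of the card table.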
So each time a write to an object field occurs, the byte of the corresponding card in the card table must be marked dirty. A useful optimization in single-threaded programs is to do this by simply marking the relevant byte unconditionally - doing a write every time. The alternative of first checking whether the byte is already set and writing only if it isn't requires an additional read and a conditional jump, which is slightly slower.
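To make the difference concrete, here is a plain-Java sketch of the two strategies (this is only an illustration of what the JIT-emitted write barrier does conceptually, not the actual HotSpot code; the table size and shift are assumptions matching the 512-byte cards above):
class CardMarkSketch {
    static final byte[] CARD_TABLE = new byte[1 << 20];
    static final int CARD_SHIFT = 9;  // 2^9 = 512-byte cards

    // Unconditional mark: always store the dirty byte. Cheap in
    // single-threaded code, but every field write publishes a store
    // to the card table's cache line.
    static void dirtyCardUnconditional(long address) {
        CARD_TABLE[(int) (address >>> CARD_SHIFT)] = 1;
    }

    // Conditional mark (what -XX:+UseCondCardMark enables): read first,
    // store only if the card isn't dirty yet. After the first write the
    // cache line holding the card byte is only ever read.
    static void dirtyCardConditional(long address) {
        int card = (int) (address >>> CARD_SHIFT);
        if (CARD_TABLE[card] != 1) {
            CARD_TABLE[card] = 1;
        }
    }
}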
However, unconditional marking can be catastrophic when multiple processors are writing to memory, because writes to neighbouring cards require writes to neighbouring bytes in the card table. So even though the memory areas being written to (the array entries above) are not on the same cache line - which would be the usual cause of memory contention - the dirty bytes being written to the card table are on the same cache line, and that is the real source of the contention.
What the above flag does is implement the card-table dirty-byte write by first checking whether the byte is already set, and setting it only if it isn't. This way, memory contention happens only during the first write to each card - after that, only reads of that cache line occur. Since the cache line is only read, it can be replicated across multiple processors and they don't have to synchronize to read it.
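For example, the CASE 1 command from the question can be re-run with only the flag added:
java -XX:+UseCondCardMark -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=1 org.scalapool.bench.MultiStackJavaExperiment 7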
I've observed that this flag increases the running time some 15-20% in the 1-thread case.
The -XX:+UseCondCardMark flag is explained in this blog post and this bug report.
The relevant concurrency mailing list discussion: Array allocation and access on the JVM.
I believe you need to reduce your code so it's not doing lots of incidental things which could be confusing matters. After reducing the code, it is clear to me that you are only accessing the same array location every time, i.e. position 512.
If you minimise your code and reuse your threads so you are not stopping/starting them, you get much more reproducible results.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MultiStackJavaExperiment {
    static final int size = Integer.getInteger("size", 500000000);

    public static void main(String... args) throws ExecutionException, InterruptedException {
        int par = 8;
        for (int s = 64; s <= 64 * 1024; s *= 2) {
            int times = args.length == 0 ? 1 : Integer.parseInt(args[0]);
            long[] measurements = new long[times];

            ExecutorService es = Executors.newFixedThreadPool(par);
            List<Future<?>> futures = new ArrayList<Future<?>>(times);
            for (int i = 0; i < times; i++) {
                long start = System.currentTimeMillis();
                final int sz = size / par;
                futures.clear();
                for (int j = 0; j < par; j++) {
                    final Object[] arr = new Object[s];
                    futures.add(es.submit(new Runnable() {
                        @Override
                        public void run() {
                            final int bits = 7, arraySize = 1 << bits;
                            int i = 0;
                            int pos = 32;
                            Object v = new Object();
                            while (i < sz) {
                                if (i % 2 == 0) {
                                    arr[pos] = v;
                                    pos += 1;
                                } else {
                                    pos -= 1;
                                    v = arr[pos];
                                }
                                i++;
                            }
                        }
                    }));
                }
                for (Future<?> future : futures)
                    future.get();
                long time = System.currentTimeMillis() - start;
                // System.out.println(i + ") Running time: " + time + " ms");
                measurements[i] = time;
            }
            es.shutdown();
            System.out.println("par = " + par + " arr.length= " + s
                    + " >>> All running times: " + Arrays.toString(measurements));
        }
    }
}
This shows the distance between accessed values matters. By allocating an array in each thread, you use different TLABs (Thread Local Allocation Buffers), which space out the data in separate blocks.
par = 8 arr.length= 64 >>> All running times: [539, 413, 444, 444, 457, 444, 456]
par = 8 arr.length= 256 >>> All running times: [398, 527, 514, 529, 445, 441, 445]
par = 8 arr.length= 1024 >>> All running times: [419, 507, 477, 422, 412, 452, 396]
par = 8 arr.length= 4096 >>> All running times: [316, 282, 250, 232, 242, 229, 238]
par = 8 arr.length= 16384 >>> All running times: [316, 207, 209, 212, 208, 208, 208]
par = 8 arr.length= 65536 >>> All running times: [211, 211, 208, 208, 208, 291, 206]
par = 8 arr.length= 262144 >>> All running times: [366, 210, 210, 210, 210, 209, 211]
par = 8 arr.length= 1048576 >>> All running times: [296, 211, 215, 216, 213, 211, 211]
If you move the array allocation inside the thread, you get:
par = 8 arr.length= 64 >>> All running times: [225, 151, 151, 150, 152, 153, 152]
par = 8 arr.length= 256 >>> All running times: [155, 151, 151, 151, 151, 151, 155]
par = 8 arr.length= 1024 >>> All running times: [153, 152, 151, 151, 151, 155, 152]
par = 8 arr.length= 4096 >>> All running times: [155, 156, 151, 152, 151, 155, 155]
par = 8 arr.length= 16384 >>> All running times: [154, 157, 152, 152, 158, 153, 153]
par = 8 arr.length= 65536 >>> All running times: [155, 157, 152, 184, 181, 154, 153]
par = 8 arr.length= 262144 >>> All running times: [240, 159, 166, 151, 172, 154, 160]
par = 8 arr.length= 1048576 >>> All running times: [165, 162, 163, 162, 163, 162, 163]
Turn off the TLAB with -XX:-UseTLAB and the same code gives you:
par = 8 arr.length= 64 >>> All running times: [608, 467, 467, 457, 468, 461, 428]
par = 8 arr.length= 256 >>> All running times: [437, 437, 522, 512, 522, 369, 535]
par = 8 arr.length= 1024 >>> All running times: [394, 395, 475, 525, 470, 440, 478]
par = 8 arr.length= 4096 >>> All running times: [347, 215, 238, 226, 236, 204, 271]
par = 8 arr.length= 16384 >>> All running times: [291, 157, 178, 151, 150, 151, 152]
par = 8 arr.length= 65536 >>> All running times: [163, 152, 162, 151, 159, 159, 154]
par = 8 arr.length= 262144 >>> All running times: [164, 172, 152, 169, 160, 161, 160]
par = 8 arr.length= 1048576 >>> All running times: [295, 153, 164, 153, 166, 154, 163]
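Assuming the reduced benchmark above is compiled as MultiStackJavaExperiment in the default package, the TLAB-disabled run can be reproduced with something like:
java -XX:-UseTLAB MultiStackJavaExperiment 7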