Efficiency of Multithreaded Loops

Greetings noble community,

I want to have the following loop:

for(i = 0; i < MAX; i++)
    A[i] = B[i] + C[i];

This will run in parallel on a shared-memory quad-core computer using threads. The two alternatives below are being considered for the code to be executed by these threads, where tid is the id of the thread: 0, 1, 2 or 3.

(for simplicity, assume MAX is a multiple of 4)

Option 1:

for(i = tid; i < MAX; i += 4)
    A[i] = B[i] + C[i];

Option 2:

for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i++)
    A[i] = B[i] + C[i];

My question is whether one of them is more efficient than the other, and why?

asked Jan 27 '15 20:01

Francisco Maria Calisto

1 Answer

The second one is better than the first one. Simple answer: the second one minimizes false sharing.

Modern CPUs do not load bytes into the cache one by one. They read memory in fixed-size batches called cache lines. When two threads on different cores modify different variables that sit on the same cache line, each write invalidates the other core's copy of that line, so the other core must reload it before it can continue.

When would this happen?

Basically, elements that are close together in memory end up in the same cache line. Neighboring elements of an array share a cache line, since an array is just a contiguous chunk of memory. Likewise, foo1 and foo2 below might share a cache line, since they are declared next to each other in the same class.

class Foo {
    private int foo1;
    private int foo2;
}
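To tie this back to the two loops in the question, here is a minimal sketch that counts how many cache lines of A are written by more than one thread under each option. It assumes 64-byte cache lines, 4-byte elements (16 elements per line) and that A[0] starts on a cache-line boundary; the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CacheLineSharing {
    static final int THREADS = 4;
    // Assumption: 64-byte cache lines and 4-byte elements, so 16 elements per line,
    // with A[0] aligned to the start of a line.
    static final int ELEMENTS_PER_LINE = 16;

    /** Option 1: thread tid writes A[tid], A[tid + 4], A[tid + 8], ... */
    static int sharedLinesOption1(int max) {
        Map<Integer, Set<Integer>> writers = new HashMap<>();
        for (int tid = 0; tid < THREADS; tid++) {
            for (int i = tid; i < max; i += THREADS) {
                record(writers, i, tid);
            }
        }
        return countShared(writers);
    }

    /** Option 2: thread tid writes one contiguous block of max / 4 elements. */
    static int sharedLinesOption2(int max) {
        Map<Integer, Set<Integer>> writers = new HashMap<>();
        int chunk = max / THREADS;
        for (int tid = 0; tid < THREADS; tid++) {
            for (int i = tid * chunk; i < (tid + 1) * chunk; i++) {
                record(writers, i, tid);
            }
        }
        return countShared(writers);
    }

    // Remember that thread tid writes to the cache line holding A[index].
    private static void record(Map<Integer, Set<Integer>> writers, int index, int tid) {
        writers.computeIfAbsent(index / ELEMENTS_PER_LINE, line -> new HashSet<>()).add(tid);
    }

    // A line is "shared" when at least two different threads write to it.
    private static int countShared(Map<Integer, Set<Integer>> writers) {
        int shared = 0;
        for (Set<Integer> tids : writers.values()) {
            if (tids.size() > 1) {
                shared++;
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        int max = 1024;
        System.out.println("Option 1 shared lines: " + sharedLinesOption1(max)); // 64
        System.out.println("Option 2 shared lines: " + sharedLinesOption2(max)); // 0
    }
}
```

Under these assumptions, with MAX = 1024 every one of the 64 lines is written by all four threads in Option 1, and none is shared in Option 2. If MAX/4 is not a multiple of 16 (for example MAX = 1000), Option 2 shares only the lines at the three block boundaries.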

How bad is false sharing?

To quote Example 6 from the Gallery of Processor Cache Effects:

private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}

On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done. On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in 0.28 seconds!
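If you want to try this yourself, the following is a rough Java port of that experiment. It is only a sketch: the class and method names are invented, the timings depend heavily on the hardware, and the JIT compiler may optimize the inner loop in ways that shrink or hide the effect, so do not expect to reproduce the numbers quoted above.

```java
import java.util.Arrays;

final class FalseSharingDemo {
    /**
     * Starts one thread per entry in positions. Each thread repeatedly adds 3
     * to its own slot of a shared int array. Prints the elapsed time and
     * returns the final array so the result can be checked.
     */
    static int[] run(int[] positions, int iterations) {
        int[] counter = new int[1024];
        Thread[] threads = new Thread[positions.length];
        for (int t = 0; t < positions.length; t++) {
            final int position = positions[t];
            threads[t] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    counter[position] = counter[position] + 3;
                }
            });
        }

        long start = System.nanoTime();
        for (Thread thread : threads) {
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join(); // join also makes the thread's writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(Arrays.toString(positions) + " took " + elapsedMs + " ms");
        return counter;
    }

    public static void main(String[] args) {
        int iterations = 100_000_000;
        run(new int[] {0, 1, 2, 3}, iterations);     // slots likely on one cache line
        run(new int[] {16, 32, 48, 64}, iterations); // slots 64 bytes apart
    }
}
```

Each thread only ever touches its own slot, so the final values are the same in both runs; only the elapsed time may differ.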

How to detect false sharing?

Linux perf can be used to count cache misses, which helps you analyze this kind of problem.

The analysis in CPU Cache Effects and Linux Perf uses perf to measure L1 cache misses for almost the same code as the example above:

Performance counter stats for './cache_line_test 0 1 2 3':
10,055,747 L1-dcache-load-misses     #    1.54% of all L1-dcache hits   [51.24%]

Performance counter stats for './cache_line_test 16 32 48 64':
  36,992 L1-dcache-load-misses     #    0.01% of all L1-dcache hits   [50.51%]

This shows that the number of L1 data-cache load misses drops from 10,055,747 to 36,992 without false sharing. The performance cost is not in the L1 miss itself; it is in what follows each miss: going to the L2 and L3 caches, and then to main memory, to reload the line.

Are there good practices in industry?

LMAX Disruptor is a high-performance inter-thread messaging library, and it is the default messaging system for intra-worker communication in Apache Storm. The underlying data structure is a simple ring buffer, but to make it fast, it uses a lot of tricks to reduce false sharing.

For example, it defines the superclass RingBufferPad to put padding in front of the fields of RingBuffer:

abstract class RingBufferPad
{
    protected long p1, p2, p3, p4, p5, p6, p7;
}

Also, when it allocates memory for the buffer, it adds padding at both the front and the tail so that the entries are not affected by data in adjacent memory:

this.entries   = new Object[sequencer.getBufferSize() + 2 * BUFFER_PAD];
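The same padding idea can be applied to your own per-thread data. Below is a minimal sketch in the spirit of RingBufferPad, not code taken from the Disruptor: all class names are hypothetical, and it relies on the JVM laying out superclass fields before subclass fields, which is common but not guaranteed by the Java specification.

```java
// Hypothetical names throughout; this is an illustration, not the Disruptor's code.
// Seven longs (56 bytes) of padding placed before the hot field.
class LeftPad {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

// The field that is actually written by a thread.
class PaddedValue extends LeftPad {
    protected volatile long value;
}

// Seven more longs of padding placed after the hot field.
class RightPad extends PaddedValue {
    protected long p9, p10, p11, p12, p13, p14, p15;
}

final class PaddedCounter extends RightPad {
    // Safe only with a single writer thread per counter: the increment is not atomic.
    void increment() {
        value = value + 1;
    }

    long get() {
        return value;
    }
}

final class PaddedCounterDemo {
    /** Gives each thread its own padded counter and returns the final totals. */
    static long[] run(int threads, int iterations) {
        PaddedCounter[] counters = new PaddedCounter[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            PaddedCounter counter = new PaddedCounter();
            counters[t] = counter;
            workers[t] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        long[] totals = new long[threads];
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
            totals[t] = counters[t].get();
        }
        return totals;
    }

    public static void main(String[] args) {
        long[] totals = run(4, 1_000_000);
        for (long total : totals) {
            System.out.println(total);
        }
    }
}
```

Splitting the padding across a small class hierarchy is meant to stop the JVM from reordering the padding fields away from the hot field, assuming a typical field layout.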

You probably want to learn more about all the magic tricks. Take a look at one of the authors' posts: Dissecting the Disruptor: Why it's so fast.

answered Sep 24 '22 15:09

qqibrow


Tags: java, c++, performance, multithreading, shared-memory