Efficiency of Multithreaded Loops

Greetings noble community,

I want to have the following loop:

for(i = 0; i < MAX; i++)
    A[i] = B[i] + C[i];

This will run in parallel on a shared-memory quad-core computer using threads. The two alternatives below are being considered for the code to be executed by these threads, where tid is the id of the thread: 0, 1, 2 or 3.

(for simplicity, assume MAX is a multiple of 4)

Option 1:

for(i = tid; i < MAX; i += 4)
    A[i] = B[i] + C[i];

Option 2:

for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i++)
    A[i] = B[i] + C[i];

My question is whether one of these is more efficient than the other, and why.

Francisco Maria Calisto asked Jan 27 '15 20:01



1 Answer

The second one is better than the first. Short answer: the second one minimizes false sharing.

Modern CPUs do not load bytes into the cache one at a time. They read memory in fixed-size batches called cache lines. When two threads modify different variables that sit on the same cache line, each write invalidates the other core's copy of that line, so the other core must reload the line before it can continue.
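To connect this to the two options in the question, here is a minimal sketch that counts how many cache lines of A are written by more than one thread under each scheme. It assumes 64-byte cache lines, 4-byte ints, and an array that starts on a cache-line boundary. None of these assumptions comes from the question, and the helper names (line_of, owner_interleaved, owner_blocked, shared_lines) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions (not from the question): 64-byte cache lines, 4-byte ints,
   and an array A that starts on a cache-line boundary. */
enum { LINE_BYTES = 64, NTHREADS = 4 };

/* Index of the cache line that holds element i of an int array. */
static size_t line_of(size_t i)
{
    return (i * sizeof(int)) / LINE_BYTES;
}

/* Option 1: thread tid handles i = tid, tid + 4, tid + 8, ... */
static int owner_interleaved(size_t i, size_t max)
{
    (void)max; /* unused, kept so both owner functions share a signature */
    return (int)(i % NTHREADS);
}

/* Option 2: thread tid handles one contiguous block of max / 4 elements. */
static int owner_blocked(size_t i, size_t max)
{
    assert(max % NTHREADS == 0); /* the question assumes MAX is a multiple of 4 */
    return (int)(i / (max / NTHREADS));
}

/* Count the cache lines that are written by more than one thread. */
static size_t shared_lines(size_t max, int (*owner)(size_t, size_t))
{
    size_t shared = 0;
    size_t i = 0;
    while (i < max) {
        size_t line = line_of(i);
        int first = owner(i, max);
        int mixed = 0;
        /* Walk every element that lives on this cache line. */
        for (; i < max && line_of(i) == line; i++) {
            if (owner(i, max) != first)
                mixed = 1;
        }
        shared += (size_t)mixed;
    }
    return shared;
}
```

Under these assumptions, with MAX = 1024 the interleaved scheme shares all 64 lines, while the blocked scheme shares none. When MAX/4 is not a multiple of 16, the blocked scheme can still share a line at a block boundary, but never more than a few lines in total.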

When would this happen?

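Basically, elements that are close together in memory end up in the same cache line. Neighboring elements of an array share a cache line, since an array is just a contiguous chunk of memory. That is why Option 1 is worse: every thread writes to elements adjacent to those of the other threads, whereas in Option 2 the threads can only collide at the boundaries of their blocks. Likewise, foo1 and foo2 below might be in the same cache line, since they are defined next to each other in the same class.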

class Foo {

private int foo1;
private int foo2;

}
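As a rough C analogue of the class above (not the Java layout itself, since the JVM is free to arrange fields as it likes), the sketch below shows two adjacent fields and a padded variant where each field gets its own cache line. The 64-byte line size and the struct names foo_packed and foo_padded are assumptions made for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size; check your CPU */

/* Adjacent fields: foo1 and foo2 will very likely share a cache line. */
struct foo_packed {
    int foo1;
    int foo2;
};

/* Each field is aligned to the start of its own (assumed) cache line,
   so a thread writing foo1 does not invalidate the line holding foo2. */
struct foo_padded {
    alignas(CACHE_LINE) int foo1;
    alignas(CACHE_LINE) int foo2;
};

/* Compile-time check that the two padded fields are a full line apart. */
static_assert(offsetof(struct foo_padded, foo2) -
              offsetof(struct foo_padded, foo1) >= CACHE_LINE,
              "padded fields must not share a cache line");
```

The trade-off is memory: the padded struct occupies two full cache lines instead of eight bytes, so this is only worth doing for fields that different threads write frequently.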

How bad is false sharing?

To quote Example 6 from the Gallery of Processor Cache Effects:

private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}

On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done. On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in 0.28 seconds!
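If you want to try the same experiment in C, a minimal pthread sketch might look like the following. It is an adaptation, not the code from the article: the names update_counter and run_counters and the job struct are hypothetical, the counter array is declared volatile so the compiler keeps every load and store, and the iteration count is a parameter so you can start small.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { NTHREADS = 4, NCOUNTERS = 1024 };

/* volatile so the compiler does not collapse the loop into one addition */
static volatile int s_counter[NCOUNTERS];

struct job {
    int position;    /* which counter this thread updates */
    long iterations; /* how many times to add 3 */
};

static void *update_counter(void *arg)
{
    const struct job *j = arg;
    for (long n = 0; n < j->iterations; n++)
        s_counter[j->position] = s_counter[j->position] + 3;
    return NULL;
}

/* Starts one thread per position, waits for all of them,
   and returns 0 on success or -1 if a thread could not be created. */
static int run_counters(const int positions[NTHREADS], long iterations)
{
    pthread_t threads[NTHREADS];
    struct job jobs[NTHREADS];
    int started = 0;
    int rc = 0;

    for (int t = 0; t < NTHREADS; t++) {
        assert(positions[t] >= 0 && positions[t] < NCOUNTERS);
        jobs[t].position = positions[t];
        jobs[t].iterations = iterations;
        if (pthread_create(&threads[t], NULL, update_counter, &jobs[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    /* Join only the threads that were actually started. */
    for (int t = 0; t < started; t++)
        pthread_join(threads[t], NULL);
    return rc;
}
```

To compare the two cases, call run_counters once with positions {0, 1, 2, 3} and once with {16, 32, 48, 64}, using 100000000 iterations, and time each call (for example with clock_gettime). Build with -pthread. The exact timings will depend on your CPU and compiler flags.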

How to detect false sharing?

Linux perf can be used to detect cache misses and can therefore help you analyze this kind of problem.

The analysis in CPU Cache Effects and Linux Perf uses perf to count L1 cache misses for almost the same code example as above:

Performance counter stats for './cache_line_test 0 1 2 3':
10,055,747 L1-dcache-load-misses     #    1.54% of all L1-dcache hits   [51.24%]
Performance counter stats for './cache_line_test 16 32 48 64':
  36,992 L1-dcache-load-misses     #    0.01% of all L1-dcache hits   [50.51%]

This shows that the number of L1 data-cache load misses drops from 10,055,747 to 36,992 without false sharing. The performance cost is not the L1 miss itself but what follows it: fetching the line from L2, L3, or main memory after it has been invalidated.

Is there some good practice in industry?

LMAX Disruptor is a high-performance inter-thread messaging library, and it is the default messaging system for intra-worker communication in Apache Storm. The underlying data structure is a simple ring buffer, but to make it fast it uses a lot of tricks to reduce false sharing.

For example, it defines the superclass RingBufferPad to create padding between elements in the RingBuffer:

abstract class RingBufferPad
{
    protected long p1, p2, p3, p4, p5, p6, p7;
}

Also, when it allocates memory for the buffer, it adds padding at both the front and the tail so that the buffer is not affected by data in adjacent memory:

this.entries   = new Object[sequencer.getBufferSize() + 2 * BUFFER_PAD];

source

You probably want to learn more about all the magic tricks. Take a look at one of the author's posts: Dissecting the Disruptor: Why it's so fast

qqibrow answered Sep 24 '22 15:09