
c++ openmp false-sharing on aligned array example

I would like to see the effect of false sharing. To do so, I tried to design a small experiment, but I got unexpected results.

I have an array containing 100M integers. Consider it as an m x n matrix. One thread changes the odd-indexed rows and the other thread changes the even-indexed rows.

Experiment A: The number of columns is 16, so each row is 64 bytes, which is exactly my cache-line size. Since each thread processes exactly one cache line at a time, there should be no false sharing. Therefore, I expect close to a 100% speedup.

Experiment B: The number of columns is 8. Each thread changes 32 bytes at a time, which is half a cache line. For example, when thread 1 processes row 33, the cache line should be transferred over from thread 0, because thread 0 has just processed row 32, which is in the same cache line (or vice versa; the order does not matter). Because of this communication, the speedup should be low.
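To make the layout concrete, here is a small standalone sketch (my own addition, not part of the experiment itself; it assumes 64-byte cache lines and 4-byte ints) that prints which cache line each row occupies for both column counts:

#include <cstdio>

int main() {
    const long LINE = 64;  // assumed cache-line size in bytes
    for(int col_count : {16, 8}) {
        printf("col_count = %d:\n", col_count);
        for(long row = 0; row < 4; row++) {
            long first_byte = row * col_count * (long) sizeof(int);
            long last_byte  = first_byte + col_count * (long) sizeof(int) - 1;
            printf("  row %ld -> cache line(s) %ld..%ld\n",
                   row, first_byte / LINE, last_byte / LINE);
        }
    }
    return 0;
}

With 16 columns, every row maps to its own cache line; with 8 columns, rows 0 and 1 (owned by different threads) land in the same line, which is exactly the false-sharing setup described above. Here is the code of the experiment: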

#include <iostream>
#include <cstdlib>   // atoi, aligned_alloc, free
#include <omp.h>

using namespace std;

int main(int argc, char** argv) {

    if(argc != 3) {
        cout << "Usage: " << argv[0] << " <iteration> <col_count>" << endl;
        return 1;
    }

    int thread_count = omp_get_max_threads();
    int iteration = atoi(argv[1]);
    int col_count = atoi(argv[2]);
    int arr_size = 100000000;

    // 64-byte alignment, so each 16-int row starts on a cache-line boundary.
    // Note: A is never initialized, so the parity test below reads
    // indeterminate values; this does not affect the timing.
    int* A = (int*) aligned_alloc(16 * sizeof(int), arr_size * sizeof(int));

    int row_count = arr_size / col_count;
    int row_count_per_thread = row_count / thread_count;

    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();

        long long total = 1ll * iteration * row_count_per_thread * col_count;
        printf("%lld\n", total);

        for(int t = 0; t < iteration; t++) {

            // Rows are interleaved across threads: thread k takes rows
            // k, k + thread_count, k + 2*thread_count, ...
            for(int i = 0; i < row_count_per_thread; i++) {

                int start = (i * thread_count + thread_id) * col_count;
                for(int j = start; j < start + col_count; j++) {

                    if(A[j] % 2 == 0)
                        A[j] += 3;
                    else
                        A[j] += 1;
                }
            }
        }
    }

    free(A);
    return 0;
}

I run this code with different configurations in the following way:

time taskset -c 0-1 ./run 100 16

Here are the results for 100 iterations:

Thread      Column      Optimization        Time (secs)
_______________________________________________________
1           16          O3                  7.6
1           8           O3                  7.7
2           16          O3                  7.7
2           8           O3                  7.7

1           16          O0                  35.9
1           8           O0                  34.3
2           16          O0                  19.3
2           8           O0                  18.2

As you can see, although O3 gives the best absolute times, its results are strange because increasing the number of threads gives no speedup at all. The O0 results are more interpretable to me.

The real question: Look at the last 2 lines. In both cases I got almost 100% speedup; however, I expected the execution time of experiment B to be much higher, since it has a false-sharing issue. What is wrong with my experiment or my understanding?

I compiled it with g++ -std=c++11 -Wall -fopenmp -O0 -o run -Iinc $(SOURCE) and with g++ -std=c++11 -Wall -fopenmp -O3 -o run -Iinc $(SOURCE).

Let me know if my problem is not clear or needs more detail.


Update: Specs:

MemTotal:        8080796 kB
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               71
Model name:          Intel(R) Core(TM) i7-5700HQ CPU @ 2.70GHz
Stepping:            1
CPU MHz:             2622.241
CPU max MHz:         3500,0000
CPU min MHz:         800,0000
BogoMIPS:            5387.47
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-7

Update 2: I have tried different iteration_count and arr_size parameters so that the array fits in the L2 or L1 cache, while keeping the total number of element updates constant. But the results are still the same.

Thank you.

asked Nov 13 '18 by Seljuk Gülcan

1 Answer

Your -O3 timing seems to be consistent with the single-channel memory access speed of your processor. You could probably get up to 2x better speed with a 2-channel memory configuration, but that is unlikely to introduce any other difference to your results. Bear in mind that on your processor there is a single L3 cache shared between the cores, so any false sharing will most likely be resolved at the L3 cache level and won't put additional load on the external memory bus.
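As a rough back-of-envelope check (my arithmetic, not part of the original answer; it assumes the whole array is streamed through memory once per iteration):

    100,000,000 ints x 4 bytes x 100 iterations = 40 GB read + 40 GB written
    80 GB / 7.7 s ~ 10.4 GB/s

which is indeed in the ballpark of single-channel DDR3 bandwidth (12.8 GB/s peak for DDR3-1600).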

There are a lot more problems (than just "slow" memory access) with your code that may prevent you from seeing the effects of false-sharing.

First, it's quite unlikely that both your threads will compete for exactly the same cache line, given the timing unpredictability involved in thread scheduling.

Second, even if they do have a conflict, it will be temporary: any factor that causes an asymmetric slow-down will make the "slower" thread lag behind until it is out of the conflicting memory range.

Third, if they happen to run on two hardware threads of the same core, they will access the very same caches (hyper-threads share the core's L1 and L2), and there will be no cache-line conflicts at all.

To "fix" all of this, you need more threads (or threads bound to particular cores) and a much tighter memory area for possible conflicts. The "best" results will be if your threads compete for just one cache line of memory.

answered Oct 27 '22 by Kit.