I've been running some tests on OpenMP and wrote this program, which should not scale because of false sharing on the array "sum". The problem is that it does scale. Even "worse": I can't explain the speedup I see going from 2 threads to 4 threads with the Intel compilers. But the main question is: why does it scale so well when it should exhibit false sharing?
#include <iostream>
#include <chrono>
#include <array>
#include <omp.h>

int main(int argc, const char *argv[])
{
    const auto nb_threads = std::size_t{4};
    omp_set_num_threads(nb_threads);
    const auto num_steps = std::size_t{1000000000};
    const auto step = double{1.0 / num_steps};
    auto sum = std::array<double, nb_threads>{0.0};
    std::size_t actual_nb_threads;
    auto start_time = std::chrono::high_resolution_clock::now();
    #pragma omp parallel
    {
        const auto id = static_cast<std::size_t>(omp_get_thread_num());
        if (id == 0) {
            // This is needed because OMP might give us fewer threads
            // than the number of threads requested
            actual_nb_threads = omp_get_num_threads();
        }
        // Each thread handles the interleaved indices id, id + nb_threads, ...
        for (auto i = std::size_t{id}; i < num_steps; i += nb_threads) {
            auto x = double{(i + 0.5) * step};
            sum[id] += 4.0 / (1.0 + x * x);
        }
    }
    auto pi = double{0.0};
    for (auto id = std::size_t{0}; id < actual_nb_threads; id++) {
        pi += step * sum[id];
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(end_time - start_time).count();
    std::cout << "Pi: " << pi << std::endl;
    std::cout << "Time: " << time / 1.0e9 << " seconds" << std::endl;
    std::cout << "Total nb of threads actually used: " << actual_nb_threads << std::endl;
    return 0;
}
In general, false sharing can be reduced using the following techniques:
- Make use of private or threadprivate data as much as possible.
- Use the compiler's optimization features to eliminate memory loads and stores.
- Pad data structures so that each thread's data resides on a different cache line.
This occurs because cache coherency is maintained on a cache-line basis and not for individual variables or elements. With false sharing, a thread is forced to fetch a more recent copy of a cache line from memory, even though the variable it is attempting to access has not been modified.
In some cases, a change in the way the data is allocated can reduce false sharing. In other cases, changing the mapping of iterations to threads, giving each thread more work per chunk (by changing the chunksize value) can also lead to a reduction in false sharing.
As I understand it, true sharing refers to the problem of multiple cores frequently writing to the same shared variable, while false sharing refers to multiple cores writing to different variables that are on the same cache line.
That code definitely could exhibit false sharing, if the compiler chose to implement it that way. But that would be a silly thing for the compiler to do.
In the first loop, each thread only accesses one element of sum. There's no reason to perform num_steps writes to the actual stack memory backing that element; it's much faster to keep the running value in a register and write it back once the for-loop is over. Since the array is neither volatile nor atomic, there's nothing stopping the compiler from behaving this way.
And, of course, in the second loop there's no writing to the array, so no false sharing.