I've been running some tests on OpenMP and wrote this program, which should not scale because of false sharing on the array "sum". The problem is that it does scale. Even "worse": I can't explain the speedup I see going from 2 threads to 4 threads with the Intel compilers. But the main question is: why does it scale so well when it should exhibit false sharing?
#include <iostream>
#include <chrono>
#include <array>
#include <omp.h>

int main(int argc, const char *argv[])
{
    const auto nb_threads = std::size_t{4};
    omp_set_num_threads(nb_threads);
    const auto num_steps = std::size_t{1000000000};
    const auto step = double{1.0 / num_steps};
    auto sum = std::array<double, nb_threads>{0.0};
    std::size_t actual_nb_threads;
    auto start_time = std::chrono::high_resolution_clock::now();
    #pragma omp parallel
    {
        const auto id = static_cast<std::size_t>(omp_get_thread_num());
        if (id == 0) {
            // This is needed because OMP might give us fewer threads
            // than the number of threads requested
            actual_nb_threads = omp_get_num_threads();
        }
        // Each thread handles the interleaved indices id, id + nb_threads, ...
        for (auto i = std::size_t{id}; i < num_steps; i += nb_threads) {
            auto x = double{(i + 0.5) * step};
            sum[id] += 4.0 / (1.0 + x * x);
        }
    }
    auto pi = double{0.0};
    for (auto id = std::size_t{0}; id < actual_nb_threads; id++) {
        pi += step * sum[id];
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(end_time - start_time).count();
    std::cout << "Pi: " << pi << std::endl;
    std::cout << "Time: " << time / 1.0e9 << " seconds" << std::endl;
    std::cout << "Total nb of threads actually used: " << actual_nb_threads << std::endl;
    return 0;
}
In general, false sharing can be reduced using the following techniques:
- Make use of private or threadprivate data as much as possible.
- Use the compiler's optimization features to eliminate memory loads and stores.
- Pad data structures so that each thread's data resides on a different cache line.
This occurs because cache coherency is maintained on a cache-line basis and not for individual variables or elements. With false sharing, a thread is forced to fetch a more recent copy of a cache line from memory, even though the variable it is attempting to access has not been modified.
In some cases, a change in the way the data is allocated can reduce false sharing. In other cases, changing the mapping of iterations to threads, giving each thread more work per chunk (by changing the chunksize value) can also lead to a reduction in false sharing.
As I understand it, true sharing refers to the problem of multiple cores frequently writing to the same shared variable, while false sharing refers to multiple cores writing to different variables that are on the same cache line.
That code definitely could exhibit false sharing, if the compiler chose to implement it that way. But that would be a silly thing for the compiler to do.
In the first loop, each thread only accesses one element of sum. There's no reason to perform num_steps writes to the actual stack memory backing that element; it's much faster to keep the running value in a register and write it back once the for-loop is over. Since the array is neither volatile nor atomic, there's nothing stopping the compiler from behaving this way.
And, of course, in the second loop there's no writing to the array, so no false sharing.