
Differences in calculation accuracy between single-threaded and multi-threaded (OpenMP) modes

Can anybody explain the difference in the calculation results between single-threaded and multi-threaded mode?

Here is an example of an approximate calculation of pi:

#include <cstdio>   // printf
#include <cstdlib>  // system
#include <ctime>    // clock
#include <cmath>    // std::sqrt
#include <ppl.h>    // Concurrency::parallel_for, Concurrency::combinable

const int itera(1000000000);

int main()
{
    printf("PI calculation \nconst int itera = 1000000000\n\n");

    clock_t start, stop;

    //Single thread
    start = clock();
    double summ_single(0);
    for (int n = 1; n < itera; n++)
    {
        summ_single += 6.0 / (static_cast<double>(n)* static_cast<double>(n));
    };
    stop = clock();
    printf("Time single thread             %f\n", (double)(stop - start) / 1000.0);


    //Multithread with OMP
    //Activate OMP in Project settings, C++, Language
    start = clock();
    double summ_omp(0);
#pragma omp parallel for reduction(+:summ_omp)
    for (int n = 1; n < itera; n++)
    {
        summ_omp += 6.0 / (static_cast<double>(n)* static_cast<double>(n));
    };
    stop = clock();
    printf("Time OMP parallel              %f\n", (double)(stop - start) / 1000.0);


    //Multithread with Concurrency::parallel_for
    start = clock();
    Concurrency::combinable<double> piParts;
    Concurrency::parallel_for(1, itera, [&piParts](int n)
    {
        piParts.local() += 6.0 / (static_cast<double>(n)* static_cast<double>(n)); 
    }); 

    double summ_Conparall(0);
    piParts.combine_each([&summ_Conparall](double locali)
    {
        summ_Conparall += locali;
    });
    stop = clock();
    printf("Time Concurrency::parallel_for %f\n", (double)(stop - start) / 1000.0);

    printf("\n");
    printf("pi single = %15.12f\n", std::sqrt(summ_single));
    printf("pi omp    = %15.12f\n", std::sqrt(summ_omp));
    printf("pi comb   = %15.12f\n", std::sqrt(summ_Conparall));
    printf("\n");

    system("PAUSE");

}

And the results:

PI calculation VS2010 Win32
Time single thread 5.330000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 11.103000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592651497


PI calculation VS2013 Win32
Time single thread 5.200000
Time OMP parallel 1.291000
Time Concurrency::parallel_for 7.413000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592647841


PI calculation VS2010 x64
Time single thread 5.190000
Time OMP parallel 1.036000
Time Concurrency::parallel_for 7.120000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592649319


PI calculation VS2013 x64
Time single thread 5.230000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 5.326000

pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592648489

The tests were run on AMD and Intel CPUs under Windows 7 x64.

What is the reason for the difference between the single-threaded and multi-threaded pi calculations? Why is the result of the Concurrency::parallel_for calculation not the same across different builds (compiler, 32/64-bit platform)?

P.S. Visual Studio Express doesn't support OpenMP.

asked Oct 16 '25 by Märs

1 Answer

Floating-point addition is non-associative because every operation rounds its result, so the order of operations matters. Having your parallel program give results different from the serial version is normal; understanding and dealing with it is part of the art of writing (portable) parallel code. The discrepancy is exacerbated between the 32-bit and 64-bit builds: in 32-bit mode the VS compiler uses x87 instructions, and the x87 FPU performs all operations with an internal precision of 80 bits, while in 64-bit mode SSE math is used.
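
As a small illustration (my own, not part of the original answer), regrouping the same three additions already changes the result:

#include <cstdio>

int main()
{
    // Floating-point addition rounds after every step, so it is not
    // associative: (a + b) + c can differ from a + (b + c).
    double a = 1e16, b = -1e16, c = 1.0;

    double left  = (a + b) + c;  // 0.0 + 1.0 -> 1.0
    double right = a + (b + c);  // 1e16 + (-1e16) -> 0.0, because
                                 // -1e16 + 1.0 rounds back to -1e16

    printf("(a + b) + c = %f\n", left);   // 1.000000
    printf("a + (b + c) = %f\n", right);  // 0.000000
}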

In the serial case, one thread computes s1+s2+...+sN, where N is the number of terms in the expansion.

In the OpenMP case there are n partial sums, where n is the number of OpenMP threads. Which terms go into each partial sum depends on how the iterations are distributed among the threads. The default for many OpenMP implementations is static scheduling, which means that thread 0 (the main thread) computes ps0 = s1 + s2 + ... + sN/n; thread 1 computes ps1 = sN/n+1 + sN/n+2 + ... + s2N/n; and so on. In the end the reduction combines those partial sums in some unspecified order.
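
A minimal sketch of that regrouping (my own illustration; the thread count of 4 is a hypothetical value, and a real static schedule may split the remainder slightly differently):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    const long long itera = 1000000000;
    const int nthreads = 4;                         // assumed thread count

    // One contiguous block of iterations per "thread", the way a static
    // OpenMP schedule would assign them.
    std::vector<double> partial(nthreads, 0.0);
    const long long chunk = (itera - 1 + nthreads - 1) / nthreads;

    for (int t = 0; t < nthreads; ++t)
    {
        long long first = 1 + t * chunk;
        long long last  = std::min(first + chunk, itera);
        for (long long n = first; n < last; ++n)
            partial[t] += 6.0 / (static_cast<double>(n) * static_cast<double>(n));
    }

    // The "reduction": combine the partial sums. The grouping differs
    // from the purely serial loop, so the last bits of the result differ.
    double summ = 0.0;
    for (int t = 0; t < nthreads; ++t)
        summ += partial[t];

    printf("pi (regrouped) = %15.12f\n", std::sqrt(summ));
}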

The parallel_for case is very similar to the OpenMP one. The difference is that by default the iterations are distributed dynamically (see the documentation for auto_partitioner), so each partial sum contains a more or less random selection of terms. This not only gives a slightly different result, it gives a slightly different result on each execution, i.e. two consecutive parallel_for's with the same number of threads might differ a bit. If you replace the partitioner with an instance of simple_partitioner and set the chunk size to itera / number-of-threads, you should get the same result as in the OpenMP case, provided the reduction is performed in the same way.
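
A sketch of that change (my own; it assumes a PPL version whose parallel_for accepts a partitioner, i.e. VS2012 or later, and a hypothetical thread count of 4):

#include <cmath>
#include <cstdio>
#include <ppl.h>

int main()
{
    const int itera = 1000000000;
    const int nthreads = 4;                        // assumed thread count

    Concurrency::combinable<double> piParts;
    Concurrency::parallel_for(1, itera,
        [&piParts](int n)
        {
            piParts.local() += 6.0 / (static_cast<double>(n) * static_cast<double>(n));
        },
        // Fixed chunk size: each worker gets contiguous blocks, which
        // mimics the static OpenMP schedule instead of the default
        // auto_partitioner's dynamic splitting.
        Concurrency::simple_partitioner(itera / nthreads));

    double summ_Conparall = 0.0;
    piParts.combine_each([&summ_Conparall](double local)
    {
        summ_Conparall += local;
    });

    printf("pi (simple_partitioner) = %15.12f\n", std::sqrt(summ_Conparall));
}

Even then, combine_each visits the per-thread values in an unspecified order, so bit-for-bit agreement with the OpenMP reduction is not guaranteed.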

You could use Kahan summation and implement your own reduction, also using Kahan summation. Then the parallel codes should produce the same (or a much more similar) result as the serial one.
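
A minimal sketch of Kahan (compensated) summation for the serial loop (my own illustration; the same accumulate/combine steps would be applied to each per-thread partial sum and again when combining them):

#include <cmath>
#include <cstdio>

// Compensated (Kahan) summation: a running compensation term recovers
// the low-order bits that a plain += would lose at every step.
struct KahanSum
{
    double sum;
    double comp;

    KahanSum() : sum(0.0), comp(0.0) {}

    void add(double value)
    {
        double y = value - comp;   // apply the stored correction
        double t = sum + y;        // low-order bits of y may be lost here
        comp = (t - sum) - y;      // recover what was lost
        sum = t;
    }
};

int main()
{
    const int itera = 1000000000;

    KahanSum s;
    for (int n = 1; n < itera; ++n)
        s.add(6.0 / (static_cast<double>(n) * static_cast<double>(n)));

    printf("pi (Kahan) = %15.12f\n", std::sqrt(s.sum));
}

Note that this relies on the compiler not reassociating the floating-point operations; MSVC's default /fp:precise preserves it, while /fp:fast may optimize the compensation away.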

answered Oct 19 '25 by Hristo Iliev

