
Why is serial execution taking less time than parallel? [duplicate]

I have to add two vectors and compare serial performance against parallel performance. However, my parallel code seems to take longer to execute than the serial code.

Could you please suggest changes to make the parallel code faster?

#include <iostream>
#include <time.h>
#include "omp.h"
#define ull unsigned long long

using namespace std;

void parallelAddition (ull N, const double *A, const double *B, double *C)
{
    ull i;

    #pragma omp parallel for shared (A,B,C,N) private(i) schedule(static)
    for (i = 0; i < N; ++i)
    {
        C[i] = A[i] + B[i];
    }
}

int main(){

    ull n = 100000000;
    double* A = new double[n];
    double* B = new double[n];
    double* C = new double[n];

    double time_spent = 0.0;

    for(ull i = 0; i<n; i++)
    {
        A[i] = 1;
        B[i] = 1;
    }

    //PARALLEL
    clock_t begin = clock();
    parallelAddition(n, &A[0], &B[0], &C[0]);
    clock_t end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;

    cout<<"time elapsed in parallel : "<<time_spent<<endl;

    //SERIAL
    time_spent = 0.0;
    for(ull i = 0; i<n; i++)
    {
        A[i] = 1;
        B[i] = 1;
    }
    begin = clock();
    for (ull i = 0; i < n; ++i)
    {
        C[i] = A[i] + B[i];
    }
    end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    cout<<"time elapsed in serial : "<<time_spent;
    return 0;
}

These are the results:

time elapsed in parallel : 0.824808

time elapsed in serial : 0.351246

I've read in another thread that there are factors like thread spawning and resource allocation, but I don't know what to change to get the expected result.


EDIT:

Thanks! The answers from @zulan and @Daniel Langr actually helped!

I used omp_get_wtime() instead of clock(). It turns out that clock() measures the cumulative CPU time of all threads, whereas omp_get_wtime() measures the wall-clock time elapsed from one arbitrary point to another.

This answer also addresses the question well: https://stackoverflow.com/a/10874371/4305675

Here's the fixed code:

void parallelAddition (ull N, const double *A, const double *B, double *C)
{
    ....
}

int main(){

    ....


    //PARALLEL
    double begin = omp_get_wtime();
    parallelAddition(n, &A[0], &B[0], &C[0]);
    double end = omp_get_wtime();
    time_spent += (double)(end - begin);

    cout<<"time elapsed in parallel : "<<time_spent<<endl;



    ....

    //SERIAL
    begin = omp_get_wtime();
    for (ull i = 0; i < n; ++i)
    {
        C[i] = A[i] + B[i];
    }
    end = omp_get_wtime();
    time_spent += (double)(end - begin);
    cout<<"time elapsed in serial : "<<time_spent;
    return 0;
}

RESULT AFTER CHANGES:

time elapsed in parallel : 0.204763

time elapsed in serial : 0.351711

asked Oct 29 '19 by Nitish Prajapati


1 Answer

There are multiple factors that influence your measurements:

  1. Use omp_get_wtime() as @zulan suggested; otherwise, you may actually be measuring combined CPU time instead of wall time.

  2. Threading has some overhead and typically does not pay off for short calculations. You may want to use a larger n.

  3. "Touch" data in C array before running parallelAddition. Otherwise, the memory pages are actually allocated from OS inside parallelAddition. Easy fix since C++11: double* C = new double[n]{};.

I tried your program with n set to 1G, and the last change alone reduced the runtime of parallelAddition from 1.54 to 0.94 [s] with 2 threads. The serial version took 1.83 [s], so the speedup with 2 threads was 1.95, which is pretty close to ideal.

Other considerations:

  • Generally, if you profile something, make sure that the program has some observable effect; otherwise, the compiler may optimize a lot of code away. Your array addition has no observable effect (see the sketch after this list).

  • Add some form of the restrict keyword to the C parameter. Without it, the compiler might not be able to vectorize the loop (also shown in the sketch after this list).

  • If you are on a multi-socket system, take care about thread affinity and NUMA effects. On my dual-socket system, the parallel version with 2 threads took 0.94 [s] (as mentioned above) when the threads were restricted to a single NUMA node (numactl -N 0 -m 0). Without numactl, it took 1.35 [s], i.e. 1.44 times longer.
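
A hedged sketch pulling the last two bullets together, assuming GCC or Clang where __restrict is the available spelling of restrict in C++: the pointers are declared non-aliasing, and a checksum of C is printed so the addition has an observable effect that the compiler cannot drop.

#include <iostream>
#include <omp.h>
#define ull unsigned long long

// __restrict tells the compiler that A, B and C never alias,
// which makes it easier to vectorize the loop.
void parallelAddition(ull N, const double* __restrict A,
                      const double* __restrict B, double* __restrict C)
{
    #pragma omp parallel for schedule(static)
    for (ull i = 0; i < N; ++i)
        C[i] = A[i] + B[i];
}

int main() {
    const ull n = 100000000;
    double* A = new double[n]{};
    double* B = new double[n]{};
    double* C = new double[n]{};   // pages touched up front (see point 3 above)

    for (ull i = 0; i < n; ++i) { A[i] = 1; B[i] = 1; }

    double begin = omp_get_wtime();
    parallelAddition(n, A, B, C);
    double end = omp_get_wtime();

    // Summing and printing C gives the program an observable effect,
    // so the compiler cannot optimize the addition away.
    double checksum = 0.0;
    for (ull i = 0; i < n; ++i) checksum += C[i];

    std::cout << "time elapsed in parallel : " << (end - begin)
              << " s, checksum : " << checksum << std::endl;

    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}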

answered Sep 19 '22 by Daniel Langr