OpenMP embarrassingly parallel for loop, no speedup

I have what seems to be a very simple parallel for loop, which just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop gets. I thought this was due to some cache thrashing, so I played around with schedules, chunk size, __restrict__, nesting the parallel for inside a parallel block, and flushes. Then I noticed that reading the array to do a reduction is also slower.
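
For illustration, one of the variants I tried looked roughly like this (ptr and sz_i are defined in the full code below; the schedule kind and chunk size here are placeholder values, not the exact ones I used):

// explicit schedule and chunk size on the zero-fill loop (values are illustrative)
#pragma omp parallel for schedule(static, 4096)
for(int i = 0; i < sz_i; ++ i)
    ptr[i] = 0;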

This should obviously be very simple and should speed up nearly linearly. What am I missing here?

Full code:

#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>

void tic(), toc();

int main(int argc, const char *argv[])
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);

    std::cout << "max threads: " << omp_get_max_threads()<< std::endl;

    std::cout << "serial reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        for(size_t i = 0; i < sz; ++ i)
            sum += vec[i];
    }
    toc();

    int *const ptr = vec.data();
    const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

    std::cout << "parallel reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    toc();

    std::cout << "serial memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            vec[i] = 0;
    }
    toc();

    std::cout << "parallel memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = 0;
    }
    toc();

    return 0;
}

static clock_t ClockCounter;

void tic()
{
    ClockCounter = std::clock();
}

void toc()
{
    ClockCounter = std::clock() - ClockCounter;
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}

Running this yields:

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
  elapsed clock ticks: 1790000
parallel reduction
  elapsed clock ticks: 19690000
serial memset
  elapsed clock ticks: 3860000
parallel memset
  elapsed clock ticks: 20800000

If I run with -O2, g++ is able to optimize the serial reduction away and I get zero time, hence -O1. Also, calling omp_set_num_threads(1); makes the times more similar, although there are still some differences:

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 1
serial reduction
  elapsed clock ticks: 1770000
parallel reduction
  elapsed clock ticks: 7370000
serial memset
  elapsed clock ticks: 2290000
parallel memset
  elapsed clock ticks: 3550000

This should be fairly obvious; I feel like I'm missing something very elementary. My CPU is an Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz with hyperthreading, but the same behavior is observed on a colleague's i5 with 4 cores and no hyperthreading. We're both running Linux.

EDIT

It seems that part of the error was on the timing side; running with:

static double ClockCounter;

void tic()
{
    ClockCounter = omp_get_wtime();//std::clock();
}

void toc()
{
    ClockCounter = omp_get_wtime()/*std::clock()*/ - ClockCounter;
    // note: the value is now elapsed wall-clock seconds, even though the label still says "clock ticks"
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}

yields more "reasonable" times:

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
  elapsed clock ticks: 1.80974
parallel reduction
  elapsed clock ticks: 2.07367
serial memset
  elapsed clock ticks: 2.37713
parallel memset
  elapsed clock ticks: 2.23609

But still, there is no speedup; it is just not slower anymore.

EDIT2:

As suggested by user8046, the code is heavily memory bound, and as suggested by Z boson, the serial code is easily optimized away, so it is not certain what is being measured here. So I made a small change: I declared sum outside of the outer loop over c, so that it is not reset to zero at every iteration. I also replaced the reduction operation with sum += F(vec[i]) and the memset operation with vec[i] = F(i) (a sketch of the modified loops is shown after the timings below). Running as:

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 -D"F(x)=sqrt(double(x))"
./omp_test
max threads: 12
serial reduction
  elapsed clock ticks: 23.9106
parallel reduction
  elapsed clock ticks: 3.35519
serial memset
  elapsed clock ticks: 43.7344
parallel memset
  elapsed clock ticks: 6.50351

Calculating the square root adds more work per element, and there is finally some reasonable speedup (about 7x, which makes sense, as the hyperthreaded cores share memory lanes).
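
For reference, the modified loops look roughly like this (F is supplied on the command line via -D as shown above; ptr, sz_i, tic() and toc() are the same as in the original code):

// reduction with extra work per element; sum now lives outside the loop over c,
// so the compiler cannot discard it, and it is printed at the end
double sum = 0;
tic();
for(int c = 0; c < COUNT; ++ c) {
    #pragma omp parallel for default(none) reduction(+:sum)
    for(int i = 0; i < sz_i; ++ i)
        sum += F(ptr[i]);
}
toc();
std::cout << "sum: " << sum << std::endl;

// "memset" with extra work per element; F returns double, hence the explicit cast
tic();
for(int c = 0; c < COUNT; ++ c) {
    #pragma omp parallel for default(none)
    for(int i = 0; i < sz_i; ++ i)
        ptr[i] = int(F(i));
}
toc();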

asked Sep 02 '14 by the swine

3 Answers

You spotted the timing error. There is still no speedup because both of your test cases are heavily memory bound. On typical consumer hardware all of the cores share one memory bus, so using more threads does not give you more bandwidth, and since bandwidth is the bottleneck, it does not give you a speedup either. This will probably change if you reduce your problem size so that it fits into cache, and it will certainly change if you increase the amount of computation per data element, for example if you were computing the reduction of exp(vec[i]) or 1/vec[i].

For the memset: a single thread can already saturate the memory bandwidth, so you will never see a speedup there (unless more threads give you access to a second memory bus, as on some multi-socket architectures).

One remark regarding the reduction: it is most probably not implemented with a lock, which would be horribly inefficient, but with an addition tree, whose depth grows only logarithmically with the number of threads.
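
For illustration, a reduction that does more arithmetic per loaded element, roughly along the lines below, should scale much better (this assumes <cmath> is included and reuses ptr and sz_i from the question; exp is just an example of a compute-heavy function):

// more work per element shifts the bottleneck from memory bandwidth to the ALUs,
// so additional threads actually help
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i = 0; i < sz_i; ++ i)
    sum += std::exp(double(ptr[i]));
std::cout << "sum: " << sum << std::endl;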

answered by fweik


Besides your error in using the clock function on Linux, the rest of your question can be answered by reading these questions/answers:

in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p/11579987

measuring-memory-bandwidth-from-the-dot-product-of-two-arrays

memset-in-parallel-with-threads-bound-to-each-physical-core

So you should see a significant benefit from multiple threads with memset and when doing a reduction, even on a single-socket system. I have written my own tool to measure bandwidth for this. You can find some of the results from my i5-4250U (Haswell) with 2 cores below (GCC 4.8, Linux 3.13, EGLIBC 2.19), running over 1 GB. vsum is your reduction. Notice that there is a significant improvement even on this two-core system.

one thread

C standard library
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.80       6.68       0.00        inf %
memcpy:              1.00       1.35       7.93       0.00        inf %

Agner Fog's asmlib
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.71       7.53       0.00        inf %
memcpy:              1.00       0.93      11.51       0.00        inf %

my_memset   
                     0.50       0.71       7.53       0.00        inf %


FMA3 reduction tests
                       GB    time(s)       GB/s     GFLOPS   efficiency
vsum:                0.50       0.53      10.08       2.52        inf %
vmul:                0.50       0.68       7.93       1.98        inf %
vtriad:              0.50       0.70       7.71       3.85        inf %
dot                  1.00       1.08       9.93       2.48        inf %

two threads

C standard library
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.64       8.33       0.00        inf %
memcpy:              1.00       1.10       9.76       0.00        inf %

Agner Fog's asmlib
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.36      14.98       0.00        inf %
memcpy:              1.00       0.66      16.30       0.00        inf %

my_memset
                     0.50       0.36      15.03       0.00        inf %


FMA3 tests
standard sum tests with OpenMP: 2 threads
                       GB    time(s)       GB/s     GFLOPS   efficiency
vsum:                0.50       0.41      13.03       3.26        inf %
vmul:                0.50       0.39      13.67       3.42        inf %
vtriad:              0.50       0.44      12.20       6.10        inf %
dot                  1.00       0.97      11.11       2.78        inf %

Here is my custom memset function (I have several other tests like this).

#include <emmintrin.h> // SSE2 intrinsics: __m128i, _mm_set1_epi32, _mm_stream_si128
#include <cstddef>     // size_t

// fills n ints with c using non-temporal (streaming) stores, 4 ints (16 bytes) per store;
// s must be 16-byte aligned and n a multiple of 4
void my_memset(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    #pragma omp parallel for
    for(i = 0; i < int(n / 4); i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}

Edit:

You should compile with -O3 and -ffast-math. Define the sum outside of the outer loop and then print it out so that GCC does not optimize it away. GCC won't auto-vectorize a reduction because floating point arithmetic is not associative and vectorizing the loop could break IEEE floating point rules. Using -ffast-math lets GCC treat floating point arithmetic as associative, which allows it to vectorize the reduction. It should be pointed out that doing a reduction in OpenMP already assumes the floating point arithmetic is associative, so it already breaks IEEE floating point rules.

double sum = 0;
tic();
for(int c = 0; c < COUNT; ++ c) { 
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < sz_i; ++ i)
        sum += ptr[i];
}
toc();
printf("sum %f\n", sum);

Edit:

I tested your code and made some modifications. I get faster times for the reduction and the memset when using multiple threads:

max threads: 4
serial reduction
dtime 1.86, sum 705032704
parallel reduction
dtime 1.39 s, sum 705032704
serial memset
dtime 2.95 s
parallel memset
dtime 2.44 s
serial my_memset
dtime 2.66 s
parallel my_memset
dtime 1.35 s

Here is the code I used (g++ foo.cpp -fopenmp -O3 -ffast-math)

#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>
#include <stdio.h>

#include <emmintrin.h> // SSE2 intrinsics: __m128i, _mm_set1_epi32, _mm_stream_si128

void my_memset(int *s, int c, size_t n) {
    // serial non-temporal (streaming) fill, 4 ints (16 bytes) per store
    int i;
    __m128i v = _mm_set1_epi32(c);
    for(i = 0; i < int(n / 4); i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}

void my_memset_omp(int *s, int c, size_t n) {
    // same as my_memset, but with the stores distributed across threads
    int i;
    __m128i v = _mm_set1_epi32(c);
    #pragma omp parallel for
    for(i = 0; i < int(n / 4); i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}

int main(int argc, const char *argv[])
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);

    std::cout << "max threads: " << omp_get_max_threads()<< std::endl;

    std::cout << "serial reduction" << std::endl;
    double dtime;
    int sum;

    dtime = -omp_get_wtime();
    sum = 0;
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            sum += vec[i];
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f, sum %d\n", dtime, sum);

    int *const ptr = vec.data();
    const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

    std::cout << "parallel reduction" << std::endl;


    dtime = -omp_get_wtime();
    sum = 0;
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f s, sum %d\n", dtime, sum);

    std::cout << "serial memset" << std::endl;

    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            vec[i] = 0;
    }   
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "parallel memset" << std::endl;
    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = 0;
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "serial my_memset" << std::endl;

    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) my_memset(ptr, 0, sz_i);

    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "parallel my_memset" << std::endl;
    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) my_memset_omp(ptr, 0, sz_i);
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    return 0;
}

answered by Z boson


You are using std::clock, which reports CPU time used, not real (wall-clock) time. With multiple threads, each thread's CPU time is added up, so the total will always be higher than the single-threaded time (due to the parallelization overhead).

http://en.cppreference.com/w/cpp/chrono/c/clock
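
For illustration, a minimal self-contained sketch that times the same parallel loop both ways, so the difference is visible (compile with -fopenmp; the loop body is arbitrary):

#include <omp.h>
#include <ctime>
#include <cstdio>

int main()
{
    const int N = 100000000;
    double sum = 0;

    std::clock_t c0 = std::clock(); // CPU time, summed over all threads
    double w0 = omp_get_wtime();    // wall-clock time

    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < N; ++ i)
        sum += i * 0.5;

    double cpu_s  = double(std::clock() - c0) / CLOCKS_PER_SEC;
    double wall_s = omp_get_wtime() - w0;

    // with k busy threads, cpu_s comes out roughly k times larger than wall_s
    std::printf("cpu %.2f s, wall %.2f s, sum %f\n", cpu_s, wall_s, sum);
    return 0;
}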

answered by dohashi