 

OpenMP GCC GOMP wasteful barrier

Tags:

gcc

openmp

I have the following program. nv is around 100 and each dgemm is roughly 20x100, so there is plenty of work to go around:

#pragma omp parallel for schedule(dynamic,1)
for (int c = 0; c < int(nv); ++c) {
    omp::thread thread;

    // per-thread scratch/accumulator matrices, indexed by thread
    matrix &t3_c = vv_.at(omp::num_threads() + thread);
    if (terms.first) {
        blas::gemm(1, t2_, vvvo_, 1, t3_c);
        blas::gemm(1, vvvo_, t2_, 1, t3_c);
    }

    matrix &t3_b = vv_[thread];
    if (terms.second) {
        matrix &t2_ci = vo_[thread];
        blas::gemm(-1, t2_ci, Vjk_, 1, t3_c);
        blas::gemm(-1, t2_ci, Vkj_, 0, t3_b);
    }
}

However, with GCC 4.4 (GOMP v1), gomp_barrier_wait_end accounts for nearly 50% of the runtime. Changing GOMP_SPINCOUNT alleviates the overhead, but then only about 60% of the cores are used. The same happens with OMP_WAIT_POLICY=passive. The system is Linux, 8 cores.

How can I get full utilization without the spinning/waiting overhead?

Asked Apr 18 '11 by Anycorn


2 Answers

The barrier is a symptom, not the problem. The reason that there's lots of waiting at the end of the loop is that some of the threads are done well before the others, and they all wait at the end of the for loop for quite a while until everyone's done.

This is a classic load imbalance problem, which is odd here, since it's just a bunch of matrix multiplies. Are they of varying sizes? How are they laid out in memory, in terms of NUMA: are they all currently sitting in one core's cache, or are there other sharing issues? Or, more simply, are there only 9 matrices, so that the remaining 8 threads are doomed to sit waiting for whoever got the last one?
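One quick way to confirm imbalance is to time each thread's useful work separately and compare. Below is a minimal, self-contained sketch; busy_work() is a hypothetical, deliberately uneven stand-in for the real loop body (the gemm calls), so the spread is visible:

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-in for the per-iteration work; deliberately uneven.
    static void busy_work(int c) {
        volatile double x = 0;
        for (int i = 0; i < 100000 * (c % 8 + 1); ++i)
            x = x + i * 1e-9;
    }

    int main() {
        const int nv = 100;
        std::vector<double> busy(omp_get_max_threads(), 0.0);

        #pragma omp parallel
        {
            const int tid = omp_get_thread_num();

            #pragma omp for schedule(dynamic,1)
            for (int c = 0; c < nv; ++c) {
                const double t0 = omp_get_wtime();
                busy_work(c);                       // replace with the real loop body
                busy[tid] += omp_get_wtime() - t0;  // per-thread useful time
            }
        }

        // A large spread between the busiest and the least busy thread means
        // the barrier time really is imbalance, not runtime overhead.
        for (int t = 0; t < (int)busy.size(); ++t)
            std::printf("thread %d busy for %.3f s\n", t, busy[t]);
        return 0;
    }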

When this sort of thing happens in a larger parallel block of code, sometimes it's OK to proceed to the next block of code while some of the loop iterations aren't done yet; there you can add the nowait clause to the for directive, which overrides the default behaviour and gets rid of the implied barrier. Here, though, since the parallel block is exactly the size of the for loop, that can't really help.
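For reference, the general pattern looks something like the sketch below. The per-thread sum array and the follow-up printf are placeholders, and, as said above, this only helps when the parallel region contains more than just the loop:

    #include <omp.h>
    #include <cstdio>

    int main() {
        const int n = 100;
        double sum[64] = {0};  // one slot per thread; assumes <= 64 threads

        #pragma omp parallel
        {
            const int tid = omp_get_thread_num();

            // nowait drops the implied barrier at the end of the worksharing
            // loop, so a thread that runs out of iterations moves straight on.
            #pragma omp for schedule(dynamic,1) nowait
            for (int c = 0; c < n; ++c)
                sum[tid] += c * 0.5;  // stand-in for the per-iteration gemms

            // Anything placed here must not depend on iterations that may
            // still be running on other threads.
            std::printf("thread %d finished its share, sum=%g\n", tid, sum[tid]);
        }
        return 0;
    }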

Answered by Jonathan Dursi


Could it be that your BLAS implementation also uses OpenMP internally? That would add extra barriers of its own, unless your profile shows only a single call site for gomp_barrier_wait_end.
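If that turns out to be the case, one way to rule it out is to stop the runtime from creating a second, nested team inside the gemm calls. The sketch below uses only standard OpenMP 3.0 calls; whether your BLAS actually threads via OpenMP (rather than its own thread pool, as some BLAS libraries do) is an assumption you would have to check:

    #include <omp.h>
    #include <cstdio>

    int main() {
        omp_set_dynamic(0);            // keep the team size fixed
        omp_set_nested(0);             // disable nested parallel regions
        omp_set_max_active_levels(1);  // OpenMP 3.0: at most one active level

        #pragma omp parallel
        {
            #pragma omp single
            std::printf("outer team: %d threads at level %d\n",
                        omp_get_num_threads(), omp_get_level());

            // Any parallel region opened inside a library call from here
            // (e.g. an OpenMP-threaded gemm) now runs with a single thread.
        }
        return 0;
    }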

Answered by ipapadop