Optimising and why OpenMP is much slower than the sequential way?

Tags: performance, c, vector, matrix, openmp

I am a newbie in programming with OpenMP. I wrote a simple C program to multiply a matrix by a vector. Unfortunately, when I compared execution times, I found that the OpenMP version is much slower than the sequential one.

Here is my code (the matrix is N*N int, the vector is N int, the result is N long long):

#pragma omp parallel for private(i,j) shared(matrix,vector,result,m_size)
for(i=0;i<m_size;i++)
{  
  for(j=0;j<m_size;j++)
  {  
    result[i]+=matrix[i][j]*vector[j];
  }
}

And this is the code for the sequential way:

for (i=0;i<m_size;i++)
        for(j=0;j<m_size;j++)
            result[i] += matrix[i][j] * vector[j];

When I tried these two implementations with a 999x999 matrix and a 999-element vector, the execution times were:

Sequential: 5439 ms
Parallel: 11120 ms

I really cannot understand why the OpenMP version is so much slower than the sequential one (over 2 times slower!). Can anyone help me solve this problem?


asked May 04 '13 07:05

Alex Zhou

2 Answers

Your code partially suffers from so-called false sharing, which is typical of all cache-coherent systems. In short, many elements of the result[] array fit in the same cache line. When the thread executing iteration i writes to result[i] as a result of the += operator, the cache line holding that part of result[] becomes dirty. The cache coherency protocol then invalidates all copies of that cache line in the other cores, and they have to refresh their copy from the upper-level cache or from main memory. Since result is an array of long long, one cache line (64 bytes on x86) holds 8 elements, so besides result[i] there are 7 other array elements in the same cache line. It is therefore possible that two "neighbouring" threads will constantly fight for ownership of the cache line (assuming that each thread runs on a separate core).
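To make the arithmetic concrete, here is a minimal sketch of which iterations land in the same cache line. It assumes a 64-byte line and that result[0] starts exactly on a line boundary; the names CACHE_LINE_BYTES, elems_per_line, line_of and same_line are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size in bytes (64 on current x86 CPUs). */
#define CACHE_LINE_BYTES 64

/* The sketch only makes sense if whole elements fit in a line. */
static_assert(CACHE_LINE_BYTES % sizeof(long long) == 0,
              "cache line must hold a whole number of elements");

/* Number of long long elements that fit in one cache line. */
size_t elems_per_line(void)
{
    return CACHE_LINE_BYTES / sizeof(long long);
}

/* Index of the cache line holding result[i], assuming that
   result[0] starts exactly on a cache-line boundary. */
size_t line_of(size_t i)
{
    return i / elems_per_line();
}

/* Nonzero if iterations a and b write to the same cache line. */
int same_line(size_t a, size_t b)
{
    return line_of(a) == line_of(b);
}
```

Under these assumptions, iterations 0 to 7 share one line and iteration 8 starts the next one. If each thread works on a contiguous block of iterations, two threads can only collide where one block ends in the middle of a line.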

To mitigate false sharing in your case, the easiest thing to do is to ensure that each thread gets a block of iterations whose size is divisible by the number of elements in a cache line. For example, you can apply schedule(static,something*8), where something should be big enough that the iteration space is not fragmented into too many pieces, but at the same time small enough that each thread gets a block. E.g. for m_size equal to 999 and 4 threads, you would apply the schedule(static,256) clause to the parallel for construct.
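The rule above can be sketched as follows. This is only an illustration under some assumptions: the helper aligned_chunk and the function matvec_chunked are hypothetical names, the matrix is passed as a C99 variable-length array parameter, result[i] is zeroed inside the loop, and the product is widened to long long (neither of which the question's code does). The chunk boundaries coincide with cache-line boundaries only if result itself starts on a cache-line boundary.

```c
#include <assert.h>

/* Smallest multiple of per_line that is >= ceil(n / threads).
   Hypothetical helper, not part of OpenMP. */
int aligned_chunk(int n, int threads, int per_line)
{
    assert(n > 0 && threads > 0 && per_line > 0);
    int share = (n + threads - 1) / threads;             /* ceil(n / threads) */
    return (share + per_line - 1) / per_line * per_line; /* round up */
}

/* Matrix-vector product with an explicit static chunk size. */
void matvec_chunked(int n, int matrix[n][n], const int *vector,
                    long long *result, int chunk)
{
    int i, j;

    /* i is the loop variable and therefore private automatically;
       j is declared outside the loop and must be listed explicitly. */
    #pragma omp parallel for private(j) schedule(static, chunk)
    for (i = 0; i < n; i++) {
        result[i] = 0; /* assumption: start from zero instead of accumulating */
        for (j = 0; j < n; j++)
            result[i] += (long long)matrix[i][j] * vector[j];
    }
}
```

For n = 999, 4 threads and 8 elements per line, aligned_chunk gives 256, the value used in the schedule clause above. The code also compiles without OpenMP support, in which case the pragma is ignored and the loop runs serially.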

Another partial reason for the code to run slower might be that, when OpenMP is enabled, the compiler becomes reluctant to apply some code optimisations to assignments to shared variables. OpenMP provides a so-called relaxed memory model, in which the local memory view of a shared variable is allowed to differ between threads, and the flush construct is provided in order to synchronise the views. But compilers usually treat shared variables as implicitly volatile if they cannot prove that other threads would not need to access desynchronised shared variables. Your case is one of those, since result[i] is only assigned to and the value of result[i] is never used by other threads. In the serial case the compiler would most likely create a temporary variable to hold the result of the inner loop and would only assign to result[i] once the inner loop has finished. In the parallel case it might decide that this would create a temporarily desynchronised view of result[i] in the other threads and hence decide not to apply the optimisation. Just for the record, GCC 4.7.1 with -O3 -ftree-vectorize does the temporary variable trick both with and without OpenMP enabled.
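If you do not want to rely on the compiler, the temporary variable trick can be written by hand. A minimal sketch, assuming a C99 compiler; the function name matvec_local_sum is made up, and unlike the code in the question it overwrites result[i] instead of adding to it, and widens the product to long long:

```c
#include <assert.h>

/* Each row is summed in a thread-private variable; the shared
   array is written exactly once per row. */
void matvec_local_sum(int n, int matrix[n][n], const int *vector,
                      long long *result)
{
    assert(n > 0);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        long long sum = 0; /* declared inside the loop body, hence private */
        for (int j = 0; j < n; j++)
            sum += (long long)matrix[i][j] * vector[j];
        result[i] = sum;   /* single store to shared memory per row */
    }
}
```

Because sum lives in a register or on the thread's own stack, the inner loop no longer touches the cache line that holds result[i].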


answered Sep 21 '22 17:09

Hristo Iliev

Because when OpenMP distributes the work among threads there is a lot of administration/synchronisation going on to ensure the values in your shared matrix and vector are not corrupted somehow. Even though they are read-only: humans see that easily, your compiler may not.

Things to try out for pedagogic reasons:

0) What happens if matrix and vector are not shared?

1) Parallelize the inner "j-loop" first, keep the outer "i-loop" serial. See what happens.

2) Do not collect the sum in result[i], but in a variable temp and assign its contents to result[i] only after the inner loop is finished to avoid repeated index lookups. Don't forget to init temp to 0 before the inner loop starts.
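Experiments 1) and 2) can be combined in one sketch. Note that parallelizing the inner loop directly on result[i] would be a data race, so the version below sums into temp with a reduction clause. The function name matvec_inner_parallel is hypothetical, and the widening cast to long long is an addition that the question's code does not have.

```c
#include <assert.h>

/* Outer loop serial, inner loop parallel. */
void matvec_inner_parallel(int n, int matrix[n][n], const int *vector,
                           long long *result)
{
    assert(n > 0);

    for (int i = 0; i < n; i++) {
        long long temp = 0; /* init before the inner loop starts */

        /* A parallel region is entered once per row. reduction(+:temp)
           gives every thread a private partial sum and adds the
           partial sums into temp when the loop ends. */
        #pragma omp parallel for reduction(+:temp)
        for (int j = 0; j < n; j++)
            temp += (long long)matrix[i][j] * vector[j];

        result[i] = temp;
    }
}
```

With only 999 multiplications per row, the cost of entering and leaving the parallel region for every row will probably outweigh the gain, which is likely the point of experiment 1). Timing this against the version with the parallel outer loop should make the difference visible.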

answered Sep 19 '22 17:09

Laryx Decidua
