OpenMP C++ Matrix Multiplication run slower in parallel

I'm learning the basics of parallel execution of for loops using OpenMP.

Sadly, my parallel program runs 10x slower than the serial version. What am I doing wrong? Am I missing some barriers?

double **basicMultiply(double **A, double **B, int size) {
   int i, j, k;
   double **res = createMatrix(size);
   omp_set_num_threads(4);
   #pragma omp parallel for private(k)
   for (i = 0; i < size; i++) {
      for (j = 0; j < size; j++) {
         for (k = 0; k < size; k++) {
            res[i][j] += A[i][k] * B[k][j];
         }
      }
   }
   return res;
}

Thank you very much!

asked Mar 25 '14 by Hynek Blaha



3 Answers

Your problem is due to a race condition on the inner loop variable j: it is shared between the threads, so it needs to be made private.

For C89 I would do something like this:

#pragma omp parallel
{
    int i, j, k;
    #pragma omp for
    for(i=0; ...

For C++ or C99, use mixed declarations:

#pragma omp parallel for
for(int i=0; ...

Doing this, you don't have to explicitly declare anything as shared or private.
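
Applied to the function from the question, a minimal sketch of the fix (keeping the createMatrix helper from the original code, assumed to allocate and zero the result, and assuming C99 or C++) could look like this:

double **basicMultiply(double **A, double **B, int size) {
   double **res = createMatrix(size);  /* assumed to allocate and zero-initialize the result */
   omp_set_num_threads(4);
   #pragma omp parallel for
   for (int i = 0; i < size; i++) {    /* i, j, k declared inside the loops, so each thread gets private copies */
      for (int j = 0; j < size; j++) {
         for (int k = 0; k < size; k++) {
            res[i][j] += A[i][k] * B[k][j];
         }
      }
   }
   return res;
}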

Some further comments on your code. Your single-threaded code is not cache friendly when you access B[k][j]: each step of the inner loop reads from a different cache line, and by the time the dot product finishes the earlier cache lines have been evicted. Instead, take the transpose first and access BT[j][k], so the inner loop walks through memory contiguously. Additionally, you have allocated an array of arrays rather than one contiguous 2D array. I fixed your code to use the transpose and a contiguous 2D array.

Here are the times I get for size=512.

no transpose, no openmp  0.94s
no transpose, openmp     0.23s
transpose, no openmp     0.27s
transpose, openmp        0.08s

Below is the code (also see http://coliru.stacked-crooked.com/a/ee174916fa035f97)

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void transpose(double *A, double *B, int n) {
    int i,j;
    for(i=0; i<n; i++) {
        for(j=0; j<n; j++) {
            B[j*n+i] = A[i*n+j];
        }
    }
}

void gemm(double *A, double *B, double *C, int n) 
{   
    int i, j, k;
    for (i = 0; i < n; i++) { 
        for (j = 0; j < n; j++) {
            double dot  = 0;
            for (k = 0; k < n; k++) {
                dot += A[i*n+k]*B[k*n+j];
            } 
            C[i*n+j] = dot;
        }
    }
}

void gemm_omp(double *A, double *B, double *C, int n) 
{   
    #pragma omp parallel
    {
        int i, j, k;
        #pragma omp for
        for (i = 0; i < n; i++) { 
            for (j = 0; j < n; j++) {
                double dot  = 0;
                for (k = 0; k < n; k++) {
                    dot += A[i*n+k]*B[k*n+j];
                } 
                C[i*n+j] = dot;
            }
        }

    }
}

void gemmT(double *A, double *B, double *C, int n) 
{   
    int i, j, k;
    double *B2;
    B2 = (double*)malloc(sizeof(double)*n*n);
    transpose(B,B2, n);
    for (i = 0; i < n; i++) { 
        for (j = 0; j < n; j++) {
            double dot  = 0;
            for (k = 0; k < n; k++) {
                dot += A[i*n+k]*B2[j*n+k];
            } 
            C[i*n+j] = dot;
        }
    }
    free(B2);
}

void gemmT_omp(double *A, double *B, double *C, int n) 
{   
    double *B2;
    B2 = (double*)malloc(sizeof(double)*n*n);
    transpose(B,B2, n);
    #pragma omp parallel
    {
        int i, j, k;
        #pragma omp for
        for (i = 0; i < n; i++) { 
            for (j = 0; j < n; j++) {
                double dot  = 0;
                for (k = 0; k < n; k++) {
                    dot += A[i*n+k]*B2[j*n+k];
                } 
                C[i*n+j] = dot;
            }
        }

    }
    free(B2);
}

int main() {
    int i, n;
    double *A, *B, *C, dtime;

    n=512;
    A = (double*)malloc(sizeof(double)*n*n);
    B = (double*)malloc(sizeof(double)*n*n);
    C = (double*)malloc(sizeof(double)*n*n);
    for(i=0; i<n*n; i++) { A[i] = (double)rand()/RAND_MAX; B[i] = (double)rand()/RAND_MAX; }

    dtime = omp_get_wtime();
    gemm(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemm_omp(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemmT(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemmT_omp(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    return 0;

}
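
For reference, on Linux this example would typically be built with optimization and OpenMP enabled, for instance with GCC (the file name gemm.c is just a placeholder):

gcc -O3 -fopenmp gemm.c -o gemm
./gemm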
answered Oct 17 '22 by Z boson

In addition, "Z boson", I have tested your C code on a laptop with an Intel i5 (2 physical cores, 4 logical ones). Unfortunately, the calculation speed is not very fast. For 2000x2000 random double matrices I obtained the following results (using VS 2010 with OpenMP 2.0):

Compiled for Win64: C = A*B, where A, B are matrices of size 2000x2000:

max number of threads = 4
Create random matrices: = 0.303555 s
no transpose no openmp = 100.539924 s
no transpose, openmp = 47.876084 s
transpose, no openmp = 27.872169 s
transpose, openmp = 15.821010 s

Compiled for Win32: C = A*B, where A, B are matrices of size 2000x2000:

max number of threads = 4
Create random matrices: = 0.378804 s
no transpose no openmp = 98.613992 s
no transpose, openmp = 48.233655 s
transpose, no openmp = 29.590350 s
transpose, openmp = 13.678097 s

Note that for the original "Hynek Blaha" code the calculation time on my system is 739.208 s (226.62 s with OpenMP)!

Whereas in Matlab x64:

n = 2000; 
A = rand(n); B = rand(n);

tic
C = A*B;
toc

the calculation time is 0.591440 seconds.

But using the OpenBLAS package I reached 0.377814 seconds (using MinGW with OpenMP 4.0). The Armadillo package provides a simple way (in my opinion) to connect matrix operations to OpenBLAS (or other similar packages). In this case the code is

#include <iostream>
#include <armadillo>
using namespace std;
using namespace arma;

int main(){
    int n = 2000;
    int N = 10; // number of repetitions
    wall_clock timer;

    arma_rng::set_seed_random();

    mat A(n, n, fill::randu), B(n, n, fill::randu);

    timer.tic();
    // repeat simulation N times
    for(int i = 0; i < N; i++){
      mat C = A*B;
    }
    cout << timer.toc()/double(N) << "s" << endl;

    return 0;
}
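
To build this, you need to link against Armadillo (and ideally an optimized BLAS such as OpenBLAS). With g++ it is usually something along the lines of the following, though the exact libraries and paths depend on how Armadillo was installed (the file name armadillo_gemm.cpp is just a placeholder):

g++ -O2 -fopenmp armadillo_gemm.cpp -o armadillo_gemm -larmadillo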
answered Oct 17 '22 by Alexander Korovin


If size is small, the overhead of thread creation and synchronization will overshadow any performance gain from the parallel computation.
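
One way to deal with this, if the matrix size varies at run time, is OpenMP's if clause, which keeps the loop serial below a chosen cutoff. A minimal sketch using the flat-array layout from the accepted answer (the function name gemm_if is mine, and the cutoff of 64 is just an illustrative guess to be tuned per machine):

#include <omp.h>

/* Multiply n x n matrices stored as flat arrays; run in parallel only
   when n is large enough to amortize the threading overhead. */
void gemm_if(double *A, double *B, double *C, int n) {
    #pragma omp parallel for if(n >= 64)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double dot = 0;
            for (int k = 0; k < n; k++)
                dot += A[i*n + k] * B[k*n + j];
            C[i*n + j] = dot;
        }
    }
}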

answered Oct 18 '22 by rerx