Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does MKL optimize cblas for *major order?

I am using mkl cblas_dgemm and currently have it with CblasRowMajor, CblasNoTrans, CblasNotrans, for my matrices.

I know that c is a row major language, whereas dgemm is a column major algorithm. I am interested to know if switching the ordering of the matrices will have any affect on the cblas_dgemm algorithm if I am linking against mkl. Is mkl smart enough to do things behind the scenes that I would try to do to optimized the matrix multiplcations? If not, what is the best way to perform matrix multiplications with mkl?

like image 457
drjrm3 Avatar asked Sep 07 '15 20:09

drjrm3


1 Answers

TL;DR: In short it does not matter whether you perform matrix-matrix multiplications using row-major or column-major ordering with MKL (and other BLAS implementations).


I know that c is a row major language, whereas dgemm is a column major algorithm.

DGEMM is not a column-major algorithm, it is the BLAS interface for computing a matrix-matrix product with general matrices. The common reference implementation for DGEMM (and most of BLAS) is Netlib's which is written in Fortran. The only reason it assumes column-major ordering is because Fortran is a column-major order language. DGEMM (and the corresponding BLAS Level 3 functions) is not specifically for column-major data.

What Does DGEMM Compute?

DGEMM in basic maths performs a 2D matrix-matrix multiplication. The standard Order N^3 algorithm for multiplying 2D matrices requires you to traverse one matrix along its rows and the other along its columns. To perform the matrix-matrix multiplication, AB = C, we would multiply the rows of A by the columns of B to produce C. Therefore, the ordering of the input matrices does not matter as one matrix must be traversed along its rows and the other along its columns.

Investigating Row-Major and Column-Major DGEMM Computation with MKL

Intel MKL is smart enough to utilise this under the hood and provides the exact same performance for row-major and column-major data.

cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, ...);

and

cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, ...);

will execute with similar performance. We can test this with a relatively simple program

#include <float.h>
#include <mkl.h>
#include <omp.h>
#include <stdio.h>

void init_matrix(double *A, int n, int m, double d);
void test_dgemm(CBLAS_LAYOUT Layout, double *A, double *B, double *C, const MKL_INT m, const MKL_INT n, const MKL_INT k, int nSamples, double *timing);
void print_summary(const MKL_INT m, const MKL_INT n, const MKL_INT k, const int nSamples, const double *timing);

int main(int argc, char **argv) {
    MKL_INT n, k, m;
    double *a, *b, *c;
    double *timing;
    int nSamples = 1;

    if (argc != 5){
        fprintf(stderr, "Error: Wrong number of arguments!\n");
        fprintf(stderr, "usage: %s mMatrix nMatrix kMatrix NSamples\n", argv[0]);
        return -1;
    }

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    k = atoi(argv[3]);

    nSamples = atoi(argv[4]);

    timing = malloc(nSamples * sizeof *timing);

    a = mkl_malloc(m*k * sizeof *a, 64);
    b = mkl_malloc(k*n * sizeof *a, 64);
    c = mkl_calloc(m*n, sizeof *a, 64);

    /** ROW-MAJOR ORDERING **/
    test_dgemm(CblasRowMajor, a, b, c, m, n, k, nSamples, timing);

    /** COLUMN-MAJOR ORDERING **/
    test_dgemm(CblasColMajor, a, b, c, m, n, k, nSamples, timing);

    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
    free(timing);
}

void init_matrix(double *A, int n, int m, double d) {
    int i, j;
    #pragma omp for schedule (static) private(i,j)
    for (i = 0; i < n; ++i) {
        for (j = 0; j < m; ++j) {
            A[j + i*n] = d * (double) ((i - j) / n);
        }
    }
}

void test_dgemm(CBLAS_LAYOUT Layout, double *A, double *B, double *C, const MKL_INT m, const MKL_INT n, const MKL_INT k, int nSamples, double *timing) {
    int i;
    MKL_INT lda = m, ldb = k, ldc = m;
    double alpha = 1.0, beta = 0.0;

    if (CblasRowMajor == Layout) {
        printf("\n*****ROW-MAJOR ORDERING*****\n\n");
    } else if (CblasColMajor == Layout) {
        printf("\n*****COLUMN-MAJOR ORDERING*****\n\n");
    }

    init_matrix(A, m, k, 0.5);
    init_matrix(B, k, n, 0.75);
    init_matrix(C, m, n, 0);

    // First call performs any buffer/thread initialisation
    cblas_dgemm(Layout, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

    double tmin = DBL_MAX, tmax = 0.0;
    for (i = 0; i < nSamples; ++i) {
        init_matrix(A, m, k, 0.5);
        init_matrix(B, k, n, 0.75);
        init_matrix(C, m, n, 0);

        timing[i] = dsecnd();
        cblas_dgemm(Layout, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        timing[i] = dsecnd() - timing[i];

        if (tmin > timing[i]) tmin = timing[i];
        else if (tmax < timing[i]) tmax = timing[i];
    }

    print_summary(m, n, k, nSamples, timing);
}

void print_summary(const MKL_INT m, const MKL_INT n, const MKL_INT k, const int nSamples, const double *timing) {
    int i;

    double tavg = 0.0;
    for(i = 0; i < nSamples; i++) {
        tavg += timing[i];
    }
    tavg /= nSamples;

    printf("#Loop | Sizes  m   n   k  | Time (s)\n");
    for(i = 0; i < nSamples; i++) {
        printf("%4d %12d %3d %3d  %6.4f\n", i + 1 , m, n, k, timing[i]);
    }

    printf("Summary:\n");
    printf("Sizes  m   n   k  | Avg. Time (s)\n");
    printf(" %8d %3d %3d %12.8f\n", m, n, k, tavg);
}

on my system this produces

$ ./benchmark_dgemm 1000 1000 1000 5
*****ROW-MAJOR ORDERING*****

#Loop | Sizes  m   n   k  | Time (s)
   1         1000 1000 1000  0.0589
   2         1000 1000 1000  0.0596
   3         1000 1000 1000  0.0603
   4         1000 1000 1000  0.0626
   5         1000 1000 1000  0.0584
Summary:
Sizes  m   n   k  | Avg. Time (s)
     1000 1000 1000   0.05995692

*****COLUMN-MAJOR ORDERING*****

#Loop | Sizes  m   n   k  | Time (s)
   1         1000 1000 1000  0.0597
   2         1000 1000 1000  0.0610
   3         1000 1000 1000  0.0581
   4         1000 1000 1000  0.0594
   5         1000 1000 1000  0.0596
Summary:
Sizes  m   n   k  | Avg. Time (s)
     1000 1000 1000   0.05955171

where we can see that there is very little difference between the column-major ordering time and the row-major ordering time. 0.0595 seconds for column-major versus 0.0599 seconds for row-major. Executing this again might produce the following, where the row-major calculation is faster by 0.00003 seconds.

$ ./benchmark_dgemm 1000 1000 1000 5
*****ROW-MAJOR ORDERING*****

#Loop | Sizes  m   n   k  | Time (s)
   1         1000 1000 1000  0.0674
   2         1000 1000 1000  0.0598
   3         1000 1000 1000  0.0595
   4         1000 1000 1000  0.0587
   5         1000 1000 1000  0.0584
Summary:
Sizes  m   n   k  | Avg. Time (s)
     1000 1000 1000   0.06075310

*****COLUMN-MAJOR ORDERING*****

#Loop | Sizes  m   n   k  | Time (s)
   1         1000 1000 1000  0.0634
   2         1000 1000 1000  0.0596
   3         1000 1000 1000  0.0582
   4         1000 1000 1000  0.0582
   5         1000 1000 1000  0.0645
Summary:
Sizes  m   n   k  | Avg. Time (s)
     1000 1000 1000   0.06078266
like image 129
IKavanagh Avatar answered Sep 20 '22 04:09

IKavanagh