I am using MKL's cblas_dgemm and currently call it with CblasRowMajor, CblasNoTrans, CblasNoTrans for my matrices. I know that C is a row-major language, whereas dgemm is a column-major algorithm. I am interested to know whether switching the ordering of the matrices will have any effect on the cblas_dgemm algorithm when linking against MKL. Is MKL smart enough to do, behind the scenes, what I would try to do myself to optimize the matrix multiplications? If not, what is the best way to perform matrix multiplications with MKL?
TL;DR: It does not matter whether you perform matrix-matrix multiplications using row-major or column-major ordering with MKL (or other BLAS implementations).
I know that C is a row-major language, whereas dgemm is a column-major algorithm.
DGEMM is not a column-major algorithm; it is the BLAS interface for computing a matrix-matrix product with general matrices. The common reference implementation for DGEMM (and most of BLAS) is Netlib's, which is written in Fortran. The only reason it assumes column-major ordering is that Fortran is a column-major language. DGEMM (and the corresponding BLAS Level 3 functions) is not specific to column-major data.
Mathematically, DGEMM performs a 2D matrix-matrix multiplication. The standard algorithm for multiplying 2D matrices requires you to traverse one matrix along its rows and the other along its columns. To perform the matrix-matrix multiplication AB = C, we multiply the rows of A by the columns of B to produce C. Therefore, the storage ordering of the input matrices does not matter: one matrix must be traversed along its rows and the other along its columns either way.
Intel MKL is smart enough to exploit this under the hood and delivers essentially the same performance for row-major and column-major data.
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, ...);
and
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, ...);
will execute with similar performance. We can test this with a relatively simple program:
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

#include <mkl.h>
#include <omp.h>

void init_matrix(double *A, int n, int m, double d);
void test_dgemm(CBLAS_LAYOUT Layout, double *A, double *B, double *C,
                const MKL_INT m, const MKL_INT n, const MKL_INT k,
                int nSamples, double *timing);
void print_summary(const MKL_INT m, const MKL_INT n, const MKL_INT k,
                   const int nSamples, const double *timing);
int main(int argc, char **argv) {
    MKL_INT n, k, m;
    double *a, *b, *c;
    double *timing;
    int nSamples;

    if (argc != 5) {
        fprintf(stderr, "Error: Wrong number of arguments!\n");
        fprintf(stderr, "usage: %s mMatrix nMatrix kMatrix NSamples\n", argv[0]);
        return -1;
    }

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    k = atoi(argv[3]);
    nSamples = atoi(argv[4]);

    timing = malloc(nSamples * sizeof *timing);
    a = mkl_malloc(m*k * sizeof *a, 64);
    b = mkl_malloc(k*n * sizeof *b, 64);
    c = mkl_calloc(m*n, sizeof *c, 64);

    /** ROW-MAJOR ORDERING **/
    test_dgemm(CblasRowMajor, a, b, c, m, n, k, nSamples, timing);

    /** COLUMN-MAJOR ORDERING **/
    test_dgemm(CblasColMajor, a, b, c, m, n, k, nSamples, timing);

    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
    free(timing);
}
void init_matrix(double *A, int n, int m, double d) {
    int i, j;
    // Fill an n-by-m matrix contiguously: element (i,j) lives at i*m + j.
    #pragma omp parallel for schedule(static) private(j)
    for (i = 0; i < n; ++i) {
        for (j = 0; j < m; ++j) {
            A[i*m + j] = d * (double) ((i - j) / n);
        }
    }
}
void test_dgemm(CBLAS_LAYOUT Layout, double *A, double *B, double *C,
                const MKL_INT m, const MKL_INT n, const MKL_INT k,
                int nSamples, double *timing) {
    int i;
    double alpha = 1.0, beta = 0.0;
    // The leading dimension is the stride between consecutive rows
    // (row-major) or columns (column-major), so it depends on the layout.
    MKL_INT lda = (Layout == CblasRowMajor) ? k : m;
    MKL_INT ldb = (Layout == CblasRowMajor) ? n : k;
    MKL_INT ldc = (Layout == CblasRowMajor) ? n : m;

    if (CblasRowMajor == Layout) {
        printf("\n*****ROW-MAJOR ORDERING*****\n\n");
    } else if (CblasColMajor == Layout) {
        printf("\n*****COLUMN-MAJOR ORDERING*****\n\n");
    }

    init_matrix(A, m, k, 0.5);
    init_matrix(B, k, n, 0.75);
    init_matrix(C, m, n, 0.0);

    // First call performs any buffer/thread initialisation
    cblas_dgemm(Layout, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

    double tmin = DBL_MAX, tmax = 0.0;
    for (i = 0; i < nSamples; ++i) {
        init_matrix(A, m, k, 0.5);
        init_matrix(B, k, n, 0.75);
        init_matrix(C, m, n, 0.0);

        timing[i] = dsecnd();
        cblas_dgemm(Layout, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        timing[i] = dsecnd() - timing[i];

        if (tmin > timing[i]) tmin = timing[i];
        else if (tmax < timing[i]) tmax = timing[i];
    }

    print_summary(m, n, k, nSamples, timing);
}
void print_summary(const MKL_INT m, const MKL_INT n, const MKL_INT k,
                   const int nSamples, const double *timing) {
    int i;
    double tavg = 0.0;

    for (i = 0; i < nSamples; i++) {
        tavg += timing[i];
    }
    tavg /= nSamples;

    printf("#Loop | Sizes m n k | Time (s)\n");
    for (i = 0; i < nSamples; i++) {
        printf("%4d %12d %3d %3d %6.4f\n", i + 1, m, n, k, timing[i]);
    }
    printf("Summary:\n");
    printf("Sizes m n k | Avg. Time (s)\n");
    printf(" %8d %3d %3d %12.8f\n", m, n, k, tavg);
}
On my system this produces:
$ ./benchmark_dgemm 1000 1000 1000 5
*****ROW-MAJOR ORDERING*****
#Loop | Sizes m n k | Time (s)
1 1000 1000 1000 0.0589
2 1000 1000 1000 0.0596
3 1000 1000 1000 0.0603
4 1000 1000 1000 0.0626
5 1000 1000 1000 0.0584
Summary:
Sizes m n k | Avg. Time (s)
1000 1000 1000 0.05995692
*****COLUMN-MAJOR ORDERING*****
#Loop | Sizes m n k | Time (s)
1 1000 1000 1000 0.0597
2 1000 1000 1000 0.0610
3 1000 1000 1000 0.0581
4 1000 1000 1000 0.0594
5 1000 1000 1000 0.0596
Summary:
Sizes m n k | Avg. Time (s)
1000 1000 1000 0.05955171
There is very little difference between the column-major and row-major timings: an average of 0.0595 seconds for column-major versus 0.0599 seconds for row-major. Executing this again might produce the following, where the row-major calculation is faster by about 0.00003 seconds.
$ ./benchmark_dgemm 1000 1000 1000 5
*****ROW-MAJOR ORDERING*****
#Loop | Sizes m n k | Time (s)
1 1000 1000 1000 0.0674
2 1000 1000 1000 0.0598
3 1000 1000 1000 0.0595
4 1000 1000 1000 0.0587
5 1000 1000 1000 0.0584
Summary:
Sizes m n k | Avg. Time (s)
1000 1000 1000 0.06075310
*****COLUMN-MAJOR ORDERING*****
#Loop | Sizes m n k | Time (s)
1 1000 1000 1000 0.0634
2 1000 1000 1000 0.0596
3 1000 1000 1000 0.0582
4 1000 1000 1000 0.0582
5 1000 1000 1000 0.0645
Summary:
Sizes m n k | Avg. Time (s)
1000 1000 1000 0.06078266