
Runtime of standard matrix multiplication algorithm slower than expected when scaled up (2^10+ x 2^10+ elements)

I have a standard matrix multiplication algorithm.

// matrix.h

typedef struct Matrix {
    int rows;
    int columns;
    double *elements;
} Matrix;

double *addressAt(Matrix *matrix, int row, int column);
double *rowAt(Matrix *matrix, int row);
double dotProductColumnArray(Matrix *matrix, const int column, double *array);
void mmProduct(Matrix *first, Matrix *second, Matrix *result);

// matrix.c

// Helper function to find address of elements at row and column.
inline double *addressAt(Matrix *matrix, int row, int column) {
    return &(matrix->elements[row * matrix->columns + column]);
}

// Helper function to find row array
inline double *rowAt(Matrix *matrix, int row) {
    return &(matrix->elements[row * matrix->columns]);
}

// Finds the dot product of a column of a matrix and an array.
double dotProductColumnArray(Matrix *matrix, const int column, double *array) {
    double sum = 0.0;
    const int rows = matrix->rows, columns = matrix->columns;
    int j = column;
    const double *elements = matrix->elements;
    for (int i = 0; i < rows; i++) {
        sum += array[i] * elements[j];
        j += columns;
    }
    return sum;
}

void mmProduct(Matrix *first, Matrix *second, Matrix *result) {
    const int rows = result->rows;
    const int columns = result->columns;
    for (int i = 0; i < rows; i++) {
        double *row = rowAt(first, i);
        for (int j = 0; j < columns; j++) {
            *addressAt(result, i, j) = dotProductColumnArray(second, j, row);
        }
    }
}

// Snippet of main.c

// Fills the matrix with random numbers between -1 and 1
void randomFill(Matrix *matrix);

int main() {
    struct timeval timestamp;
    long start, now;
    Matrix first, second, result;
    
    // 2^10
    first = createMatrix((int)pow(2, 10), (int)pow(2, 10));
    second = createMatrix((int)pow(2, 10), (int)pow(2, 10));
    randomFill(&first);
    randomFill(&second);
    result = createMatrix((int)pow(2, 10), (int)pow(2, 10));

    gettimeofday(&timestamp, NULL);
    start = timestamp.tv_sec * 1000 + timestamp.tv_usec / 1000;
    mmProduct(&first, &second, &result);
    gettimeofday(&timestamp, NULL);
    now = timestamp.tv_sec * 1000 + timestamp.tv_usec / 1000;
    printf("Time taken: %ldms\n", now - start);
    
    deleteMatrix(&first);
    deleteMatrix(&second);
    deleteMatrix(&result);
    
    // Same code as above but for 2^11, 2^12, replacing pow() when necessary.
    // ...
    // ...
}

Full code here to run it yourself: https://gist.github.com/kevinMEH/c37bda728bf5ee53c32a405ef611e5a7


When I run the algorithm with two 2^10 x 2^10 matrices, it takes about 2 seconds to multiply.

When I run the algorithm with two 2^11 x 2^11 matrices, I expect it to take about 8 * 2 seconds ~ 16 seconds. However, it takes 36 seconds instead.

When I run the algorithm with two 2^12 x 2^12 matrices, I expect it to take about 64 * 2000 milliseconds ~ 128 seconds, or in the worst case 8 * 36 = 288 seconds. However, it takes 391 seconds.

(Compiled with Apple Clang 14.0.0 with -O1; with -Ofast and with no optimizations it scales similarly, still by more than a factor of 8. Ran on an Apple M1 chip.)

I'm perplexed as to why this is. The number of addition and multiplication operations performed scales exactly as n^3, so when I increase the width of the matrices by a factor of 2, I expect the multiplication to take only 2^3 = 8 times longer each time.

Suspects:

The addressAt and rowAt helper functions are contributing to the slowdown: this does not make sense; the number of additions and multiplications scales as n^3, but calls to addressAt only scale as n^2 and calls to rowAt as n, so the relative overhead should actually shrink as n grows (8 * (n^3 + n^2 + n) > 8n^3 + 4n^2 + 2n). I've also manually inlined the functions and found no noticeable speed increase, so it should not be function call overhead.

Subnormal numbers: My random number generator gets a random u64 and divides it by 2 * 0x8..0u, yielding a number between 0 and 1, so there should be no subnormal numbers. Nonetheless, I multiplied everything in the randomFill function by 100000 and got the same result.
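As a sanity check on the subnormal claim, here is a hypothetical sketch of the mapping described (the actual randomFill is not shown in the snippet; the divisor is assumed to work out to 2^64):

```c
#include <stdint.h>

/* Hypothetical sketch of the generator described above: a raw
 * uint64_t divided by 2^64 lands in [0, 1], and 2*x - 1 shifts it
 * to [-1, 1]. The smallest nonzero magnitudes this can produce are
 * around 1e-19, far above the double subnormal threshold of
 * roughly 2.2e-308, so no subnormals should appear. */
double u64ToUnit(uint64_t r) {
    return (double)r / 18446744073709551616.0; /* 2^64 */
}

double u64ToSigned(uint64_t r) {
    return 2.0 * u64ToUnit(r) - 1.0;
}
```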

dotProductColumnArray(): I suspect this function is the culprit, but I cannot find a reason why. My best guess is that the slowdown comes from the CPU prefetching the wrong elements: since j is incremented by the number of columns, the prefetcher might fetch elements[j], elements[j + 1], and so on, but not elements[j + columns]. That does not quite make sense to me either, because columns is constant, so the prefetcher should be able to detect the constant stride.

Side note:

I also wrote mmStrassen(), which implements the Strassen matrix multiplication algorithm, and when I tested it, it took the expected ~7 times longer at each size increase. It was tested in the same run, right after the regular mmProduct call, so the discrepancy is not an artifact of the test setup.


Any optimization recommendations are also welcome.

Thanks, and apologies in advance for the long question.

kmeh asked Sep 03 '25 16:09

2 Answers

Candidate improvement: use restrict, const.

restrict allows the compiler to assume that accesses through matrix do not alias array[] (i.e., the two do not overlap) and to optimize accordingly. Without restrict, the compiler must assume that writing through one pointer may change data read through the other, which forces it to process everything one element at a time instead of exploiting parallelism.

const might have some marginal benefit too.

// double dotProductColumnArray(
//     Matrix* matrix, const int column, double* array) {
double dotProductColumnArray(
    Matrix* restrict matrix, const int column, const double* restrict array) {

Likewise

// void mmProduct(Matrix* first, 
//     Matrix* second, Matrix* result) {
void mmProduct(const Matrix* restrict first, 
    const Matrix* restrict second, Matrix* restrict result) {
chux - Reinstate Monica answered Sep 05 '25 13:09


The matrix dimensions are exact powers of 2, which causes catastrophic cache behavior: consecutive elements of a column are columns * 8 bytes apart (16 KiB when columns = 2^11), so successive column reads keep mapping to the same cache sets and evict each other almost every time.

You should achieve much faster execution and the expected scaling with a simple alternative: transpose the second matrix and implement the multiplication so that each coefficient of the destination matrix is the dot product of two matrix rows.

Here is a simplistic implementation:

void mmProduct2(Matrix *first, Matrix *second, Matrix *result) {
    Matrix t = createMatrix(second->columns, second->rows);
    transpose(second, &t);
    int resultIndex = 0;
    for(int i = 0; i < result->rows; i++) {
        double *row1 = rowAt(first, i);
        for(int j = 0; j < result->columns; j++) {
            double *row2 = rowAt(&t, j);
            double sum = 0.0;
            for (int k = 0; k < first->columns; k++) {
                sum += row1[k] * row2[k];
            }
            result->elements[resultIndex++] = sum;
        }
    }
    deleteMatrix(&t);
}
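The transpose() helper is used above but not defined; a minimal sketch consistent with the Matrix layout from the question (the struct is repeated so the snippet stands alone), assuming the destination was created with swapped dimensions:

```c
/* Matrix struct repeated from the question for a self-contained
 * example. */
typedef struct Matrix {
    int rows;
    int columns;
    double *elements;
} Matrix;

/* Minimal transpose sketch: dst must satisfy
 * dst->rows == src->columns and dst->columns == src->rows. */
void transpose(const Matrix *src, Matrix *dst) {
    for (int i = 0; i < src->rows; i++) {
        for (int j = 0; j < src->columns; j++) {
            /* element (i, j) of src becomes element (j, i) of dst */
            dst->elements[j * dst->columns + i] =
                src->elements[i * src->columns + j];
        }
    }
}
```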

Running your tests on my M2 MacBook, I first got 0 ms timings for all 3 multiplications: with -O3, clang detected that none of the computations' results were used, so everything could be optimized out.

Adding a signature function and displaying its result forced the code generation:

double hashMatrix(const Matrix *matrix) {
    double sum = 0;
    int index = 0;
    for(int i = 0; i < matrix->rows; i++) {
        double sumrow = 0.0;
        for(int j = 0; j < matrix->columns; j++) {
            sumrow += matrix->elements[index++] * (j + 1);
        }
        sum += sumrow * (i + 1);
    }
    return sum;
}

The timings were consistent with yours:

Time taken: 1031ms, hash=-2.18033e+08
Time taken: 28893ms, hash=-5.78916e+09
Time taken: 245019ms, hash=5.04211e+11

Changing from mmProduct to mmProduct2 produces timings consistent with cubic time complexity (using the same random data):

Time taken: 859ms, hash=-2.18033e+08
Time taken: 7080ms, hash=-5.78916e+09
Time taken: 57914ms, hash=5.04211e+11
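As a side note on the same cache issue: another common mitigation, sketched here under assumed names (PaddedMatrix, createPadded, and paddedAt are illustrative, not from either answer), is to pad the row stride so that walking down a column no longer advances by an exact power-of-two number of bytes:

```c
#include <stdlib.h>

/* Illustrative padded-layout sketch: each row is stored with a
 * stride slightly larger than its logical width, so consecutive
 * column elements land in different cache sets. The pad of 8
 * doubles is an arbitrary example value, not a tuned one. */
typedef struct PaddedMatrix {
    int rows;
    int columns;
    int stride;       /* columns + padding, in elements */
    double *elements;
} PaddedMatrix;

PaddedMatrix createPadded(int rows, int columns) {
    PaddedMatrix m = { rows, columns, columns + 8, NULL };
    m.elements = calloc((size_t)rows * (size_t)m.stride, sizeof(double));
    return m;
}

double *paddedAt(PaddedMatrix *m, int row, int column) {
    return &m->elements[row * m->stride + column];
}
```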
chqrlie answered Sep 05 '25 13:09