Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why Apple's Metal Matrix Multiplication Example needs padding C

I'm learning Apple's Metal trying to do some GPU computation.

I checked the matrix multiplication example given by Apple. There's a point I cannot understand.

In the file MetalMatrixMult.h

// Number of rows in matrices A and C.
@property (nonatomic) uint16_t m;

// Number of columns in matrix A; number of rows in matrix B.
@property (nonatomic) uint16_t n;

// Number of columns in matrices B and C.
@property (nonatomic) uint16_t k;

// Output matrix (padded) C row count
@property (nonatomic, readonly) uint16_t M;

// Output matrix (padded) C column count
@property (nonatomic, readonly) uint16_t K;

// Output matrix C = A x B
@property (nonatomic, readonly) float* output;

It says the Matrix C is padded. I'm not clear what pad means here. Is it some kind of alignment? Cause I know there are types alignment in Metal's shader language specification, but I don't know why we need to pad a buffer herer.

Thanks.

like image 206
Crt Tax Avatar asked Mar 28 '17 18:03

Crt Tax


People also ask

Why are GPU good at matrix multiplication?

In your case of matrix multiplication. You can parallelize the computations, Because GPU have much more threads and in each thread you have multiple blocks. So a lot of computations are parallelized, resulting quick computations.

What is the logic behind matrix multiplication?

For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix. The result matrix has the number of rows of the first and the number of columns of the second matrix.

How many loops are required to perform matrix multiplication operations?

This requires three nested loops. The outer loop traverses the m rows of A. For each row i, another loop must cycle through the n columns of B. For each column, form the sum of the products of corresponding elements from row i of A and column j of B.

What will be the syntax for matrix multiplication?

A = [1 1 0 0]; B = [1; 2; 3; 4]; Multiply A times B . The result is a 1-by-1 scalar, also called the dot product or inner product of the vectors A and B . Alternatively, you can calculate the dot product A ⋅ B with the syntax dot(A,B) .


1 Answers

It has to do with optimizing memory access. Your GPU has a number of threadgroups, each containing a relatively small amount of dedicated memory (a few KB) that can be accessed very quickly. This is separate from your GPU's main memory, which might be a few GBs of comparatively slow memory.

Since it's unlikely that all 3 matrices (A, B and C) can fit into a single threadgroup's memory, and falling back to main memory inside loops would be extremely slow, we divide the computation into "blocks" or sectors. Imagine dividing the result matrix C into a grid, where each sector is a collection of 8 x 8 elements: we can then instruct Threadgroup 1 to compute the result for the top-left sector while other threadgroups compute the other sectors simulataneously. In this case, Threadgroup 1 only needs the first 8 rows of A and the first 8 columns of B to compute its portion of C. This means we can send a much smaller amount of data to Threadgroup 1, keeping it well within the cache limit.

The reason Metal requires us to pad the matrices is so that it can divide C into a perfect grid. If your true result matrix is 12 x 18, and the sector size is 8 x 8, that means C is 1.5 x 2.25 sectors. The GPU can't efficiently operate on partial sectors, so you must pad your matrices with zeros to reach whole numbers - in this case 2 x 3 sectors or 16 x 24 elements. You sacrifice a little bit of storage and a few more clock cycles for highly optimized parallel processing.

like image 154
Hundley Avatar answered Sep 22 '22 08:09

Hundley