Why does Apple's Metal matrix multiplication example need to pad C?

I'm learning Apple's Metal in order to do some GPU computation.

I checked the matrix multiplication example provided by Apple, and there's one point I cannot understand.

In the file MetalMatrixMult.h:

// Number of rows in matrices A and C.
@property (nonatomic) uint16_t m;

// Number of columns in matrix A; number of rows in matrix B.
@property (nonatomic) uint16_t n;

// Number of columns in matrices B and C.
@property (nonatomic) uint16_t k;

// Output matrix (padded) C row count
@property (nonatomic, readonly) uint16_t M;

// Output matrix (padded) C column count
@property (nonatomic, readonly) uint16_t K;

// Output matrix C = A x B
@property (nonatomic, readonly) float* output;

It says that matrix C is padded. I'm not clear on what "padded" means here. Is it some kind of alignment? I know there are type alignment rules in Metal's shading language specification, but I don't know why we need to pad a buffer here.

Thanks.

asked Mar 28 '17 18:03

Crt Tax

1 Answer

It has to do with optimizing memory access. GPU work is split across a number of threadgroups, each of which has a relatively small amount of dedicated memory (a few KB) that can be accessed very quickly. This is separate from the GPU's main memory, which might be a few GB of comparatively slow memory.

Since it's unlikely that all 3 matrices (A, B and C) can fit into a single threadgroup's memory, and falling back to main memory inside loops would be extremely slow, we divide the computation into "blocks" or sectors. Imagine dividing the result matrix C into a grid, where each sector is a collection of 8 x 8 elements: we can then instruct Threadgroup 1 to compute the result for the top-left sector while other threadgroups compute the other sectors simultaneously. In this case, Threadgroup 1 only needs the first 8 rows of A and the first 8 columns of B to compute its portion of C. This means we can send a much smaller amount of data to Threadgroup 1, keeping it well within the limit of its fast memory.
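To make the sector idea concrete, here is a minimal CPU-side sketch in plain C of what one threadgroup's share of the work looks like. It is not Apple's kernel: the function name `compute_sector`, the `BLOCK` constant, and the row-major layout are all assumptions made for illustration, and it assumes the dimensions of C are already whole multiples of the sector size.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 8 /* sector edge length, taken from the 8 x 8 example above */

/* Hypothetical helper: computes one BLOCK x BLOCK sector of C = A x B.
 * A is m x n, B is n x k, C is m x k, all stored row-major.
 * Assumes m and k are multiples of BLOCK, so no bounds checks are needed. */
void compute_sector(const float *A, const float *B, float *C,
                    size_t n, size_t k,
                    size_t block_row, size_t block_col)
{
    assert(A != NULL && B != NULL && C != NULL);

    size_t row0 = block_row * BLOCK; /* first row of C in this sector */
    size_t col0 = block_col * BLOCK; /* first column of C in this sector */

    for (size_t i = 0; i < BLOCK; ++i) {
        for (size_t j = 0; j < BLOCK; ++j) {
            float sum = 0.0f;
            /* Only rows row0..row0+7 of A and columns col0..col0+7 of B
             * are ever read here. */
            for (size_t p = 0; p < n; ++p) {
                sum += A[(row0 + i) * n + p] * B[p * k + (col0 + j)];
            }
            C[(row0 + i) * k + (col0 + j)] = sum;
        }
    }
}
```

Calling `compute_sector(A, B, C, n, k, 0, 0)` fills the top-left sector; on a GPU, each threadgroup would handle a different `(block_row, block_col)` pair in parallel.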

The reason the example requires us to pad the matrices is so that it can divide C into a perfect grid. If your true result matrix is 12 x 18, and the sector size is 8 x 8, that means C is 1.5 x 2.25 sectors. The GPU can't efficiently operate on partial sectors, so you must pad your matrices with zeros to reach whole numbers: in this case 2 x 3 sectors, or 16 x 24 elements. You sacrifice a little bit of storage and a few more clock cycles for highly optimized parallel processing.
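The rounding in that example can be sketched in a few lines of C. This is only an illustration under assumptions: `round_up_to_multiple` and `pad_matrix` are hypothetical names, not part of Apple's sample, and the matrix is assumed to be stored row-major.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: rounds value up to the next multiple of block. */
uint32_t round_up_to_multiple(uint32_t value, uint32_t block)
{
    assert(block > 0);
    return ((value + block - 1) / block) * block;
}

/* Hypothetical helper: copies a rows x cols row-major matrix into a newly
 * allocated, zero-filled buffer whose dimensions are multiples of block.
 * Returns NULL if allocation fails; the caller frees the result. */
float *pad_matrix(const float *src, uint32_t rows, uint32_t cols,
                  uint32_t block,
                  uint32_t *padded_rows, uint32_t *padded_cols)
{
    uint32_t pr = round_up_to_multiple(rows, block);
    uint32_t pc = round_up_to_multiple(cols, block);

    /* calloc zero-fills the buffer; all-zero bytes are 0.0f on IEEE 754. */
    float *dst = calloc((size_t)pr * pc, sizeof *dst);
    if (dst == NULL) {
        return NULL;
    }

    /* Copy each source row to the start of the wider padded row. */
    for (uint32_t r = 0; r < rows; ++r) {
        memcpy(dst + (size_t)r * pc, src + (size_t)r * cols,
               (size_t)cols * sizeof *dst);
    }

    *padded_rows = pr;
    *padded_cols = pc;
    return dst;
}
```

With a sector size of 8, `round_up_to_multiple(12, 8)` gives 16 and `round_up_to_multiple(18, 8)` gives 24, matching the 16 x 24 padded size above. This is presumably the relationship between the `m`, `k` and the read-only `M`, `K` properties in the header from the question.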

answered Sep 22 '22 08:09

Hundley

Tags: ios, objective-c, gpu, metal
