 

How to Speed Up Metal Code for iOS/Mac OS

Tags:

macos

ios

metal

I'm trying to implement code in Metal that performs a 1D convolution between two vectors of given lengths. I've implemented the following, which works correctly:

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const device int& dataSize [[ buffer(1) ]],
                     const device float *filterVector [[ buffer(2) ]],
                     const device int& filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]]) {
    int outputSize = dataSize - filterSize + 1;
    for (int i=0;i<outputSize;i++) {
        float sum = 0.0;
        for (int j=0;j<filterSize;j++) {
            sum += dataVector[i+j] * filterVector[j];
        }
        outVector[i] = sum;
    }
}

My problem is that processing the same data with Metal (computation plus data transfer to and from the GPU) takes about 10 times longer than doing it in Swift on the CPU. My question is: how do I replace the inner loop with a single vector operation, or is there another way to speed up the above code?
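
For reference, a vectorized CPU version of the same sliding dot product can be written with Accelerate's vDSP_conv (a sketch for illustration, not the exact code I benchmarked; with a filter stride of +1, vDSP_conv computes correlation, which is the same operation my kernel performs):

import Accelerate

// Sketch: vectorized CPU equivalent of the kernel above.
// With a filter stride of +1, vDSP_conv computes the sliding dot product
// (correlation) rather than a flipped-filter convolution.
func convolveCPU(data: [Float], filter: [Float]) -> [Float] {
    let outputCount = data.count - filter.count + 1
    var output = [Float](repeating: 0, count: outputCount)
    vDSP_conv(data, 1,                    // input vector, stride
              filter, 1,                  // filter vector, stride
              &output, 1,                 // output vector, stride
              vDSP_Length(outputCount),   // number of output elements
              vDSP_Length(filter.count))  // filter length
    return output
}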

asked Aug 19 '16 by Epsilon


1 Answer

The key to taking advantage of the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we'll invoke it for each element in the data vector. The kernel function simplifies to this:

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const constant int &dataSize [[ buffer(1) ]],
                     const constant float *filterVector [[ buffer(2) ]],
                     const constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    // Each thread computes exactly one output element: the dot product of
    // the filter with the window of the data vector starting at its id.
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}

In order to dispatch this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is making sure that there's enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cause us to waste a small amount of memory and computation, but saves us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.

// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).

let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)

// The kernel declares its size parameters as (32-bit) int, so pass Int32 rather than Int.
var dataSize = Int32(dataCount)
var filterSize = Int32(filterCount)

let commandEncoder = commandBuffer.makeComputeCommandEncoder()!
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, index: 0)
commandEncoder.setBytes(&dataSize, length: MemoryLayout<Int32>.stride, index: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, index: 2)
commandEncoder.setBytes(&filterSize, length: MemoryLayout<Int32>.stride, index: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, index: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
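
For completeness, allocating the padded buffers that the comment above describes might look like the following sketch (device and data are assumed from the surrounding code, and the exact rounding arithmetic is my own choice, sized so the highest-numbered thread in the last threadgroup still reads and writes valid memory):

// Sketch: padded buffer allocation. A thread with a given id reads
// dataVector[id ... id + filterCount - 1] and writes outVector[id], so we
// size both buffers to cover the last thread of the last threadgroup.
let w = computePipeline.threadExecutionWidth
let iterationCount = dataCount - filterCount + 1
let roundedIterationCount = ((iterationCount + w - 1) / w) * w

let dataBuffer = device.makeBuffer(length: (roundedIterationCount + filterCount - 1) * MemoryLayout<Float>.stride,
                                   options: .storageModeShared)!
let outBuffer = device.makeBuffer(length: roundedIterationCount * MemoryLayout<Float>.stride,
                                  options: .storageModeShared)!

// Copy the real samples to the front of the buffer; the padding can stay zero.
dataBuffer.contents().copyMemory(from: data, byteCount: dataCount * MemoryLayout<Float>.stride)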

In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'm curious to hear how it compares to your CPU implementation.
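
If you want to isolate the GPU execution time from encoding and transfer overhead, the command buffer's GPU timestamps are one option (a sketch; gpuStartTime and gpuEndTime require a reasonably recent OS version):

// Sketch: measuring GPU-side execution time of the dispatch above.
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let seconds = commandBuffer.gpuEndTime - commandBuffer.gpuStartTime
print("GPU execution time: \(seconds * 1000) ms")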

answered Nov 13 '22 by warrenm