I'm allocating both the input and output MTLBuffers using posix_memalign, according to the shared GPU/CPU documentation provided by memkite.

Aside: it is easier to just use the latest API than to muck around with posix_memalign:
let metalBuffer = self.metalDevice.newBufferWithLength(byteCount, options: .StorageModeShared)
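For reference, a minimal sketch of both allocation paths in current Swift syntax (the snippet above uses the older Swift 2 API names; the sizes and variable names here are illustrative only):

import Foundation
import Metal

let metalDevice = MTLCreateSystemDefaultDevice()!

// Example size: 16 million complex values, two floats each (illustrative only).
let byteCount = 16_000_000 * MemoryLayout<Float>.stride * 2

// Option 1: let Metal allocate the shared buffer.
let sharedBuffer = metalDevice.makeBuffer(length: byteCount, options: .storageModeShared)

// Option 2: allocate page-aligned memory yourself and wrap it without copying.
var memory: UnsafeMutableRawPointer?
let pageSize = 0x1000
let allocationSize = (byteCount + pageSize - 1) & ~(pageSize - 1)
posix_memalign(&memory, pageSize, allocationSize)
let wrappedBuffer = metalDevice.makeBuffer(bytesNoCopy: memory!,
                                           length: allocationSize,
                                           options: .storageModeShared,
                                           deallocator: { pointer, _ in free(pointer) })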
My kernel function operates on roughly 16 million complex value structs and writes out an equal number of complex value structs to memory.
I've run some experiments: the 'complex math' section of my Metal kernel executes in 0.003 seconds (yes!), but writing the result to the buffer takes more than 0.05 seconds (no!). If I comment out the math part and just assign zero to memory, it still takes 0.05 seconds; if I comment out the assignment and add the math back, it's 0.003 seconds.
Is the shared memory slow in this case, or is there some other tip or trick I might try?
Each update to the shader receives approximately 50,000 complex numbers in the form of a pair of float types in a struct.
struct ComplexNumber {
    float real;
    float imaginary;
};
kernel void processChannelData(const device Parameters *parameters [[ buffer(0) ]],
                               const device ComplexNumber *inputSampleData [[ buffer(1) ]],
                               const device ComplexNumber *partAs [[ buffer(2) ]],
                               const device float *partBs [[ buffer(3) ]],
                               const device int *lookups [[ buffer(4) ]],
                               device float *outputImageData [[ buffer(5) ]],
                               uint threadIdentifier [[ thread_position_in_grid ]]);
All the buffers currently contain unchanging data except inputSampleData, which receives the 50,000 samples I'll be operating on. The other buffers contain roughly 16 million values (128 channels x 130,000 pixels) each. I perform some operations on each 'pixel', sum the complex result across channels, and finally take the absolute value of the complex number and assign the resulting float to outputImageData.
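As a rough CPU-side sketch of what each thread ends up doing per pixel, based only on the description above (the actual per-channel math isn't shown in the question, so the complex multiply below is just a placeholder, and the channel-major indexing is an assumption):

import Foundation

struct ComplexNumber {
    var real: Float
    var imaginary: Float
}

// Placeholder per-pixel reference: combine the input samples with this pixel's
// per-channel coefficients, sum the complex results over all channels, then
// write out the magnitude of that sum.
func referencePixelValue(pixel: Int, channels: Int, pixels: Int,
                         partAs: [ComplexNumber], partBs: [Float], lookups: [Int],
                         inputSampleData: [ComplexNumber]) -> Float {
    var sumReal: Float = 0
    var sumImaginary: Float = 0
    for channel in 0..<channels {
        let tableIndex = channel * pixels + pixel            // assumed channel-major layout
        let sample = inputSampleData[lookups[tableIndex]]
        let a = partAs[tableIndex]
        let b = partBs[tableIndex]
        sumReal      += (a.real * sample.real - a.imaginary * sample.imaginary) * b
        sumImaginary += (a.real * sample.imaginary + a.imaginary * sample.real) * b
    }
    return (sumReal * sumReal + sumImaginary * sumImaginary).squareRoot()
}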
commandEncoder.setComputePipelineState(pipelineState)
commandEncoder.setBuffer(parametersMetalBuffer, offset: 0, atIndex: 0)
commandEncoder.setBuffer(inputSampleDataMetalBuffer, offset: 0, atIndex: 1)
commandEncoder.setBuffer(partAsMetalBuffer, offset: 0, atIndex: 2)
commandEncoder.setBuffer(partBsMetalBuffer, offset: 0, atIndex: 3)
commandEncoder.setBuffer(lookupsMetalBuffer, offset: 0, atIndex: 4)
commandEncoder.setBuffer(outputImageDataMetalBuffer, offset: 0, atIndex: 5)
let threadExecutionWidth = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
let threadGroups = MTLSize(width: self.numberOfPixels / threadsPerThreadgroup.width, height: 1, depth:1)
commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
metalCommandBuffer.commit()
metalCommandBuffer.waitUntilCompleted()
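Note that the integer division in the threadgroup calculation drops any remainder when numberOfPixels isn't an exact multiple of threadExecutionWidth. A common pattern (sketched here with the same names as the snippet above) is to round up and add a bounds check on threadIdentifier inside the kernel:

let threadExecutionWidth = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: 1, depth: 1)
// Round up so every pixel gets a thread even when the count isn't a multiple of the width.
let groupCount = (numberOfPixels + threadExecutionWidth - 1) / threadExecutionWidth
let threadGroups = MTLSize(width: groupCount, height: 1, depth: 1)
commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadsPerThreadgroup)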
I've written an example called Slow and put it up on GitHub. It seems the bottleneck is writing the values into the input buffer. So, I guess the question becomes: how do I avoid that bottleneck?
I wrote up a quick test to compare the performance of various byte copying methods.
I've reduced execution time to roughly 0.02 seconds, which doesn't sound like a lot, but it makes a big difference in the number of frames per second. Currently the biggest improvements are a result of switching to cblas_scopy().
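For reference, a minimal sketch of the cblas_scopy() path when filling a shared MTLBuffer with float samples (the buffer and function names here are illustrative, not taken from the linked project):

import Accelerate
import Metal

// Copy float samples into a shared MTLBuffer with BLAS rather than a Swift loop.
func fill(_ buffer: MTLBuffer, with samples: [Float]) {
    let destination = buffer.contents().assumingMemoryBound(to: Float.self)
    samples.withUnsafeBufferPointer { source in
        // cblas_scopy(n, x, incx, y, incy): strided single-precision copy.
        cblas_scopy(Int32(samples.count), source.baseAddress, 1, destination, 1)
    }
}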
Yes. Because shared memory is essentially RAM, if a GPU has to resort to using the system's RAM for its computations, it'll take a performance hit.
Originally, I was pre-converting the signed 16-bit integers to 32-bit floats, since ultimately that is how they'll be used. This is a case where performance starts forcing you to store the values as 16 bits and cut the data size in half.
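As an illustration of what keeping the data 16-bit can look like on the CPU side (a hypothetical sketch, not code from this answer), the raw samples go into the buffer as Int16 and only get widened to float where they are consumed:

import Metal

// Upload the raw 16-bit samples directly; the buffer is half the size of a Float upload.
// The kernel would then declare the buffer as `device short *` and convert to float on the GPU.
func uploadSamples(_ samples: [Int16], into buffer: MTLBuffer) {
    precondition(buffer.length >= samples.count * MemoryLayout<Int16>.stride)
    samples.withUnsafeBytes { source in
        buffer.contents().copyMemory(from: source.baseAddress!, byteCount: source.count)
    }
}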
For the code dealing with movement of data, you might choose Objective-C over Swift (a Warren Moore recommendation); the performance of Swift in these special situations still isn't up to scratch. You can also try calling out to memcpy or similar methods. I've seen a couple of examples that used for-loop buffer pointers, and in my experiments this performed slowly.
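And a memcpy-based version of the same copy for comparison (again just a sketch with placeholder names):

import Foundation
import Metal

// Straight byte copy from a Swift float array into the shared buffer's contents.
func memcpySamples(_ samples: [Float], into buffer: MTLBuffer) {
    samples.withUnsafeBytes { source in
        memcpy(buffer.contents(), source.baseAddress, source.count)   // count is in bytes here
    }
}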
I really wanted to run some of the copying-method experiments in a playground on the machine, but unfortunately this was useless: the iOS device versions of the same experiments performed completely differently. One might think the relative performance would at least be similar, but I found that to be an invalid assumption too. It would be really convenient if a playground could use the iOS device as the interpreter.
You might get a large speedup by encoding your data as Huffman codes and decoding them on the GPU; see MetalHuffman. It depends on your data, though.