I'm trying to create a procedural game using Metal, and I'm using an octree-based chunk approach for a level-of-detail (LOD) implementation.
The CPU creates the octree nodes for the terrain, and each node's mesh is then generated on the GPU by a compute shader. The resulting mesh is stored in a vertex buffer and an index buffer held by the chunk object for rendering.
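For reference, each octree node just owns the Metal buffers its mesh lives in. This is only a simplified sketch of the relevant parts (the per-vertex size and the initializer are illustrative, not my exact layout):

class OctreeNode {
    var children = [OctreeNode?](count: 8, repeatedValue: nil)   // octree children, nil until subdivided
    let vertexBuffer: MTLBuffer   // written by the mesh-generation compute shader
    let indexBuffer: MTLBuffer    // consumed by drawIndexedPrimitives

    init(device: MTLDevice, maxVertices: Int, maxIndices: Int) {
        // Placeholder sizes: 8 floats per vertex, 32-bit indices
        vertexBuffer = device.newBufferWithLength(maxVertices * sizeof(Float) * 8, options: [])
        indexBuffer = device.newBufferWithLength(maxIndices * sizeof(UInt32), options: [])
    }
}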
All of this seems to work fairly well; however, when it comes to rendering the chunks I'm hitting performance issues early on. Currently I gather an array of chunks to draw and submit it to my renderer, which creates an MTLParallelRenderCommandEncoder and then an MTLRenderCommandEncoder for each chunk, which is then submitted to the GPU.
By the looks of it, around 50% of the CPU time is spent creating the MTLRenderCommandEncoder for each chunk. Currently I'm just creating a simple 8-vertex cube mesh for each chunk, and with a 4x4x4 array of chunks I'm already dropping to around 50 fps at this early stage. (In reality it seems there can only be up to 63 MTLRenderCommandEncoders in each MTLParallelRenderCommandEncoder, so it's not quite the full 4x4x4.)
I've read that the point of the MTLParallelRenderCommandEncoder is to create each MTLRenderCommandEncoder in a separate thread, yet I've not had much luck getting this to work. Multithreading also wouldn't get around the cap of 63 chunks being rendered at most.
I feel that somehow consolidating the vertex and index buffers for each chunk into one or two larger buffers for submission would help, but I'm not sure how to do this without copious memcpy() calls, or whether it would even improve efficiency.
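Roughly what I have in mind is giving every chunk a fixed-size slot in one shared vertex buffer and one shared index buffer, so the mesh compute shader writes straight into its slot and drawing only needs offsets, with no memcpy() at all. This is an untested sketch; AHVertex, AHMaxVerticesPerChunk, maxChunks and chunkIndex are placeholder names, and device is my MTLDevice:

// Untested sketch: every chunk owns a fixed-size slot in two shared buffers,
// addressed purely by offset. Private storage since only the GPU touches the data.
let vertexSlotSize = AHMaxVerticesPerChunk * sizeof(AHVertex)
let indexSlotSize  = AHMaxIndicesPerChunk * sizeof(UInt32)       // assuming 32-bit indices
let maxChunks      = 64

let sharedVertexBuffer = device.newBufferWithLength(vertexSlotSize * maxChunks, options: .StorageModePrivate)
let sharedIndexBuffer  = device.newBufferWithLength(indexSlotSize * maxChunks, options: .StorageModePrivate)

// The compute shader would write each chunk's mesh into its own slot, and the
// draw loop below would then address chunk i purely by offset
// (offsets may need to respect the device's buffer-offset alignment rules):
renderPass.setVertexBuffer(sharedVertexBuffer, offset: chunkIndex * vertexSlotSize, atIndex: 0)
renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: .UInt32,
                                 indexBuffer: sharedIndexBuffer, indexBufferOffset: chunkIndex * indexSlotSize)

I'm unsure whether this actually saves much if I still issue one draw call per chunk, which is part of what I'm asking.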
Here's my code that takes in the array of nodes and draws them:
func drawNodes(nodes: [OctreeNode], inView view: AHMetalView) {
    // For control of several rotating buffers
    dispatch_semaphore_wait(displaySemaphore, DISPATCH_TIME_FOREVER)

    makeDepthTexture()
    updateUniformsForView(view, duration: view.frameDuration)

    let commandBuffer = commandQueue.commandBuffer()

    let optDrawable = layer.nextDrawable()
    guard let drawable = optDrawable else {
        return
    }

    let passDescriptor = MTLRenderPassDescriptor()
    passDescriptor.colorAttachments[0].texture = drawable.texture
    passDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.2, 0.2, 0.2, 1)
    passDescriptor.colorAttachments[0].storeAction = .Store
    passDescriptor.colorAttachments[0].loadAction = .Clear
    passDescriptor.depthAttachment.texture = depthTexture
    passDescriptor.depthAttachment.clearDepth = 1
    passDescriptor.depthAttachment.loadAction = .Clear
    passDescriptor.depthAttachment.storeAction = .Store

    let parallelRenderPass = commandBuffer.parallelRenderCommandEncoderWithDescriptor(passDescriptor)

    // Currently 63 nodes as a maximum
    for node in nodes {
        // This line is taking up around 50% of the CPU time
        let renderPass = parallelRenderPass.renderCommandEncoder()

        renderPass.setRenderPipelineState(renderPipelineState)
        renderPass.setDepthStencilState(depthStencilState)
        renderPass.setFrontFacingWinding(.CounterClockwise)
        renderPass.setCullMode(.Back)

        let uniformBufferOffset = sizeof(AHUniforms) * uniformBufferIndex
        renderPass.setVertexBuffer(node.vertexBuffer, offset: 0, atIndex: 0)
        renderPass.setVertexBuffer(uniformBuffer, offset: uniformBufferOffset, atIndex: 1)
        renderPass.setTriangleFillMode(.Lines)

        renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: AHIndexType, indexBuffer: node.indexBuffer, indexBufferOffset: 0)

        renderPass.endEncoding()
    }

    parallelRenderPass.endEncoding()

    commandBuffer.presentDrawable(drawable)
    commandBuffer.addCompletedHandler { (commandBuffer) -> Void in
        self.uniformBufferIndex = (self.uniformBufferIndex + 1) % AHInFlightBufferCount
        dispatch_semaphore_signal(self.displaySemaphore)
    }
    commandBuffer.commit()
}
You note:

"I've read that the point of the MTLParallelRenderCommandEncoder is to create each MTLRenderCommandEncoder in a separate thread..."
And you're correct. What you're doing is sequentially creating, encoding with, and ending command encoders — there's nothing parallel going on here, so MTLParallelRenderCommandEncoder is doing nothing for you. You'd have roughly the same performance if you eliminated the parallel encoder and just created encoders with renderCommandEncoderWithDescriptor(_:) on each pass through your for loop... which is to say, you'd still have the same performance problem due to the overhead of creating all those encoders.
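For completeness, genuinely parallel use would look roughly like the untested sketch below: the subordinate encoders are created up front, which fixes the order the GPU executes them in, and then each one is encoded and ended on its own thread (dispatch_apply is just one way to fan the work out). Note that this still creates one encoder per chunk, so it wouldn't address the creation overhead you're seeing.

// Untested sketch of actually-parallel encoding with MTLParallelRenderCommandEncoder.
let parallelRenderPass = commandBuffer.parallelRenderCommandEncoderWithDescriptor(passDescriptor)

// Subordinate encoders must be created in the desired execution order...
let encoders = nodes.map { _ in parallelRenderPass.renderCommandEncoder() }

// ...but each one can then be encoded and ended on its own thread.
dispatch_apply(nodes.count, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0)) { i in
    let renderPass = encoders[i]
    renderPass.setRenderPipelineState(self.renderPipelineState)
    renderPass.setDepthStencilState(self.depthStencilState)
    // ... remaining state and the draw call for nodes[i], exactly as in your loop ...
    renderPass.endEncoding()
}

parallelRenderPass.endEncoding()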
So, if you're going to encode sequentially, just reuse the same encoder. Also, you should reuse as much of your other shared state as possible. Here's a quick pass at a possible refactoring (untested):
let passDescriptor = MTLRenderPassDescriptor()

// call this once before your render loop
func setup() {
    makeDepthTexture()
    passDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.2, 0.2, 0.2, 1)
    passDescriptor.colorAttachments[0].storeAction = .Store
    passDescriptor.colorAttachments[0].loadAction = .Clear
    passDescriptor.depthAttachment.texture = depthTexture
    passDescriptor.depthAttachment.clearDepth = 1
    passDescriptor.depthAttachment.loadAction = .Clear
    passDescriptor.depthAttachment.storeAction = .Store
    // set up render pipeline state and depthStencil state
}

func drawNodes(nodes: [OctreeNode], inView view: AHMetalView) {
    updateUniformsForView(view, duration: view.frameDuration)

    // Set up completed handler ahead of time
    let commandBuffer = commandQueue.commandBuffer()
    commandBuffer.addCompletedHandler { _ in // unused parameter
        self.uniformBufferIndex = (self.uniformBufferIndex + 1) % AHInFlightBufferCount
        dispatch_semaphore_signal(self.displaySemaphore)
    }

    // Semaphore should be tied to drawable acquisition
    dispatch_semaphore_wait(displaySemaphore, DISPATCH_TIME_FOREVER)
    guard let drawable = layer.nextDrawable() else {
        // Signal on the early-out too, so a failed drawable doesn't leak a semaphore count
        dispatch_semaphore_signal(displaySemaphore)
        return
    }
    // Set up the one part of the pass descriptor that changes per-frame
    passDescriptor.colorAttachments[0].texture = drawable.texture

    // Get one render command encoder and reuse it for every node
    let renderPass = commandBuffer.renderCommandEncoderWithDescriptor(passDescriptor)
    renderPass.setTriangleFillMode(.Lines)
    renderPass.setRenderPipelineState(renderPipelineState)
    renderPass.setDepthStencilState(depthStencilState)

    for node in nodes {
        // Update offsets and draw
        let uniformBufferOffset = sizeof(AHUniforms) * uniformBufferIndex
        renderPass.setVertexBuffer(node.vertexBuffer, offset: 0, atIndex: 0)
        renderPass.setVertexBuffer(uniformBuffer, offset: uniformBufferOffset, atIndex: 1)
        renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: AHIndexType, indexBuffer: node.indexBuffer, indexBufferOffset: 0)
    }
    renderPass.endEncoding()

    commandBuffer.presentDrawable(drawable)
    commandBuffer.commit()
}
Then, profile with Instruments to see what, if any, further performance issues remain. There's a great WWDC 2015 session on Metal profiling that shows several of the common "gotchas", how to diagnose them in Instruments, and how to fix them.