Improve Metal GPU utilization
Building a fast and efficient CPU<->GPU pipeline is not a trivial engineering task. Since the whole point of using a GPU in your computing is speed, you want to find the best way to use it, and the most obvious idea is to parallelize calculations between the CPU and the GPU. In this article, I want to show you a neat trick, available since iOS 13, that will let you optimize your GPU usage.
But before we get to the tips and tricks, let's revisit the standard Metal flow and highlight its bottleneck.
Standard Metal flow
The entry point for interaction with the GPU in Metal is the device object (MTLDevice), which creates a command queue (MTLCommandQueue). The command queue, in turn, creates command buffers (MTLCommandBuffer):
You create one command queue instance and use it across your application:
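A minimal sketch of that setup, assuming a singleton wrapper (the MetalContext name is my own illustration, not part of the Metal API):

```swift
import Metal

// One device and one command queue for the whole app:
// both are long-lived objects that should be created once.
final class MetalContext {
    static let shared = MetalContext()

    let device: MTLDevice
    let commandQueue: MTLCommandQueue

    private init() {
        // MTLCreateSystemDefaultDevice() returns nil where Metal is unsupported.
        guard let device = MTLCreateSystemDefaultDevice(),
              let queue = device.makeCommandQueue() else {
            fatalError("Metal is not supported on this device")
        }
        self.device = device
        self.commandQueue = queue
    }
}
```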
And use that instance wherever you need GPU access, creating command buffers:
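A sketch of such an encode, commit, and wait round trip; the pipeline state, buffer, and thread sizes are assumed to be configured elsewhere:

```swift
import Metal

// Sequential pattern: encode, commit, then block until the GPU finishes.
func runOnGPU(commandQueue: MTLCommandQueue,
              pipeline: MTLComputePipelineState,
              buffer: MTLBuffer,
              threads: MTLSize,
              threadsPerGroup: MTLSize) {
    guard let commandBuffer = commandQueue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else { return }

    // Encoding only records commands; nothing runs on the GPU yet.
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(buffer, offset: 0, index: 0)
    encoder.dispatchThreads(threads, threadsPerThreadgroup: threadsPerGroup)
    encoder.endEncoding()

    // Send the commands to the GPU and block the CPU until they complete.
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}
```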
In this code, the following happens: we encode commands into the buffer (it is very important to understand that at this point there is no real execution yet), then we commit the command buffer (sending the commands to the GPU) and wait for the GPU to complete its work. Thus, mixing CPU and GPU calculations in our code, we get the following picture:
The dotted line marks the commit + waitUntilCompleted calls.
The problem here is the lack of parallelism: the CPU waits for the GPU, and the GPU, in turn, does nothing while the CPU encodes commands.
To avoid idle processors, we will use a well-known approach.
Two-stage pipeline
Also known as double buffering:
The dotted line now marks only the commit call of the command buffer. This call is non-blocking, so the CPU can continue to encode new commands while the GPU executes the current ones. Thus, we have split our single command buffer into several, which lets us use both processors much more efficiently. As I said, this approach is well known, but it has some limitations:
- Your code gets more complex - you need to manage N command buffers instead of 1.
- You cannot use temporary MPS objects (MPSTemporaryImage, etc.), because they are only alive as long as their parent command buffer lives.
The second point is an advanced feature, but a very important one if you are deep into Metal programming, especially with the MPS framework (Metal Performance Shaders).
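A double-buffering loop under these constraints might be sketched like this; the Renderer class, the semaphore count, and the frame terminology are illustrative:

```swift
import Metal

// Classic double buffering: a semaphore caps the number of command buffers
// in flight, so the CPU encodes frame N+1 while the GPU executes frame N.
final class Renderer {
    private let commandQueue: MTLCommandQueue
    // A value of 2 allows two command buffers to be in flight at once.
    private let inFlightSemaphore = DispatchSemaphore(value: 2)

    init(commandQueue: MTLCommandQueue) {
        self.commandQueue = commandQueue
    }

    func encodeFrame(encode: (MTLCommandBuffer) -> Void) {
        // Block only if both in-flight command buffers are still running.
        inFlightSemaphore.wait()
        guard let commandBuffer = commandQueue.makeCommandBuffer() else {
            inFlightSemaphore.signal()
            return
        }
        encode(commandBuffer)

        // Free a slot once the GPU has finished with this buffer.
        let semaphore = inFlightSemaphore
        commandBuffer.addCompletedHandler { _ in
            semaphore.signal()
        }
        // Non-blocking: the CPU moves on to encode the next frame.
        commandBuffer.commit()
    }
}
```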
Question: How to avoid these limitations, but keep the benefits of double buffering?
Answer: Use MPSCommandBuffer.
MPSCommandBuffer
If you’re familiar with the Metal framework, you might know that most entities, like MTLDevice and MTLCommandQueue, are protocols. MTLCommandBuffer is also a protocol, and MPSCommandBuffer conforms to it, adding some extra features. The most useful of them is the commitAndContinue method. MPSCommandBuffer internally recreates the actual command buffer when you call commitAndContinue and ensures that any temporary objects remain valid after the command buffer is recreated.
Nothing changes for you as a developer, except for the additional method: you write the same command encoding code, replacing let commandBuffer = commandQueue.makeCommandBuffer()! with let commandBuffer = MPSCommandBuffer(from: commandQueue). And every time you call commitAndContinue during command encoding, the GPU receives the next batch of commands:
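A sketch of that pattern, assuming a batch of image pairs processed by a Gaussian blur kernel (the function and parameter names are illustrative):

```swift
import Metal
import MetalPerformanceShaders

// Same encoding code as before, but through MPSCommandBuffer.
// commitAndContinue submits everything encoded so far and transparently
// switches to a fresh underlying MTLCommandBuffer, while temporary MPS
// resources remain valid across the switch.
func process(commandQueue: MTLCommandQueue,
             blur: MPSImageGaussianBlur,
             images: [(source: MTLTexture, destination: MTLTexture)]) {
    let commandBuffer = MPSCommandBuffer(from: commandQueue)

    for (source, destination) in images {
        blur.encode(commandBuffer: commandBuffer,
                    sourceTexture: source,
                    destinationTexture: destination)
        // Flush this batch to the GPU and keep encoding into the same object.
        commandBuffer.commitAndContinue()
    }

    // Submit whatever remains and wait for the final batch.
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}
```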
Feel free to use MPSCommandBuffer with any Metal code; it’s not limited to Metal Performance Shaders. This API is especially useful when the CPU part of your pipeline is heavy.
Bonus
A small code snippet:
Thanks for reading!