| # Transform Feedback implementation on Metal back-end |
| |
| ### Overview |
| - OpenGL ES 3.0 introduced Transform Feedback as a way to capture vertex outputs to buffers, before |
| Compute Shaders were introduced in OpenGL ES 3.1. |
| - Metal doesn't support Transform Feedback natively, but it can be emulated by having a Compute |
| Shader or Vertex Shader write vertex outputs to buffers directly. |
| - If a Vertex Shader writes to buffers directly as well as to stage output (i.e. `[[position]]`, |
| varying variables, ...), the Metal runtime won't allow the `MTLRenderPipelineState` to be |
| created. On Metal, a Vertex Shader may write either to buffers or to stage output, but not both. |
| This makes Transform Feedback challenging to implement when `GL_RASTERIZER_DISCARD` is not |
| enabled, because in that case OpenGL is expected to perform both Transform Feedback and |
| rasterization (feeding stage output to the Fragment Shader) at the same time. |
| |
| ### Current implementation |
| - Transform Feedback is implemented by inserting, at compilation time, an additional code snippet |
| that writes the vertex's varying variables to buffers called XFB buffers. The offsets into those |
| buffers are calculated from `[[vertex_id]]`/`gl_VertexIndex` and `[[instance_id]]`/`gl_InstanceID`. |
| - When Transform Feedback ends, a memory barrier must be inserted because the XFB buffers could be |
| used as vertex inputs in future draw calls. Because Metal's support for explicit memory barriers |
| is limited (currently only macOS 10.14 and above supports them, and ARM-based macOS does not), the |
| only reliable way to insert a memory barrier at present is to end the render pass. |
| - In order to support Transform Feedback capturing and rasterization at the same time, the draw call |
| must be split into 2 passes: |
|   - First pass: the Vertex Shader writes the captured varyings to XFB buffers. The |
|     `MTLRenderPipelineState`'s rasterization is disabled. This can be done in the `spirv-cross` |
|     translation step: `spirv-cross` can convert the Vertex Shader to a `void` function, which |
|     therefore produces no stage output values for the Fragment Shader. |
|   - Second pass: the Vertex Shader writes to stage output normally, but the snippet that writes to |
|     XFB buffers is disabled. Note that the Vertex Shader in this pass is essentially the same as |
|     the first pass's; the only difference is the output route (stage output vs XFB buffers). This |
|     effectively executes the same Vertex Shader's internal logic twice. |
| - If `GL_RASTERIZER_DISCARD` is enabled while Transform Feedback is enabled: |
|   - Only the first pass above is executed. The render pass uses an empty 1x1 texture attachment, |
|     because rasterization is not needed and loading & storing a small texture attachment at the |
|     render pass's start & end boundaries should be cheap. Recall that the render pass has to be |
|     ended to enforce the XFB buffers' memory barrier, as mentioned above. |
| - If `GL_RASTERIZER_DISCARD` is enabled and Transform Feedback is NOT enabled, we cannot disable the |
| `MTLRenderPipelineState`'s rasterization, because doing so makes the Metal runtime require the |
| Vertex Shader to be a `void` function, i.e. one that does not return any stage output values. To |
| work around this: |
|   - The `MTLRenderPipelineState`'s rasterization is still enabled in this case. |
|   - However, the Vertex Shader is translated to write `(-3, -3, -3, 1)` to the |
|     `[[position]]`/`gl_Position` variable at the end, effectively forcing the vertex to be clipped |
|     and preventing it from being sent down to the Fragment Shader. Note that the `(-3, -3, -3, 1)` |
|     write is controlled by a specialization constant, so it can be turned on and off based on the |
|     `GL_RASTERIZER_DISCARD` state. This is more efficient than re-translating the whole shader |
|     with `spirv-cross` to turn it into a `void` function. |
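| |
| The offset calculation mentioned in the first bullet could be sketched on the CPU side as follows. |
| This is a minimal illustration, not the back-end's actual code: the `XfbLayout` struct, its fields, |
| and the formula (tightly packed vertices, instance-major order, zero-based ids) are assumptions |
| made for the example. |

```cpp
#include <cstdint>

// Hypothetical description of one XFB buffer binding (all names are illustrative).
struct XfbLayout
{
    std::uint32_t baseOffsetBytes;      // where capturing starts inside the buffer
    std::uint32_t strideBytes;          // bytes captured per vertex
    std::uint32_t verticesPerInstance;  // vertex count of the draw call
};

// Byte offset at which one vertex invocation would write its captured varyings.
// Assumes vertexId/instanceId are zero-based, mirroring [[vertex_id]]/[[instance_id]].
constexpr std::uint32_t xfbWriteOffset(const XfbLayout &layout,
                                       std::uint32_t vertexId,
                                       std::uint32_t instanceId)
{
    return layout.baseOffsetBytes +
           (instanceId * layout.verticesPerInstance + vertexId) * layout.strideBytes;
}
```

| Under these assumptions, with 3 vertices per instance, vertex 1 of instance 2 would be written 7 |
| vertex slots past the base offset. |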
| |
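| The pass selection described in this section can be summarized as a small decision function. The |
| sketch below is only a model of the rules above; `OutputRoute`, `PassConfig` and `planPasses` are |
| hypothetical names and do not correspond to real back-end or Metal APIs. |

```cpp
#include <vector>

// Where the Vertex Shader's results go in a given pass.
enum class OutputRoute
{
    XfbBuffers,   // void vertex function that writes to XFB buffers
    StageOutput,  // normal [[position]] / varyings output
};

// Hypothetical per-pass configuration (illustrative only).
struct PassConfig
{
    OutputRoute route;
    bool rasterizationEnabled;  // rasterization state of the MTLRenderPipelineState
    bool forceClipPosition;     // specialization constant: write (-3, -3, -3, 1)
};

std::vector<PassConfig> planPasses(bool xfbActive, bool rasterizerDiscard)
{
    std::vector<PassConfig> passes;
    if (xfbActive)
    {
        // First pass: capture only, rasterization disabled.
        passes.push_back({OutputRoute::XfbBuffers, false, false});
        if (!rasterizerDiscard)
        {
            // Second pass: same shader logic, routed to stage output.
            passes.push_back({OutputRoute::StageOutput, true, false});
        }
    }
    else
    {
        // No capture: rasterization stays enabled and discard is emulated by clipping.
        passes.push_back({OutputRoute::StageOutput, true, rasterizerDiscard});
    }
    return passes;
}
```

| For instance, `planPasses(true, false)` yields the two-pass split, while `planPasses(false, true)` |
| yields a single pass with the clipping constant turned on. |
| |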
| ### Future improvements |
| - Use explicit memory barriers on macOS devices that support them instead of ending the render pass. |
| - Instead of executing the same Vertex Shader's logic twice, one alternative approach is to write |
| the vertex outputs to a temporary buffer. Then, in a second pass, copy the captured varyings from |
| that buffer to the XFB buffers. If rasterization is still enabled, a third pass uses the temporary |
| buffer as vertex input; the Vertex Shader in that pass might be just a simple pass-through |
| shader: |
|   1. Original VS -> All outputs to temp buffer. |
|   2. Temp buffer -> Copy captured varyings to XFB buffers. Could be done in a Compute Shader. |
|   3. Temp buffer -> VS pass-through to FS for rasterization. |
| - However, this approach might be even slower than executing the Vertex Shader twice, because a |
| memory barrier must be inserted after the 1st step. This prevents multiple draw calls with |
| Transform Feedback from being parallelized. Furthermore, on iOS devices or devices not supporting |
| explicit barriers, the render pass must be ended and restarted after each draw call. |
| - Most of the time, applications use Transform Feedback with `GL_RASTERIZER_DISCARD` enabled. In |
| that case the original approach simply executes the Vertex Shader once and uses a cheap 1x1 render |
| pass, so it should be fast enough. |
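| |
| The trade-off discussed above can be tallied per draw call. The sketch below only restates the |
| counts given in this section; `DrawWork` and both functions are hypothetical names, and the tally |
| is not a performance measurement. |

```cpp
// Rough per-draw work tally (illustrative only).
struct DrawWork
{
    int fullVertexShaderRuns;  // executions of the original Vertex Shader logic
    int extraPasses;           // additional copy / pass-through passes
    bool barrierInsideDraw;    // memory barrier needed between steps of one draw
};

// Current implementation: run the Vertex Shader once, or twice when rasterizing.
constexpr DrawWork currentApproach(bool rasterizerDiscard)
{
    return DrawWork{rasterizerDiscard ? 1 : 2, 0, false};
}

// Temporary-buffer alternative: one full run, then a copy pass and, when
// rasterizing, a pass-through pass; a barrier is needed after the first step.
constexpr DrawWork tempBufferApproach(bool rasterizerDiscard)
{
    return DrawWork{1, rasterizerDiscard ? 1 : 2, true};
}
```

| With `GL_RASTERIZER_DISCARD` enabled, the current approach comes out as a single Vertex Shader |
| run with no extra pass and no barrier inside the draw. |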