 # Transform Feedback implementation on Metal back-end

 ### Overview
 - OpenGL ES 3.0 introduced Transform Feedback as a way to capture vertex outputs into buffers,
   before Compute Shaders were introduced in later versions.
 - Metal doesn't support Transform Feedback natively, but it can be emulated by using a Compute
   Shader or a Vertex Shader to write vertex outputs to buffers directly.
 - If a Vertex Shader writes to buffers directly as well as to stage output (e.g. `[[position]]`,
   varying variables, ...), the Metal runtime won't allow the `MTLRenderPipelineState` to be
   created. On Metal, a Vertex Shader may write either to buffers or to stage output, but not both.
   This makes Transform Feedback challenging to implement when `GL_RASTERIZER_DISCARD` is not
   enabled, because in that case OpenGL is expected to perform both Transform Feedback and
   rasterization (feeding stage output to the Fragment Shader) at the same time.

 ### Current implementation
 - Transform Feedback is implemented by inserting, at compilation time, an additional code snippet
   that writes the vertex's varying variables to buffers called XFB buffers. The offsets into these
   buffers are calculated from `[[vertex_id]]`/`gl_VertexIndex` and
   `[[instance_id]]`/`gl_InstanceID`.
 - When Transform Feedback ends, a memory barrier must be inserted because the XFB buffers could be
   used as vertex inputs in future draw calls. Metal does not support explicit memory barriers on
   all devices (currently only macOS 10.14 and above supports them, and ARM-based macOS does not),
   so the only reliable way to insert a memory barrier at present is to end the render pass.
 - In order to support Transform Feedback capture and rasterization at the same time, the draw call
   must be split into 2 passes:
     - First pass: the Vertex Shader writes the captured varyings to the XFB buffers, and the
       `MTLRenderPipelineState`'s rasterization is disabled. This can be done in the `spirv-cross`
       translation step: `spirv-cross` can convert the Vertex Shader to a `void` function, which
       therefore produces no stage output values for the Fragment Shader.
     - Second pass: the Vertex Shader writes to stage output normally, but the snippet that writes
       to the XFB buffers is disabled. Note that the Vertex Shader in this pass is essentially the
       same as the first pass's; the only difference is the output route (stage output vs XFB
       buffers). This effectively executes the same Vertex Shader's internal logic twice.
 - If `GL_RASTERIZER_DISCARD` is enabled while Transform Feedback is enabled:
     - Only the first pass above is executed. The render pass uses a 1x1 empty texture attachment,
       because rasterization is not needed and the load & store of such a small attachment at the
       start & end of the render pass should be cheap. Recall that the render pass has to be ended
       to enforce the XFB buffers' memory barrier, as mentioned above.
 - If `GL_RASTERIZER_DISCARD` is enabled and Transform Feedback is NOT enabled, we cannot disable
   the `MTLRenderPipelineState`'s rasterization, because doing so makes the Metal runtime require
   the Vertex Shader to be a `void` function, i.e. one that returns no stage output values. To
   work around this:
     - The `MTLRenderPipelineState`'s rasterization is still enabled in this case.
     - However, the Vertex Shader is translated to write `(-3, -3, -3, 1)` to the
       `[[position]]`/`gl_Position` variable at the end, which forces the vertex to be clipped and
       prevents it from being sent down to the Fragment Shader. Note that the `(-3, -3, -3, 1)`
       write is controlled by a specialization constant, so it can be turned on and off based on
       the `GL_RASTERIZER_DISCARD` state. This is more efficient than re-translating the whole
       shader code with `spirv-cross` to turn it into a `void` function.
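
The cases above can be summarized in a short sketch. The C++ below is illustrative only and is not ANGLE's actual code: `DrawState`, `DrawPass`, `planPasses`, and `xfbByteOffset` are hypothetical names. The offset formula is also an assumption (captured vertices stored contiguously, instance after instance, with a fixed per-vertex stride, for a draw whose vertex IDs start at 0); the text above only states that offsets are derived from `[[vertex_id]]` and `[[instance_id]]`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical snapshot of the GL state that matters for one draw call.
struct DrawState
{
    bool transformFeedbackActive;
    bool rasterizerDiscard;
};

// Hypothetical description of one pass issued for that draw call.
struct DrawPass
{
    bool writesXfbBuffers;       // Vertex Shader is a `void` function writing XFB buffers
    bool rasterizationEnabled;   // rasterization setting of the MTLRenderPipelineState
    bool forcesClippedPosition;  // Vertex Shader writes (-3, -3, -3, 1) to [[position]]
    bool uses1x1Attachment;      // render pass uses the cheap 1x1 empty attachment
};

// Expands one GL draw call into the passes described in the list above.
inline std::vector<DrawPass> planPasses(const DrawState &state)
{
    std::vector<DrawPass> passes;
    if (state.transformFeedbackActive)
    {
        // Capture pass: buffer writes only, no stage output. The 1x1 attachment
        // is only documented for the case where rasterization is discarded.
        passes.push_back({true, false, false, state.rasterizerDiscard});
        if (!state.rasterizerDiscard)
        {
            // Rasterization pass: same shader logic, XFB snippet disabled.
            passes.push_back({false, true, false, false});
        }
    }
    else
    {
        // No capture: rasterization stays enabled, and with discard the
        // position is forced outside the clip volume instead.
        passes.push_back({false, true, state.rasterizerDiscard, false});
    }
    return passes;
}

// Assumed layout: vertex N of instance I lands at index I * verticesPerInstance + N.
inline std::uint64_t xfbByteOffset(std::uint64_t bufferBaseOffset,
                                   std::uint32_t strideBytes,
                                   std::uint32_t verticesPerInstance,
                                   std::uint32_t vertexId,
                                   std::uint32_t instanceId)
{
    assert(strideBytes > 0);
    const std::uint64_t vertexIndex =
        std::uint64_t(instanceId) * verticesPerInstance + vertexId;
    return bufferBaseOffset + vertexIndex * strideBytes;
}
```

For example, `planPasses({true, false})` yields two passes (capture, then rasterization), while `planPasses({false, true})` yields a single pass with rasterization enabled and the clipped position write turned on.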

 ### Future improvements
 - Use explicit memory barriers on macOS devices that support them instead of ending the render
   pass.
 - Instead of executing the same Vertex Shader's logic twice, one alternative approach is to write
   the vertex outputs to a temporary buffer. Then, in a second pass, copy the varyings from that
   buffer to the XFB buffers. If rasterization is still enabled, a 3rd pass is invoked that uses
   the temporary buffer as vertex input; the Vertex Shader in the 3rd pass might just be a simple
   passthrough shader:
     1. Original VS -> All outputs to temp buffer.
     2. Temp buffer -> Copy captured varyings to XFB buffers. Could be done in a Compute Shader.
     3. Temp buffer -> VS pass through to FS for rasterization.
 - However, this approach might be even slower than executing the Vertex Shader twice, because a
   memory barrier must be inserted after the 1st step. This prevents multiple draw calls with
   Transform Feedback from being parallelized. Furthermore, on iOS devices or devices not
   supporting explicit barriers, the render pass must be ended and restarted after each draw call.
 - Most of the time, applications use Transform Feedback with `GL_RASTERIZER_DISCARD` enabled. In
   that case the original approach simply executes the Vertex Shader once and uses a cheap 1x1
   render pass, so it should be fast enough.
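
The trade-off discussed above can be made concrete with a small counting model. This is a rough sketch, not a measurement: `Cost`, `currentApproach`, and `tempBufferApproach` are hypothetical names, and the model assumes a batch of `drawCount` draw calls inside a single Transform Feedback session on a device where the current approach ends the render pass once, when Transform Feedback ends.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-batch cost summary; counts only, no timings.
struct Cost
{
    std::uint32_t fullVsRunsPerVertex;         // executions of the original Vertex Shader
    std::uint32_t passthroughVsRunsPerVertex;  // executions of a passthrough Vertex Shader
    std::uint32_t copyPassesPerDraw;           // temp buffer -> XFB buffer copies
    std::uint32_t renderPassRestarts;          // forced render pass ends for the whole batch
};

// Current approach: capture pass, plus a rasterization pass unless discarded.
// The render pass is ended once, when Transform Feedback ends.
inline Cost currentApproach(bool rasterizerDiscard)
{
    return {rasterizerDiscard ? 1u : 2u, 0u, 0u, 1u};
}

// Temporary-buffer alternative: the original Vertex Shader runs once, a copy
// pass fills the XFB buffers, and a passthrough pass rasterizes if needed.
// Without explicit barriers, every draw call forces a render pass restart.
inline Cost tempBufferApproach(bool rasterizerDiscard,
                               bool explicitBarrierSupported,
                               std::uint32_t drawCount)
{
    assert(drawCount > 0);  // the model assumes at least one draw call
    return {1u, rasterizerDiscard ? 0u : 1u, 1u,
            explicitBarrierSupported ? 0u : drawCount};
}
```

Under these assumptions, with `GL_RASTERIZER_DISCARD` enabled the current approach already runs the Vertex Shader only once per vertex, so the alternative adds a copy pass and, without explicit barriers, one render pass restart per draw call.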