docs/drivers/vc4.rst - platform/external/mesa3d - Git at Google

 VC4
 ===

 Mesa's VC4 graphics driver supports multiple implementations of
 Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
 through Raspberry Pi 3 hardware, and the driver is included as an
 option as of the 2016-02-09 Raspbian release using ``raspi-config``.
 On most other distributions such as Debian or Fedora, you need no
 configuration to enable the driver.

 This Mesa driver talks directly to the `VC4
 <https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
 driver for scheduling graphics commands, and that module also provides
 KMS display support.  The driver makes no use of the closed source VPU
 firmware on the VideoCore IV block, instead talking directly to the
 GPU block from Linux.

 GLES2 support
 -------------

 The VC4 driver is a nearly conformant GLES2 driver, and the hardware
 has achieved GLES2 conformance with other driver stacks.

 OpenGL support
 --------------

 Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
 mostly correct but with a few caveats.

 * 4-byte index buffers.

 GLES2.0, and VC4, don't have ``GL_UNSIGNED_INT`` index buffers. To support
 them in VC4, we create a shadow copy of your index buffer with the
 indices truncated to 2 bytes. This is incorrect (and will assertion
 fail in debug builds of Mesa) if any of the indices were >65535. To
 fix that, we would need to detect this case and rewrite the index
 buffer and vertex buffers to do a series of draws each with small
 indices and new vertex attrib bindings.

 To avoid this problem, ensure that all index buffers are written using
 ``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
 with updated vertex attrib bindings.

 * Occlusion queries

 The VC4 hardware has no support for occlusion queries.  GL 2.0
 requires that you support the occlusion queries extension, but you can
 report 0 from ``glGetQueryiv(GL_SAMPLES_PASSED,
 GL_QUERY_COUNTER_BITS)``. This is absurd, but it's how OpenGL handles
 "we want the functions to be present everywhere, but we want it to be
 optional for hardware to support it. Sadly, gallium doesn't yet allow
 the driver to report 0 query bits.

 * Primitive mode

 VC4 doesn't support reducing triangles/quads/polygons to lines and
 points like desktop GL. If front/back mode matched, we could rewrite
 the index buffer to the new primitive type, but we don't. If
 front/back mode don't match, we would need to run the vertex shader in
 software, classify the prims, write new index buffers, and emit
 (possibly many) new draw calls to rasterize the new prims in the same
 order.

 Bug Reporting
 -------------

 VC4 rendering bugs should go to Mesa's GitLab `issues
 <https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.

 By far the easiest way to communicate bug reports for rendering
 problems is to take an apitrace. This passes exactly the drawing you
 saw to the developer, without the developer needing to download and
 build the application and replicate whatever steps you took to produce
 the problem.  Traces attached to bug reports should ideally be small.

 For GPU hangs, if you can get a short apitrace that produces the
 problem, that's still the best.  If the problem takes a long time to
 reproduce or you can't capture it in a trace, describing how to
 reproduce and including a GPU hang dump would be the most
 useful. Install `vc4-gpu-tools
 <https://github.com/anholt/vc4-gpu-tools/>`__ and use
 ``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
 provide useful information.

 Tiled Rendering
 ---------------

 VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
 32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
 looks like::

     (CPU) Allocate space to store a list of draw commands per tile
     (CPU) Set up a command list per tile that does:
         Either load the current tile's color buffer from memory, or clear it.
         Either load the current tile's depth buffer from memory, or clear it.
         Branch into the draw list for the tile
         Store the depth buffer if anybody might read it.
         Store the color buffer if anybody might read it.
     (GPU) Initialize the per-tile draw call lists to empty.
     (GPU) Run all draw calls collecting vertex data
     (GPU) For each tile covered by a draw call's primitive.
         Emit state packets to the list to update it to the current draw call's state.
         Emit a primitive description into the tile's draw call list.

 Tiled rendering avoids the need for large render target caches, at the
 expense of increasing the cost of vertex processing. Unlike some tiled
 renderers, VC4 has no non-tiled rendering mode.

 Performance Tricks
 ------------------

 * Reducing memory bandwidth by clearing.

 Even if your drawing is going to cover the entire render target, it's
 more efficient for VC4 if you emit a ``glClear()`` of the color and
 depth buffers. This means we can skip the load of the previous state
 from memory, in favor of a cheap GPU-side ``memset()`` of the tile
 buffer before we start running the draw calls.

 * Reducing memory bandwidth with scissoring.

 If all draw calls for the frame are with a ``glScissor()`` to only
 part of the screen, then we can skip setting up the tiles for that
 area, which means a little less memory used setting up the empty bins,
 and a lot less memory used loading/storing the unchanged tiles.

 * Reducing memory bandwidth with ``glInvalidateFramebuffer()``.

 If we don't know who might use the contents of the framebuffer's depth
 or color in the future, then we have to store it for later. If you use
 glInvalidateFramebuffer() before accessing the results of your
 rendering, then we can skip the store of the depth or color
 buffer. Note that this is unimplemented.

 * Avoid non-constant GLSL array indexing

 In VC4 the only non-constant-index array access supported in hardware
 is uniforms. For everything else (inputs, outputs, temporaries), we
 have to lower them to an IF ladder like::

   if (index == 0)
      return array[0]
   else if (index == 1)
     return array[1]
   ...

 This is very expensive as we probably have to execute every branch of
 every IF statement due to it being a SIMD machine. So, it is
 recommended (if you can) to avoid non-uniform non-constant array
 indexing.

 Note that if you do variable indexing within a bounded loop that Mesa
 can unroll, that can actually count as constant indexing.

 * Increasing GPU memory Increase CMA pool size

 The memory for the VC4 driver is allocated from the standard Linux CMA
 pool. The size of this pool defaults to 64 MB.  To increase this, pass
 an additional parameter on the kernel command line.  Edit the boot
 partition's ``cmdline.txt`` to add::

   cma=256M@256M

 ``cmdline.txt`` is a single line with whitespace separated parameters.

 The first value is the size of the pool and the second parameter is
 the start address of the pool. The pool size can be increased further,
 but it must fit into the memory, so size + start address must be below
 1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). Also this
 reduces the memory available to Linux.

 * Decrease firmware memory

 The firmware allocates a fixed chunk of memory before booting
 Linux. If firmware functions are not required, this amount can be
 reduced.

 In ``config.txt`` edit ``gpu_mem`` to 16, if you do not need video decoding,
 edit gpu_mem to 64 if you need video decoding.

 Performance debugging
 ---------------------

 * Step 1: Known issues

 The first tool to look at is running your application with the
 environment variable ``VC4_DEBUG=perf`` set. This will report debug
 information for many known causes of performance problems on the
 console. Not all of them will cause visible performance improvements
 when fixed, but it's a good first step to see what might going wrong.

 * Step 2: CPU vs GPU

 The primary question is figuring out whether the CPU is busy in your
 application, the CPU is busy in the GL driver, the GPU is waiting for
 the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
 point where the CPU is waiting for the GPU infrequently but for a
 significant amount of time (however long it takes the GPU to draw a
 frame).

 Start with top while your application is running. Is the CPU usage
 around 90%+? If so, then our performance analysis will be with
 sysprof. If it's not very high, is the GPU staying busy? We don't have
 a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
 useful. If ``CT0CA`` != ``CT0EA`` or ``CT1CA`` != ``CT1EA``, that
 means that the GPU is currently busy processing some rendering job.

 * sysprof for CPU usage

 If the CPU is totally busy and the GPU isn't terribly busy, there is
 an excellent tool for debugging: sysprof. Install, run as root (so you
 can get system-wide profiling), hit play and later stop. The top-left
 area shows the flat profile sorted by total time of that symbol plus
 its descendants. The top few are generally uninteresting (main() and
 its descendants consuming a lot), but eventually you can get down to
 something interesting. Click it, and to the right you get the
 callchains to descendants -- where all that time actually went. On the
 other hand, the lower left shows callers -- double-clicking those
 selects that as the symbol to view, instead.

 Note that you need debug symbols for the callgraphs in sysprof to
 work, which is where most of its value is. Most distributions offer
 debug symbol packages from their builds which can be installed
 separately, and sysprof will find them. I've found that on arm, the
 debug packages are not enough, and if someone could determine what is
 necessary for callgraphs in debugging, that would be really helpful.

 * perf for CPU waits on GPU

 If the CPU is not very busy and the GPU is not very busy, then we're
 probably ping-ponging between the two. Most cases of this would be
 noticed by ``VC4_DEBUG=perf``, but not all. To see all cases where
 this happens, use the perf tool from the Linux kernel (note: unrelated
 to ``VC4_DEBUG=perf``)::

     sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

 If you want to see the whole system's stalls for a period of time
 (very useful!), use the -a flag instead of a particular command
 name. Just ``^C`` when you're done capturing data.

 At exit, you'll have ``perf.data`` in the current directory. You can print
 out the results with::

     perf report | less

 * Debugging for GPU fully busy

 As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
 performance counters in OpenGL. Install apitrace, and trace your
 application with::

     apitrace trace <application>          # for GLX applications
     apitrace trace -a egl <application>   # for EGL applications

 Once you've captured a trace, you can see what counters are available
 and replay it while looking while looking at some of those counters::

     apitrace replay <application>.trace --list-metrics

     apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

 Multiple counters can be captured at once with commas separating them.

 Once you've found what draw calls are surprisingly expensive in one of
 the counters, you can work out which ones they were at the GL level by
 opening the trace up in qapitrace and using ``^-G`` to jump to that call
 number and ``^-L`` to look up the GL state at that call.

 Trace Testing
 -------------

 shader-db is often used as a proxy for real-world app performance when
 working on the compiler in Mesa.  On VC4, there is a lot of
 state-dependent code in the shaders (like blending or vertex attribute
 format handling), so the typical `shader-db
 <https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
 areas for optimization.  Piglit can instead test apitraces, such as
 those captured in
 `traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__.

 Hardware Documentation
 ----------------------

 For driver developers, Broadcom publicly released a `specification
 <https://docs.broadcom.com/doc/12358545>`__ PDF for the 21553, which
 is closely related to the VC4 GPU present in the Raspberry Pi.  They
 also released a `snapshot <https://docs.broadcom.com/docs/12358546>`__
 of a corresponding Android graphics driver.  That graphics driver was
 ported to Raspbian for a demo, but was not expected to have ongoing
 development.

 Developers with NDA access with Broadcom or Raspberry Pi can
 potentially get access to "simpenrose", the C software simulator of
 the GPU.  The Mesa driver includes a backend (``vc4_simulator.c``) to
 use simpenrose from an x86 system with the i915 graphics driver with
 all of the VC4 rendering commands emulated on simpenrose and memcpyed
 to the real GPU.
	VC4
	===

	Mesa's VC4 graphics driver supports multiple implementations of
	Broadcom's VideoCore IV GPU. It is notably used in the Raspberry Pi 0
	through Raspberry Pi 3 hardware, and the driver is included as an
	option as of the 2016-02-09 Raspbian release using ``raspi-config``.
	On most other distributions such as Debian or Fedora, you need no
	configuration to enable the driver.

	This Mesa driver talks directly to the `VC4
	<https://www.kernel.org/doc/html/latest/gpu/vc4.html>`__ kernel DRM
	driver for scheduling graphics commands, and that module also provides
	KMS display support. The driver makes no use of the closed source VPU
	firmware on the VideoCore IV block, instead talking directly to the
	GPU block from Linux.

	GLES2 support
	-------------

	The VC4 driver is a nearly conformant GLES2 driver, and the hardware
	has achieved GLES2 conformance with other driver stacks.

	OpenGL support
	--------------

	Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is
	mostly correct but with a few caveats.

	* 4-byte index buffers.

	GLES2.0, and VC4, don't have ``GL_UNSIGNED_INT`` index buffers. To support
	them in VC4, we create a shadow copy of your index buffer with the
	indices truncated to 2 bytes. This is incorrect (and will assertion
	fail in debug builds of Mesa) if any of the indices were >65535. To
	fix that, we would need to detect this case and rewrite the index
	buffer and vertex buffers to do a series of draws each with small
	indices and new vertex attrib bindings.

	To avoid this problem, ensure that all index buffers are written using
	``GL_UNSIGNED_SHORT``, even at the cost of doing multiple draw calls
	with updated vertex attrib bindings.

	* Occlusion queries

	The VC4 hardware has no support for occlusion queries. GL 2.0
	requires that you support the occlusion queries extension, but you can
	report 0 from ``glGetQueryiv(GL_SAMPLES_PASSED,
	GL_QUERY_COUNTER_BITS)``. This is absurd, but it's how OpenGL handles
	"we want the functions to be present everywhere, but we want it to be
	optional for hardware to support it. Sadly, gallium doesn't yet allow
	the driver to report 0 query bits.

	* Primitive mode

	VC4 doesn't support reducing triangles/quads/polygons to lines and
	points like desktop GL. If front/back mode matched, we could rewrite
	the index buffer to the new primitive type, but we don't. If
	front/back mode don't match, we would need to run the vertex shader in
	software, classify the prims, write new index buffers, and emit
	(possibly many) new draw calls to rasterize the new prims in the same
	order.

	Bug Reporting
	-------------

	VC4 rendering bugs should go to Mesa's GitLab `issues
	<https://gitlab.freedesktop.org/mesa/mesa/-/issues>`__ page.

	By far the easiest way to communicate bug reports for rendering
	problems is to take an apitrace. This passes exactly the drawing you
	saw to the developer, without the developer needing to download and
	build the application and replicate whatever steps you took to produce
	the problem. Traces attached to bug reports should ideally be small.

	For GPU hangs, if you can get a short apitrace that produces the
	problem, that's still the best. If the problem takes a long time to
	reproduce or you can't capture it in a trace, describing how to
	reproduce and including a GPU hang dump would be the most
	useful. Install `vc4-gpu-tools
	<https://github.com/anholt/vc4-gpu-tools/>`__ and use
	``vc4_dump_hang_state my-app.hang``. Sometimes the hang file will
	provide useful information.

	Tiled Rendering
	---------------

	VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or
	32x32 (MSAA) tiles and rendering the scene per tile. Rasterization
	looks like::

	(CPU) Allocate space to store a list of draw commands per tile
	(CPU) Set up a command list per tile that does:
	Either load the current tile's color buffer from memory, or clear it.
	Either load the current tile's depth buffer from memory, or clear it.
	Branch into the draw list for the tile
	Store the depth buffer if anybody might read it.
	Store the color buffer if anybody might read it.
	(GPU) Initialize the per-tile draw call lists to empty.
	(GPU) Run all draw calls collecting vertex data
	(GPU) For each tile covered by a draw call's primitive.
	Emit state packets to the list to update it to the current draw call's state.
	Emit a primitive description into the tile's draw call list.

	Tiled rendering avoids the need for large render target caches, at the
	expense of increasing the cost of vertex processing. Unlike some tiled
	renderers, VC4 has no non-tiled rendering mode.

	Performance Tricks
	------------------

	* Reducing memory bandwidth by clearing.

	Even if your drawing is going to cover the entire render target, it's
	more efficient for VC4 if you emit a ``glClear()`` of the color and
	depth buffers. This means we can skip the load of the previous state
	from memory, in favor of a cheap GPU-side ``memset()`` of the tile
	buffer before we start running the draw calls.

	* Reducing memory bandwidth with scissoring.

	If all draw calls for the frame are with a ``glScissor()`` to only
	part of the screen, then we can skip setting up the tiles for that
	area, which means a little less memory used setting up the empty bins,
	and a lot less memory used loading/storing the unchanged tiles.

	* Reducing memory bandwidth with ``glInvalidateFramebuffer()``.

	If we don't know who might use the contents of the framebuffer's depth
	or color in the future, then we have to store it for later. If you use
	glInvalidateFramebuffer() before accessing the results of your
	rendering, then we can skip the store of the depth or color
	buffer. Note that this is unimplemented.

	* Avoid non-constant GLSL array indexing

	In VC4 the only non-constant-index array access supported in hardware
	is uniforms. For everything else (inputs, outputs, temporaries), we
	have to lower them to an IF ladder like::

	if (index == 0)
	return array[0]
	else if (index == 1)
	return array[1]
	...

	This is very expensive as we probably have to execute every branch of
	every IF statement due to it being a SIMD machine. So, it is
	recommended (if you can) to avoid non-uniform non-constant array
	indexing.

	Note that if you do variable indexing within a bounded loop that Mesa
	can unroll, that can actually count as constant indexing.

	* Increasing GPU memory Increase CMA pool size

	The memory for the VC4 driver is allocated from the standard Linux CMA
	pool. The size of this pool defaults to 64 MB. To increase this, pass
	an additional parameter on the kernel command line. Edit the boot
	partition's ``cmdline.txt`` to add::

	cma=256M@256M

	``cmdline.txt`` is a single line with whitespace separated parameters.

	The first value is the size of the pool and the second parameter is
	the start address of the pool. The pool size can be increased further,
	but it must fit into the memory, so size + start address must be below
	1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). Also this
	reduces the memory available to Linux.

	* Decrease firmware memory

	The firmware allocates a fixed chunk of memory before booting
	Linux. If firmware functions are not required, this amount can be
	reduced.

	In ``config.txt`` edit ``gpu_mem`` to 16, if you do not need video decoding,
	edit gpu_mem to 64 if you need video decoding.

	Performance debugging
	---------------------

	* Step 1: Known issues

	The first tool to look at is running your application with the
	environment variable ``VC4_DEBUG=perf`` set. This will report debug
	information for many known causes of performance problems on the
	console. Not all of them will cause visible performance improvements
	when fixed, but it's a good first step to see what might going wrong.

	* Step 2: CPU vs GPU

	The primary question is figuring out whether the CPU is busy in your
	application, the CPU is busy in the GL driver, the GPU is waiting for
	the CPU, or the CPU is waiting for the GPU. Ideally, you get to the
	point where the CPU is waiting for the GPU infrequently but for a
	significant amount of time (however long it takes the GPU to draw a
	frame).

	Start with top while your application is running. Is the CPU usage
	around 90%+? If so, then our performance analysis will be with
	sysprof. If it's not very high, is the GPU staying busy? We don't have
	a clean tool for this yet, but ``cat /debug/dri/0/v3d_regs`` could be
	useful. If ``CT0CA`` != ``CT0EA`` or ``CT1CA`` != ``CT1EA``, that
	means that the GPU is currently busy processing some rendering job.

	* sysprof for CPU usage

	If the CPU is totally busy and the GPU isn't terribly busy, there is
	an excellent tool for debugging: sysprof. Install, run as root (so you
	can get system-wide profiling), hit play and later stop. The top-left
	area shows the flat profile sorted by total time of that symbol plus
	its descendants. The top few are generally uninteresting (main() and
	its descendants consuming a lot), but eventually you can get down to
	something interesting. Click it, and to the right you get the
	callchains to descendants -- where all that time actually went. On the
	other hand, the lower left shows callers -- double-clicking those
	selects that as the symbol to view, instead.

	Note that you need debug symbols for the callgraphs in sysprof to
	work, which is where most of its value is. Most distributions offer
	debug symbol packages from their builds which can be installed
	separately, and sysprof will find them. I've found that on arm, the
	debug packages are not enough, and if someone could determine what is
	necessary for callgraphs in debugging, that would be really helpful.

	* perf for CPU waits on GPU

	If the CPU is not very busy and the GPU is not very busy, then we're
	probably ping-ponging between the two. Most cases of this would be
	noticed by ``VC4_DEBUG=perf``, but not all. To see all cases where
	this happens, use the perf tool from the Linux kernel (note: unrelated
	to ``VC4_DEBUG=perf``)::

	sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

	If you want to see the whole system's stalls for a period of time
	(very useful!), use the -a flag instead of a particular command
	name. Just ``^C`` when you're done capturing data.

	At exit, you'll have ``perf.data`` in the current directory. You can print
	out the results with::

	perf report \| less

	* Debugging for GPU fully busy

	As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware's
	performance counters in OpenGL. Install apitrace, and trace your
	application with::

	apitrace trace <application> # for GLX applications
	apitrace trace -a egl <application> # for EGL applications

	Once you've captured a trace, you can see what counters are available
	and replay it while looking while looking at some of those counters::

	apitrace replay <application>.trace --list-metrics

	apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

	Multiple counters can be captured at once with commas separating them.

	Once you've found what draw calls are surprisingly expensive in one of
	the counters, you can work out which ones they were at the GL level by
	opening the trace up in qapitrace and using ``^-G`` to jump to that call
	number and ``^-L`` to look up the GL state at that call.

	Trace Testing
	-------------

	shader-db is often used as a proxy for real-world app performance when
	working on the compiler in Mesa. On VC4, there is a lot of
	state-dependent code in the shaders (like blending or vertex attribute
	format handling), so the typical `shader-db
	<https://gitlab.freedesktop.org/mesa/shader-db>`__ will miss important
	areas for optimization. Piglit can instead test apitraces, such as
	those captured in
	`traces-db <https://gitlab.freedesktop.org/gfx-ci/tracie/traces-db/>`__.

	Hardware Documentation
	----------------------

	For driver developers, Broadcom publicly released a `specification
	<https://docs.broadcom.com/doc/12358545>`__ PDF for the 21553, which
	is closely related to the VC4 GPU present in the Raspberry Pi. They
	also released a `snapshot <https://docs.broadcom.com/docs/12358546>`__
	of a corresponding Android graphics driver. That graphics driver was
	ported to Raspbian for a demo, but was not expected to have ongoing
	development.

	Developers with NDA access with Broadcom or Raspberry Pi can
	potentially get access to "simpenrose", the C software simulator of
	the GPU. The Mesa driver includes a backend (``vc4_simulator.c``) to
	use simpenrose from an x86 system with the i915 graphics driver with
	all of the VC4 rendering commands emulated on simpenrose and memcpyed
	to the real GPU.