| Freedreno |
| ========= |
| |
Freedreno is a GLES and GL driver for Adreno 2xx-6xx GPUs. It implements up to
OpenGL ES 3.2 and desktop OpenGL 4.5.
| |
| See the `Freedreno Wiki |
| <https://gitlab.freedesktop.org/freedreno/freedreno/-/wikis/home>`__ for more |
| details. |
| |
| Turnip |
| ------ |
| |
| Turnip is a Vulkan 1.3 driver for Adreno 6xx GPUs. |
| |
| The current set of specific chip versions supported can be found in |
| :file:`src/freedreno/common/freedreno_devices.py`. The current set of features |
| supported can be found rendered at `Mesa Matrix <https://mesamatrix.net/>`__. |
| There are no plans to port to a5xx or earlier GPUs. |
| |
| Hardware architecture |
| --------------------- |
| |
| Adreno is a mostly tile-mode renderer, but with the option to bypass tiling |
| ("gmem") and render directly to system memory ("sysmem"). It is UMA, using |
| mostly write combined memory but with the ability to map some buffers as cache |
| coherent with the CPU. |
| |
| .. toctree:: |
| :glob: |
| |
| freedreno/hw/* |
| |
| Hardware acronyms |
| ^^^^^^^^^^^^^^^^^ |
| |
| .. glossary:: |
| |
| Cluster |
| A group of hardware registers, often with multiple copies to allow |
| pipelining. There is an M:N relationship between hardware blocks that do |
| work and the clusters of registers for the state that hardware blocks use. |
| |
| CP |
| Command Processor. Reads the stream of state changes and draw commands |
| generated by the driver. |
| |
| PFP |
| Prefetch Parser. Adreno 2xx-4xx CP component. |
| |
| ME |
| Micro Engine. Adreno 2xx-4xx CP component after PFP, handles most PM4 commands. |
| |
| SQE |
| a6xx+ replacement for PFP/ME. This is the microcontroller that runs the |
| microcode (loaded from Linux) which actually processes the command stream |
| and writes to the hardware registers. See `afuc |
| <https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/afuc/README.rst>`__. |
| |
| ROQ |
| DMA engine used by the SQE for reading memory, with some prefetch buffering. |
| Mostly reads in the command stream, but also serves for |
| ``CP_MEMCPY``/``CP_MEM_TO_REG`` and visibility stream reads. |
| |
| SP |
| Shader Processor. Unified, scalar shader engine. One or more, depending on |
| GPU and tier. |
| |
| TP |
| Texture Processor. |
| |
| UCHE |
| Unified L2 Cache. 32KB on A330, unclear how big now. |
| |
| CCU |
| Color Cache Unit. |
| |
| VSC |
| Visibility Stream Compressor |
| |
| PVS |
| Primitive Visibility Stream |
| |
| FE |
| Front End? Index buffer and vertex attribute fetch cluster. Includes PC, |
| VFD, VPC. |
| |
| VFD |
| Vertex Fetch and Decode |
| |
| VPC |
| Varying/Position Cache? Hardware block that stores shaded vertex data for |
| primitive assembly. |
| |
| HLSQ |
| High Level Sequencer. Manages state for the SPs, batches up PS invocations |
| between primitives, is involved in preemption. |
| |
| PC_VS |
| Cluster where varyings are read from VPC and assembled into primitives to |
| feed GRAS. |
| |
   VS
      Vertex Shader. Responsible for generating VS/GS/tess invocations.
| |
   GRAS
      Rasterizer. Responsible for generating PS invocations from primitives;
      also does LRZ.
| |
| PS |
| Pixel Shader. |
| |
| RB |
| Render Backend. Performs both early and late Z testing, blending, and |
| attachment stores of output of the PS. |
| |
   GMEM
      Roughly 128KB-1MB of memory on the GPU (SKU-dependent), used to store
      attachments during tiled rendering.
| |
| LRZ |
| Low Resolution Z. A low resolution area of the depth buffer that can be |
| initialized during the binning pass to contain the worst-case (farthest) Z |
| values in a block, and then used to early reject fragments during |
| rasterization. |
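
The LRZ early-reject idea can be sketched as a toy model (illustrative only,
not the hardware algorithm; assumes a LESS depth function):

.. code-block:: python

   # The binning pass records the worst-case (farthest) Z per low-res
   # block; a fragment that is at least as far as that bound cannot pass
   # the depth test anywhere in the block and is rejected unshaded.
   def build_lrz(depths, block_size):
       return [max(depths[i:i + block_size])
               for i in range(0, len(depths), block_size)]

   def lrz_rejects(lrz, block_size, pixel, frag_z):
       return frag_z >= lrz[pixel // block_size]

   lrz = build_lrz([0.2, 0.4, 0.9, 0.8], block_size=2)
   assert lrz == [0.4, 0.9]
   assert lrz_rejects(lrz, 2, pixel=0, frag_z=0.5)      # definitely occluded
   assert not lrz_rejects(lrz, 2, pixel=0, frag_z=0.3)  # may be visible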
| |
| Cache hierarchy |
| ^^^^^^^^^^^^^^^ |
| |
| The a6xx GPUs have two main caches: CCU and UCHE. |
| |
| UCHE (Unified L2 Cache) is the cache behind the vertex fetch, VSC writes, |
| texture L1, LRZ, and storage image accesses (``ldib``/``stib``). Misses and |
| flushes access system memory. |
| |
| The CCU is the separate cache used by 2D blits and sysmem render target access |
| (and also for resolves to system memory when in GMEM mode). Its memory comes |
| from a carveout of GMEM controlled by ``RB_CCU_CNTL``, with a varying amount |
| reserved based on whether we're in a render pass using GMEM for attachment |
| storage, or we're doing sysmem rendering. Cache entries have the attachment |
| number and layer mixed into the cache tag in some way, likely so that a |
| fragment's access is spread through the cache even if the attachments are the |
| same size and alignments in address space. This means that the cache must be |
| flushed and invalidated between memory being used for one attachment and another |
| (notably depth vs color, but also MRT color). |
| |
The Texture Processors (TP) additionally have a small L1 cache (1KB on A330,
unclear how big now) before accessing UCHE. This cache is used for normal
sampling like ``sam`` and ``isam`` (and the compiler will route read-only
storage image accesses through it as well). It is not coherent with UCHE (you
may get stale results when you ``sam`` after ``stib``), but it appears to be
flushed at least per draw, since no manual invalidate is needed between draws
storing to an image and draws sampling from a texture.
| |
| The command processor (CP) does not read from either of these caches, and |
| instead uses FIFOs in the ROQ to avoid stalls reading from system memory. |
| |
| Draw states |
| ^^^^^^^^^^^ |
| |
Since the SQE is not a fast processor, and tiled rendering means that many
draws won't even be used in many bins, starting with a5xx state updates can be
batched up into "draw states" that point to a fragment of CP packets. At draw
time, if the draw call is going to actually execute (some primitive is visible
in the current tile), the SQE goes through the ``GROUP_ID``\s and, for any with
an update since the last time they were executed, executes the corresponding
fragment.
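
The skip/execute behavior can be sketched as a toy model (illustrative only,
not the actual SQE microcode; all names here are invented):

.. code-block:: python

   # Each GROUP_ID maps to a fragment of CP packets plus a dirty flag
   # that is set whenever the driver points the group at new state.
   class DrawStateModel:
       def __init__(self):
           self.groups = {}  # group_id -> [packet_fragment, dirty]

       def set_draw_state(self, group_id, packets):
           self.groups[group_id] = [packets, True]

       def draw(self, primitive_visible):
           executed = []
           if not primitive_visible:
               # Draw is skipped in this bin; dirty flags stay set so the
               # state still lands before the next draw that does execute.
               return executed
           for group_id, state in sorted(self.groups.items()):
               if state[1]:
                   executed.append((group_id, state[0]))
                   state[1] = False
           return executed

   model = DrawStateModel()
   model.set_draw_state(1, "blend state")
   model.set_draw_state(2, "depth state")
   assert model.draw(primitive_visible=False) == []
   assert model.draw(primitive_visible=True) == [(1, "blend state"),
                                                 (2, "depth state")]
   model.set_draw_state(1, "new blend state")
   assert model.draw(primitive_visible=True) == [(1, "new blend state")]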
| |
| Starting with a6xx, states can be tagged with whether they should be executed |
| at draw time for any of sysmem, binning, or tile rendering. This allows a |
| single command stream to be generated which can be executed in any of the modes, |
| unlike pre-a6xx where we had to generate separate command lists for the binning |
| and rendering phases. |
| |
| Note that this means that the generated draw state has to always update all of |
| the state you have chosen to pack into that ``GROUP_ID``, since any of your |
| previous state changes in a previous draw state command may have been skipped. |
| |
| Pipelining (a6xx+) |
| ^^^^^^^^^^^^^^^^^^ |
| |
| Most CP commands write to registers. In a6xx+, the registers are located in |
| clusters corresponding to the stage of the pipeline they are used from (see |
| ``enum tu_stage`` for a list). To pipeline state updates and drawing, registers |
| generally have two copies ("contexts") in their cluster, so previous draws can |
| be working on the previous set of register state while the next draw's state is |
| being set up. You can find what registers go into which clusters by looking at |
| :command:`crashdec` output in the ``regs-name: CP_MEMPOOL`` section. |
| |
| As SQE processes register writes in the command stream, it sends them into a |
| per-cluster queue stored in ``CP_MEMPOOL``. This allows the pipeline stages to |
| process their stream of register updates and events independent of each other |
| (so even with just 2 contexts in a stage, earlier stages can proceed on to later |
| draws before later stages have caught up). |
| |
| Each cluster has a per-context bit indicating that the context is done/free. |
| Register writes will stall on the context being done. |
| |
During a 3D draw command, the SQE generates several internal events that flow
through the pipeline:
| |
| - ``CP_EVENT_START`` clears the done bit for the context when written to the |
| cluster |
| - ``PC_EVENT_CMD``/``PC_DRAW_CMD``/``HLSQ_EVENT_CMD``/``HLSQ_DRAW_CMD`` kick off |
| the actual event/drawing. |
| - ``CONTEXT_DONE`` event completes after the event/draw is complete and sets the |
| done flag. |
| - ``CP_EVENT_END`` waits for the done flag on the next context, then copies all |
| the registers that were dirtied in this context to that one. |
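
The sequence can be modeled as a toy two-context cluster (illustrative only;
the real hardware does this per cluster, in parallel with the SQE):

.. code-block:: python

   class Cluster:
       def __init__(self):
           self.ctx = [{}, {}]       # two register contexts
           self.done = [True, True]  # per-context done/free bits
           self.cur = 0
           self.dirty = {}

       def event_start(self):
           # CP_EVENT_START: claim the current context.
           self.done[self.cur] = False
           self.dirty = {}

       def write_reg(self, reg, value):
           self.ctx[self.cur][reg] = value
           self.dirty[reg] = value

       def context_done(self):
           # CONTEXT_DONE: the draw using this context finished.
           self.done[self.cur] = True

       def event_end(self):
           # CP_EVENT_END: wait for the next context to be free, then
           # copy the dirtied registers into it and roll over.
           nxt = 1 - self.cur
           assert self.done[nxt], "real HW would stall here"
           self.ctx[nxt].update(self.dirty)
           self.cur = nxt

   c = Cluster()
   c.event_start()
   c.write_reg("GRAS_SU_CNTL", 0x5)
   c.context_done()
   c.event_end()
   # The dirtied register was carried into the next context:
   assert c.ctx[1]["GRAS_SU_CNTL"] == 0x5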
| |
| The 2D blit engine has its own ``CP_2D_EVENT_START``, ``CP_2D_EVENT_END``, |
| ``CONTEXT_DONE_2D``, so 2D and 3D register contexts can do separate context |
| rollover. |
| |
| Because the clusters proceed independently of each other even across draws, if |
| you need to synchronize an earlier cluster to the output of a later one, then |
| you will need to ``CP_WAIT_FOR_IDLE`` after flushing and invalidating any |
| necessary caches. |
| |
| Also, note that some registers are not banked at all, and will require a |
| ``CP_WAIT_FOR_IDLE`` for any previous usage of the register to complete. |
| |
In a2xx-a4xx, there weren't per-stage clusters; instead there were two
register banks that were flipped between on each draw.
| |
| Bindless/Bindful Descriptors (a6xx+) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
Starting with a6xx, cat5 (texture) and cat6 (image/SSBO/UBO) instructions are
extended to support bindless descriptors.
| |
| In the old bindful model, descriptors are separate for textures, samplers, |
| UBOs, and IBOs (combined descriptor for images and SSBOs), with separate |
| registers for the memory containing the array of descriptors, and/or different |
| ``STATE_TYPE`` and ``STATE_BLOCK`` for ``CP_LOAD_STATE``/``_FRAG``/``_GEOM`` |
| to pre-load the descriptors into cache. |
| |
| - textures - per-shader-stage |
| - registers: ``SP_xS_TEX_CONST``/``SP_xS_TEX_COUNT`` |
| - state-type: ``ST6_CONSTANTS`` |
| - state-block: ``SB6_xS_TEX`` |
| - samplers - per-shader-stage |
| - registers: ``SP_xS_TEX_SAMP`` |
| - state-type: ``ST6_SHADER`` |
| - state-block: ``SB6_xS_TEX`` |
| - UBOs - per-shader-stage |
| - registers: none |
| - state-type: ``ST6_UBO`` |
| - state-block: ``SB6_xS_SHADER`` |
| - IBOs - global across shader 3d stages, separate for compute shader |
| - registers: ``SP_IBO``/``SP_IBO_COUNT`` or ``SP_CS_IBO``/``SP_CS_IBO_COUNT`` |
| - state-type: ``ST6_SHADER`` |
  - state-block: ``SB6_IBO`` or ``SB6_CS_IBO`` for compute shaders
| - Note, unlike per-shader-stage descriptors, ``CP_LOAD_STATE6`` is used, |
| as opposed to ``CP_LOAD_STATE6_GEOM`` or ``CP_LOAD_STATE6_FRAG`` |
| depending on shader stage. |
| |
| .. note:: |
| For the per-shader-stage registers and state-blocks the ``xS`` notation |
| refers to per-shader-stage names, ex. ``SP_FS_TEX_CONST`` or ``SB6_DS_TEX`` |
| |
Textures and IBOs (images) use *basically* the same 64-byte descriptor format
with some exceptions (for example, for IBOs cubemaps are handled as 2D arrays).
SSBOs are just untyped buffers, but otherwise use the same descriptors and
instructions as images. Samplers use a 16-byte descriptor, and UBOs use an
8-byte descriptor which packs the size in the upper 15 bits of the UBO address.
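
As a sketch of the 8-byte UBO descriptor described above (the exact field
layout lives in the register database; bit positions here are illustrative):

.. code-block:: python

   # 64-bit descriptor: UBO address in the low bits, size packed into
   # the upper 15 bits.
   ADDR_BITS = 64 - 15

   def pack_ubo_descriptor(iova, size):
       assert iova < (1 << ADDR_BITS) and size < (1 << 15)
       return (size << ADDR_BITS) | iova

   def unpack_ubo_descriptor(desc):
       return desc & ((1 << ADDR_BITS) - 1), desc >> ADDR_BITS

   desc = pack_ubo_descriptor(0x1000, 32)
   assert unpack_ubo_descriptor(desc) == (0x1000, 32)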
| |
In the bindless model, descriptors are split into 5 descriptor sets, which are
global across shader stages (but as with bindful IBO descriptors, separate for
3d stages vs the compute stage). Each HW descriptor set is an array of
descriptors of configurable size (each descriptor set can be configured for a
descriptor pitch of 8 bytes or 64 bytes). Each descriptor can be of arbitrary
format (i.e. UBOs/IBOs/textures/samplers interleaved); its interpretation by
the HW is determined by the instruction that references the descriptor. Each
descriptor set can contain at least 2^16 descriptors.
| |
The HW is configured with the base address of the descriptor set via an array
of ``BINDLESS_BASE`` registers, i.e. ``SP_BINDLESS_BASE[n]``/``HLSQ_BINDLESS_BASE[n]``
for 3d shader stages, or ``SP_CS_BINDLESS_BASE[n]``/``HLSQ_CS_BINDLESS_BASE[n]``
for compute shaders, with the descriptor pitch encoded in the low bits.
Which of the descriptor sets is referenced is encoded via three bits in the
instruction. The address of the descriptor is calculated as::
| |
| descriptor_addr = (BINDLESS_BASE[n] & ~0x3) + |
| (idx * 4 * (2 << BINDLESS_BASE[n] & 0x3)) |
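
Written out with explicit precedence (the ``& 0x3`` selects the pitch bits
before the shift), a worked example:

.. code-block:: python

   def descriptor_addr(bindless_base, idx):
       pitch_bits = bindless_base & 0x3   # 0 -> 8-byte, 3 -> 64-byte pitch
       base = bindless_base & ~0x3
       return base + idx * 4 * (2 << pitch_bits)

   # With pitch bits 0, consecutive descriptors are 8 bytes apart:
   assert descriptor_addr(0x10000 | 0, 1) - descriptor_addr(0x10000 | 0, 0) == 8
   # With pitch bits 3, they are 64 bytes apart:
   assert descriptor_addr(0x10000 | 3, 1) - descriptor_addr(0x10000 | 3, 0) == 64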
| |
| |
| .. note:: |
| Turnip reserves one descriptor set for internal use and exposes the other |
| four for the application via the Vulkan API. |
| |
| Software Architecture |
| --------------------- |
| |
| Freedreno and Turnip use a shared core for shader compiler, image layout, and |
| register and command stream definitions. They implement separate state |
| management and command stream generation. |
| |
| .. toctree:: |
| :glob: |
| |
| freedreno/* |
| |
| GPU devcoredump |
| ^^^^^^^^^^^^^^^^^^ |
| |
| A kernel message from DRM of "gpu fault" can mean any sort of error reported by |
| the GPU (including its internal hang detection). If a fault in GPU address |
| space happened, you should expect to find a message from the iommu, with the |
| faulting address and a hardware unit involved: |
| |
| .. code-block:: text |
| |
| *** gpu fault: ttbr0=000000001c941000 iova=000000010066a000 dir=READ type=TRANSLATION source=TP|VFD (0,0,0,1) |
| |
On a GPU fault or hang, a GPU core dump is taken by the DRM driver and saved to
``/sys/devices/virtual/devcoredump/**/data``. You can copy that file to a
:file:`crash.devcore` file to save it; otherwise the kernel will expire it
eventually. Echo 1 to the file to free the core early, as another core won't be
taken until then.
| |
| Once you have your core file, you can use :command:`crashdec -f crash.devcore` |
| to decode it. The output will have ``ESTIMATED CRASH LOCATION`` where we |
| estimate the CP to have stopped. Note that it is expected that this will be |
| some distance past whatever state triggered the fault, given GPU pipelining, and |
| will often be at some ``CP_REG_TO_MEM`` (which waits on previous WFIs) or |
| ``CP_WAIT_FOR_ME`` (which waits for all register writes to land) or similar |
| event. You can try running the workload with ``TU_DEBUG=flushall`` or |
| ``FD_MESA_DEBUG=flush`` to try to close in on the failing commands. |
| |
| You can also find what commands were queued up to each cluster in the |
| ``regs-name: CP_MEMPOOL`` section. |
| |
If there is no ``ESTIMATED CRASH LOCATION``, you can look at ``CP_SQE_STAT``
instead, though this is a last resort and likely won't be helpful.
| |
| .. code-block:: |
| |
| indexed-registers: |
| - regs-name: CP_SQE_STAT |
| dwords: 51 |
| PC: 00d7 <------------- |
| PKT: CP_LOAD_STATE6_FRAG |
| $01: 70348003 $11: 00000000 |
| $02: 20000000 $12: 00000022 |
| |
The ``PC`` value is an instruction address in the current firmware. You will
need to disassemble the firmware (:file:`/lib/firmware/qcom/aXXX_sqe.fw`) via:
| |
| .. code-block:: sh |
| |
| afuc-disasm -v a650_sqe.fw > a650_sqe.fw.disasm |
| |
Now search for the ``PC`` value in the disassembly, e.g.:
| |
| .. code-block:: |
| |
| l018: 00d1: 08dd0001 add $addr, $06, 0x0001 |
| 00d2: 981ff806 mov $data, $data |
| 00d3: 8a080001 mov $08, 0x0001 << 16 |
| 00d4: 3108ffff or $08, $08, 0xffff |
| 00d5: 9be8f805 and $data, $data, $08 |
| 00d6: 9806e806 mov $addr, $06 |
| 00d7: 9803f806 mov $data, $03 <------------- HERE |
| 00d8: d8000000 waitin |
| 00d9: 981f0806 mov $01, $data |
| |
| |
| Command Stream Capture |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| During Mesa development, it's often useful to look at the command streams we |
| send to the kernel. We have an interface for the kernel to capture all |
| submitted command streams: |
| |
| .. code-block:: sh |
| |
| cat /sys/kernel/debug/dri/0/rd > cmdstream & |
| |
| By default, command stream capture does not capture texture/vertex/etc. data. |
| You can enable capturing all the BOs with: |
| |
| .. code-block:: sh |
| |
| echo Y > /sys/module/msm/parameters/rd_full |
| |
| Note that, since all command streams get captured, it is easy to run the system |
| out of memory doing this, so you probably don't want to enable it during play of |
| a heavyweight game. Instead, to capture a command stream within a game, you |
| probably want to cause a crash in the GPU during a frame of interest so that a |
| single GPU core dump is generated. Emitting ``0xdeadbeef`` in the CS should be |
| enough to cause a fault. |
| |
| ``fd_rd_output`` facilities provide support for generating the command stream |
| capture from inside Mesa. Different ``FD_RD_DUMP`` options are available: |
| |
- ``enable`` simply enables dumping the command stream on each submit for a
  given logical device. When a more advanced option is specified, ``enable``
  is implied.
| - ``combine`` will combine all dumps into a single file instead of writing the |
| dump for each submit into a standalone file. |
| - ``full`` will dump every buffer object, which is necessary for replays of |
| command streams (see below). |
| - ``trigger`` will establish a trigger file through which dumps can be better |
| controlled. Writing a positive integer value into the file will enable dumping |
| of that many subsequent submits. Writing -1 will enable dumping of submits |
| until disabled. Writing 0 (or any other value) will disable dumps. |
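
The trigger semantics can be sketched as follows (a model of the behavior
described above, not the actual ``fd_rd_output`` code):

.. code-block:: python

   class RdTrigger:
       def __init__(self):
           self.remaining = 0  # 0: disabled, -1: until disabled, N: N submits

       def write(self, value):
           self.remaining = value if value > 0 or value == -1 else 0

       def should_dump_submit(self):
           if self.remaining == -1:
               return True
           if self.remaining > 0:
               self.remaining -= 1
               return True
           return False

   t = RdTrigger()
   t.write(2)
   assert [t.should_dump_submit() for _ in range(3)] == [True, True, False]
   t.write(-1)
   assert all(t.should_dump_submit() for _ in range(5))
   t.write(0)
   assert not t.should_dump_submit()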
| |
Output dump files and the trigger file (when enabled) are hard-coded to be
placed under ``/tmp``, or ``/data/local/tmp`` under Android.
``FD_RD_DUMP_TESTNAME`` can be used to specify a more descriptive prefix for
the output or trigger files.
| |
| Functionality is generic to any Freedreno-based backend, but is currently only |
| integrated in the MSM backend of Turnip. Using the existing ``TU_DEBUG=rd`` |
| option will translate to ``FD_RD_DUMP=enable``. |
| |
| Capturing Hang RD |
| +++++++++++++++++ |
| |
The devcore file doesn't contain all submitted command streams, only the
hanging one. Additionally, it is geared towards analyzing the GPU state at the
moment of the crash.
| |
| Alternatively, it's possible to obtain the whole submission with all command |
| streams via ``/sys/kernel/debug/dri/0/hangrd``: |
| |
| .. code-block:: sh |
| |
   sudo cat /sys/kernel/debug/dri/0/hangrd > logfile.rd  # do the cat _before_ the expected hang
| |
The format of ``hangrd`` is the same as in an ordinary command stream capture.
``rd_full`` also has the same effect on it.
| |
| Replaying Command Stream |
| ^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
The ``replay`` tool allows capturing and replaying ``rd`` files to reproduce
GPU faults. It is especially useful for transient GPU issues, since replaying
has a much higher chance of reproducing them.
| |
| Dumping rendering results or even just memory is currently unsupported. |
| |
- Replaying command streams requires a kernel with ``MSM_INFO_SET_IOVA`` support.
- The ``rd`` capture must have full snapshots of the memory (``rd_full`` enabled).
| |
| Replaying is done via ``replay`` tool: |
| |
| .. code-block:: sh |
| |
| ./replay test_replay.rd |
| |
| More examples: |
| |
| .. code-block:: sh |
| |
| ./replay --first=start_submit_n --last=last_submit_n test_replay.rd |
| |
| .. code-block:: sh |
| |
| ./replay --override=0 test_replay.rd |
| |
| Editing Command Stream (a6xx+) |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| While replaying a fault is useful in itself, modifying the capture to |
| understand what causes the fault could be even more useful. |
| |
``rddecompiler`` decompiles a single cmdstream from an ``rd`` file into
compilable C source. Given the address space bounds, the generated program
creates a new ``rd`` which can be used to override the cmdstream with
``replay``. The generated ``rd`` is not replayable on its own and depends on
buffers provided by the source ``rd``.

The C source can be compiled by putting it into
:file:`src/freedreno/decode/generate-rd.cc`.
| |
| The workflow would look like this: |
| |
1. Find the number of the cmdstream you want to edit;
| 2. Decompile it: |
| |
| .. code-block:: sh |
| |
| ./rddecompiler -s %cmd_stream_n% example.rd > src/freedreno/decode/generate-rd.cc |
| |
3. Edit the command stream;
| 4. Compile and deploy freedreno tools; |
| 5. Plug the generator into cmdstream replay: |
| |
| .. code-block:: sh |
| |
   ./replay --override=%cmd_stream_n%
| |
| 6. Repeat 3-5. |
| |
| GPU Hang Debugging |
| ^^^^^^^^^^^^^^^^^^ |
| |
Not a step-by-step guide, but mostly an enumeration of methods.
| |
| Useful ``TU_DEBUG`` (for Turnip) options to narrow down the hang cause: |
| |
| ``sysmem``, ``gmem``, ``nobin``, ``forcebin``, ``noubwc``, ``nolrz``, ``flushall``, ``syncdraw``, ``rast_order`` |
| |
| Useful ``FD_MESA_DEBUG`` (for Freedreno) options: |
| |
| ``sysmem``, ``gmem``, ``nobin``, ``noubwc``, ``nolrz``, ``notile``, ``dclear``, ``ddraw``, ``flush``, ``inorder``, ``noblit`` |
| |
| Useful ``IR3_SHADER_DEBUG`` options: |
| |
| ``nouboopt``, ``spillall``, ``nopreamble``, ``nofp16`` |
| |
Use Graphics Flight Recorder to narrow down the place which hangs; use our own
breadcrumbs implementation in case of unrecoverable hangs.

In case of faults, use RenderDoc to find the problematic command. If it's a
draw call, edit the shader in RenderDoc to find whether the culprit is the
shader. If so, bisect it.
| |
If editing the shader perturbs the assembly too much and the issue becomes
unreproducible, try editing the assembly itself via ``IR3_SHADER_OVERRIDE_PATH``.
| |
If the culprit is not a shader, or the fault or hang is transient, try
capturing an ``rd`` and replaying it. If the issue reproduces, bisect the GPU
packets until the culprit is found.
| |
The hang recovery mechanism in the kernel is not perfect. In case of
unrecoverable hangs, check whether the kernel is up to date and look for
unmerged patches which could improve the recovery.
| |
| GPU Breadcrumbs |
| +++++++++++++++ |
| |
| Breadcrumbs described below are available only in Turnip. |
| |
Freedreno has simpler breadcrumbs: in debug builds it writes breadcrumbs into
``CP_SCRATCH_REG[6]`` and per-tile breadcrumbs into ``CP_SCRATCH_REG[7]``, so
they are available in the devcoredump. TODO: generalize Turnip's breadcrumbs
implementation.
| |
This is a simple implementation of breadcrumb tracking of GPU progress,
intended as a last resort when debugging unrecoverable hangs. For best results,
use Vulkan traces to have a predictable place of hang.
| |
For ordinary hangs, GFR (Graphics Flight Recorder) is a more user-friendly
solution.
| |
Our breadcrumbs implementation aims to handle cases where nothing can be done
after the hang. In-driver breadcrumbs also allow more precise tracking, since
we can target a single GPU packet.
| |
While breadcrumbs support gmem, try to reproduce the hang in sysmem mode,
since it requires far fewer breadcrumb writes and syncs.
| |
| Breadcrumbs settings: |
| |
| .. code-block:: sh |
| |
| TU_BREADCRUMBS=%IP%:%PORT%,break=%BREAKPOINT%:%BREAKPOINT_HITS% |
| |
``BREAKPOINT``
   The breadcrumb starting from which we require an explicit ack.
``BREAKPOINT_HITS``
   How many times the breakpoint should be reached for the break to occur.
   This is necessary for gmem mode and for re-usable cmdbuffers, in both of
   which the same cmdstream could be executed several times.
| |
A typical workflow would be:
| |
| - Start listening for breadcrumbs on a remote host: |
| |
| .. code-block:: sh |
| |
| nc -lvup $PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}' |
| |
| - Start capturing command stream; |
| - Replay the hanging trace with: |
| |
| .. code-block:: sh |
| |
| TU_BREADCRUMBS=$IP:$PORT,break=-1:0 |
| |
| - Increase hangcheck period: |
| |
| .. code-block:: sh |
| |
| echo -n 60000 > /sys/kernel/debug/dri/0/hangcheck_period_ms |
| |
| - After GPU hang note the last breadcrumb and relaunch trace with: |
| |
| .. code-block:: sh |
| |
| TU_BREADCRUMBS=%IP%:%PORT%,break=%LAST_BREADCRUMB%:%HITS% |
| |
- After the breakpoint is reached, each breadcrumb requires an explicit ack
  from the user. This way it's possible to find the last packet which didn't
  hang.
| |
| - Find the packet in the decoded cmdstream. |
| |
| Debugging random failures |
| ^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| In most cases random GPU faults and rendering artifacts are caused by some kind |
| of undefined behavior that falls under the following categories: |
| |
| - Usage of a stale reg value; |
| - Usage of stale memory (e.g. expecting it to be zeroed when it is not); |
- Lack of proper synchronization.
| |
| Finding instances of stale reg reads |
| ++++++++++++++++++++++++++++++++++++ |
| |
Turnip has a debug option to stomp registers with invalid values to catch
cases where stale data is read.
| |
| .. code-block:: sh |
| |
| MESA_VK_ABORT_ON_DEVICE_LOSS=1 \ |
| TU_DEBUG_STALE_REGS_RANGE=0x00000c00,0x0000be01 \ |
| TU_DEBUG_STALE_REGS_FLAGS=cmdbuf,renderpass \ |
| ./app |
| |
.. envvar:: TU_DEBUG_STALE_REGS_RANGE

   The register range in which registers will be stomped. Add ``inverse`` to
   the flags in order for this range to specify which registers should NOT be
   stomped.
| |
| .. envvar:: TU_DEBUG_STALE_REGS_FLAGS |
| |
| ``cmdbuf`` |
| stomp registers at the start of each command buffer. |
| ``renderpass`` |
| stomp registers before each render pass. |
| ``inverse`` |
| changes ``TU_DEBUG_STALE_REGS_RANGE`` meaning to |
| "regs that should NOT be stomped". |
| |
The best way to pinpoint the reg which causes a failure is to bisect the reg
range. In cases where a failure is caused by a combination of several
registers, the ``inverse`` flag may be set to find the reg which prevents the
failure.
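
The bisection itself can be sketched as follows (hypothetical helper;
``fails`` stands in for running the workload with
``TU_DEBUG_STALE_REGS_RANGE`` set to the candidate range and checking for the
failure):

.. code-block:: python

   def bisect_regs(lo, hi, fails):
       # Invariant: stomping [lo, hi] reproduces the failure.
       while lo < hi:
           mid = (lo + hi) // 2
           if fails(lo, mid):
               hi = mid
           else:
               lo = mid + 1
       return lo

   # Pretend reg 0x8c17 is the one that breaks when stomped:
   culprit = 0x8c17
   assert bisect_regs(0x0c00, 0xbe01,
                      lambda a, b: a <= culprit <= b) == culprit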