| Asahi |
| ===== |
| |
| The Asahi driver aims to provide an OpenGL implementation for the Apple M1. |
| |
| Wrap (macOS only) |
| ----------------- |
| |
| Mesa includes a library that wraps the key IOKit entrypoints used in the macOS |
| UABI for AGX. The wrapped routines print information about the kernel calls made |
| and dump work submitted to the GPU using agxdecode. This facilitates |
| reverse-engineering the hardware by acting as glue to get at the "interesting" |
| GPU memory. |
| |
| The library is only built if ``-Dtools=asahi`` is passed. It builds a single |
| ``libwrap.dylib`` file, which should be inserted into a process with the |
| ``DYLD_INSERT_LIBRARIES`` environment variable. |
| |
| For example, to trace an app ``./app``, run: |
| |
| .. code-block:: sh |
| |
| DYLD_INSERT_LIBRARIES=~/mesa/build/src/asahi/lib/libwrap.dylib ./app |
| |
| Hardware varyings |
| ----------------- |
| |
| At an API level, vertex shader outputs need to be interpolated to become |
| fragment shader inputs. This process is logically pipelined in AGX, with a value |
| traveling from a vertex shader to remapping hardware to coefficient register |
| setup to the fragment shader to the iterator hardware. Each stage is described |
| below. |
| |
| Vertex shader |
| ````````````` |
| |
| A vertex shader (running on the :term:`Unified Shader Cores`) outputs varyings with the |
| ``st_var`` instruction. ``st_var`` takes a *vertex output index* and a 32-bit |
| value. The maximum number of *vertex outputs* is specified as the "output count" |
| of the shader in the "Bind Vertex Pipeline" packet. The value may be interpreted |
| as a single 32-bit value or an aligned 16-bit register pair, depending |
| on whether interpolation should happen at 32-bit or 16-bit. Vertex outputs are |
| indexed starting from 0, with the *vertex position* always coming first, the |
| 32-bit user varyings coming next with perspective, flat, and linear interpolated |
| varyings grouped in that order, then 16-bit user varyings with the same groupings, |
| and finally *point size* and *clip distances* at the end if present. Note that |
| *clip distances* are not accessible from the fragment shader; if the fragment |
| shader needs to read the interpolated clip distance, the vertex shader must |
| *also* write the clip distance values to a user varying for the fragment shader |
| to interpolate. Also note there is no clip plane enable mask anywhere; that must |
| be lowered for APIs that require it (OpenGL but not Vulkan). |
| |
| .. list-table:: Ordering of vertex outputs with all outputs used |
| :widths: 25 75 |
| :header-rows: 1 |
| |
| * - Size (words) |
| - Value |
| * - 4 |
| - Vertex position |
| * - 1 |
| - 32-bit smooth varying 0 |
| * - |
| - ... |
| * - 1 |
| - 32-bit smooth varying m |
| * - 1 |
| - 32-bit flat varying 0 |
| * - |
| - ... |
| * - 1 |
| - 32-bit flat varying n |
| * - 1 |
| - 32-bit linear varying 0 |
| * - |
| - ... |
| * - 1 |
| - 32-bit linear varying o |
| * - 1 |
| - Packed pair of 16-bit smooth varyings 0 |
| * - |
| - ... |
| * - 1 |
| - Packed pair of 16-bit smooth varyings p |
| * - 1 |
| - Packed pair of 16-bit flat varyings 0 |
| * - |
| - ... |
| * - 1 |
| - Packed pair of 16-bit flat varyings q |
| * - 1 |
| - Packed pair of 16-bit linear varyings 0 |
| * - |
| - ... |
| * - 1 |
| - Packed pair of 16-bit linear varyings r |
| * - 1 |
| - Point size |
| * - 1 |
| - Clip distance for plane 0 |
| * - |
| - ... |
| * - 1 |
| - Clip distance for plane 15 |
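| |
| The ordering above can be sketched as a running word offset. The following is a |
| minimal Python illustration, not driver code: the function and group names are |
| hypothetical, and all counts are assumed to be given in 32-bit words, with one |
| word per packed pair of 16-bit varyings. |
| |
```python
def vertex_output_offsets(smooth32=0, flat32=0, linear32=0,
                          smooth16=0, flat16=0, linear16=0,
                          point_size=False, clip_distances=0):
    """Return ({group: first word index}, total words) for the table above.

    All counts are in 32-bit words; a packed pair of 16-bit varyings is one
    word. The names are hypothetical and chosen for illustration only.
    """
    groups = [
        ("position", 4),          # vertex position always comes first
        ("smooth32", smooth32),   # 32-bit perspective (smooth) varyings
        ("flat32", flat32),       # 32-bit flat varyings
        ("linear32", linear32),   # 32-bit linear varyings
        ("smooth16", smooth16),   # packed pairs of 16-bit smooth varyings
        ("flat16", flat16),       # packed pairs of 16-bit flat varyings
        ("linear16", linear16),   # packed pairs of 16-bit linear varyings
        ("point_size", 1 if point_size else 0),
        ("clip_distances", clip_distances),
    ]
    offsets = {}
    index = 0
    for name, words in groups:
        if words:                 # groups that are absent take no space
            offsets[name] = index
        index += words
    return offsets, index


offsets, total = vertex_output_offsets(smooth32=2, flat32=1, point_size=True)
print(offsets, total)
```
| |
| Here ``total`` corresponds to the number of words the shader writes, which is |
| bounded by the "output count" described above. |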
| |
| Remapping |
| ````````` |
| |
| Vertex outputs are remapped to varying slots to be interpolated. |
| The output of remapping consists of the following items: the *W* fragment |
| coordinate, the *Z* fragment coordinate, and user varyings in the vertex |
| output order. *Z* may be omitted, but *W* may not be. This remapping is |
| configured by the "Output select" word. |
| |
| .. list-table:: Ordering of remapped slots |
| :widths: 25 75 |
| :header-rows: 1 |
| |
| * - Index |
| - Value |
| * - 0 |
| - Fragment coord W |
| * - 1 |
| - Fragment coord Z |
| * - 2 |
| - 32-bit varying 0 |
| * - |
| - ... |
| * - 2 + m |
| - 32-bit varying m |
| * - 2 + m + 1 |
| - Packed pair of 16-bit varyings 0 |
| * - |
| - ... |
| * - 2 + m + n + 1 |
| - Packed pair of 16-bit varyings n |
| |
| Coefficient registers |
| ````````````````````` |
| |
| The fragment shader does not see the physical slots. |
| Instead, it references varyings through *coefficient registers*. A coefficient |
| register is a register whose contents are constant across all fragment shader |
| invocations in a given polygon. Physically, it contains the values output by the |
| vertex shader for each vertex of the polygon. Coefficient registers are |
| preloaded with values from varying slots. This preloading appears to occur in |
| fixed-function hardware, a simplification over PowerVR, which requires a |
| specialized program for the programmable data sequencer to do the preload. |
| |
| The "Bind fragment pipeline" packet points to coefficient register bindings, |
| preceded by a header. The header contains the number of 32-bit varying slots. As |
| the *W* slot is always present, this field is always nonzero. Slots whose index |
| is below this count are treated as 32-bit. The remaining slots are treated as |
| 16-bit. |
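| |
| As a rough illustration of how the header count classifies slots, the sketch |
| below builds the slot list in the remapped order and marks each slot as 32-bit |
| or 16-bit. The function name is hypothetical. It also assumes that *Z*, when |
| present, counts as a 32-bit slot and that later slots shift down by one when |
| *Z* is omitted; neither detail is stated above. |
| |
```python
def slot_widths(num_32bit_varyings, num_16bit_pairs, has_z=True):
    """Return (count_32, widths) for the remapped varying slots.

    count_32 models the header field holding the number of 32-bit slots.
    Assumption: W is always slot 0, and Z (if present) is a 32-bit slot.
    """
    count_32 = 1 + (1 if has_z else 0) + num_32bit_varyings
    total = count_32 + num_16bit_pairs
    # Slots whose index is below count_32 are 32-bit; the rest are 16-bit.
    widths = [32 if index < count_32 else 16 for index in range(total)]
    return count_32, widths


print(slot_widths(2, 1))
```
| |
| Because *W* is always present, ``count_32`` is at least 1 in this model, which |
| matches the field being nonzero. |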
| |
| The header also contains the total number of coefficient registers bound. |
| |
| Each binding that follows maps a (vector of) varying slots to (consecutive) |
| coefficient registers. Some details about the varying (perspective |
| interpolation, flat shading, point sprites) are configured here. |
| |
| Coefficient registers may be ordered the same as the internal varying slots. |
| However, this may be inconvenient for some APIs that require a separable shader |
| model. For these APIs, the flexibility to mix-and-match slots and coefficient |
| registers allows mixing shaders without shader variants. In that case, the |
| bindings should be generated outside of the compiler. For simple APIs where the |
| bindings are fixed and known at compile-time, the bindings could be generated |
| within the compiler. |
| |
| Fragment shader |
| ``````````````` |
| |
| In the fragment shader, coefficient registers, identified by the prefix ``cf`` |
| followed by a decimal index, act as opaque handles to varyings. For flat |
| shading, coefficient registers may be loaded into general registers with the |
| ``ldcf`` instruction. For smooth shading, the coefficient register corresponding |
| to the desired varying is passed as an argument to the "iterate" instruction |
| ``iter`` in order to "iterate" (interpolate) a varying. As perspective-correct |
| interpolation also requires the W component of the fragment coordinate, the |
| coefficient register for W is passed as a second argument. As an example, if |
| there's a single varying to interpolate, an instruction like ``iter r0, cf1, cf0`` |
| is used. |
| |
| Iterator |
| ```````` |
| |
| To actually interpolate varyings, AGX provides fixed-function iteration hardware |
| to multiply the specified coefficient registers with the required barycentrics, |
| producing an interpolated value, hence the name "coefficient register". This |
| operation is purely mathematical and does not require any memory access, as |
| the required coefficients are preloaded before the shader begins execution. |
| That means the iterate instruction executes in constant time, does not signal |
| a data fence, and does not require the shader to wait on a data fence before |
| using the value. |
| |
| Image layouts |
| ------------- |
| |
| AGX supports several image layouts, described here. To work with image layouts |
| in the drivers, use the ail library, located in ``src/asahi/layout``. |
| |
| The simplest layout is **strided linear**. Pixels are stored in raster-order in |
| memory with a software-controlled stride. Strided linear images are useful for |
| working with modifier-unaware window systems; however, performance will suffer. |
| Strided linear images have numerous limitations: |
| |
| - Strides must be a multiple of 16 bytes. |
| - Strides must be nonzero. For 1D images where the stride is logically |
| irrelevant, ail will internally select the minimal stride. |
| - Only 1D, 2D, and 2D Array images may be linear. In particular, 3D images and |
| cube maps may not be linear. |
| - 2D images must not be mipmapped. |
| - Block-compressed formats and multisampled images are unsupported. Elements of |
| a strided linear image are simply pixels. |
| |
| With these limitations, addressing into a strided linear image is as simple as |
| |
| .. math:: |
| |
| \text{address} = (y \cdot \text{stride}) + (x \cdot \text{bytes per pixel}) |
| |
| In practice, this suffices for window system integration and little else. |
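| |
| The addressing formula, together with the stride restrictions listed above, can |
| be sketched as follows. This is an illustrative helper with a hypothetical |
| name, not the ail API. The check that a pixel fits within one row is an added |
| sanity check, not a documented hardware rule. |
| |
```python
def linear_address(x, y, stride, bytes_per_pixel):
    """Byte offset of pixel (x, y) in a strided linear image."""
    # Strides must be nonzero and a multiple of 16 bytes.
    if stride <= 0 or stride % 16 != 0:
        raise ValueError("stride must be a nonzero multiple of 16 bytes")
    # Sanity check (assumption): the pixel must lie within one row.
    if x < 0 or y < 0 or (x + 1) * bytes_per_pixel > stride:
        raise ValueError("pixel lies outside the row")
    return y * stride + x * bytes_per_pixel


print(linear_address(3, 2, stride=64, bytes_per_pixel=4))
```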
| |
| The most common uncompressed layout is **twiddled**. The image is divided into |
| power-of-two sized tiles. The tiles themselves are stored in raster-order. |
| Within each tile, elements (pixels/blocks) are stored in Morton (Z) order. |
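| |
| Morton order interleaves the bits of the two coordinates. The sketch below |
| shows the idea for a square tile only. It assumes that *x* occupies the even |
| bits and *y* the odd bits, which is a common convention and is not confirmed |
| here for AGX. The function name is hypothetical. |
| |
```python
def morton_index(x, y, bits):
    """Interleave the low `bits` bits of x and y into a Morton (Z) index.

    Assumption: x goes to even bit positions and y to odd bit positions.
    Valid for a square tile of (1 << bits) x (1 << bits) elements.
    """
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)
        index |= ((y >> i) & 1) << (2 * i + 1)
    return index


# Element order within a 2 x 2 tile under this assumption.
print([morton_index(x, y, 1) for y in range(2) for x in range(2)])
```
| |
| The element's byte offset within the tile is then the index multiplied by the |
| block size in bytes. |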
| |
| The tile size used depends on both the image size and the block size of the |
| image format. For large images, :math:`n \times n` or :math:`2n \times n` tiles |
| are used (:math:`n` power-of-two). :math:`n` is such that each page contains |
| exactly one tile. Only power-of-two block sizes are supported in hardware, |
| ensuring such a tile size always exists. The hardware uses 16 KiB pages, so tile |
| sizes are as follows: |
| |
| .. list-table:: Tile sizes for large images |
| :widths: 50 50 |
| :header-rows: 1 |
| |
| * - Bytes per block |
| - Tile size |
| * - 1 |
| - 128 x 128 |
| * - 2 |
| - 128 x 64 |
| * - 4 |
| - 64 x 64 |
| * - 8 |
| - 64 x 32 |
| * - 16 |
| - 32 x 32 |
| |
| The dimensions of large images are rounded up to be multiples of the tile size. |
| In addition, non-power-of-two large images have extra padding tiles when |
| mipmapping is used (see below). |
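| |
| The tile sizes in the table follow from fitting exactly one tile in a 16 KiB |
| page. The sketch below derives them and rounds image dimensions up to a |
| multiple of the tile size. The function names are hypothetical, dimensions are |
| assumed to be in elements (pixels or blocks), and the extra mipmap padding |
| tiles are not modeled. |
| |
```python
PAGE_SIZE = 16384  # bytes; the hardware uses 16 KiB pages


def large_tile_size(bytes_per_block):
    """Return (width, height) in elements of the tile filling one page."""
    is_pot = bytes_per_block > 0 and (bytes_per_block & (bytes_per_block - 1)) == 0
    if not is_pot or bytes_per_block > PAGE_SIZE:
        raise ValueError("block size must be a power of two within a page")
    elements = PAGE_SIZE // bytes_per_block
    log2_elements = elements.bit_length() - 1
    # n x n if the exponent is even, otherwise 2n x n.
    height = 1 << (log2_elements // 2)
    width = elements // height
    return width, height


def padded_size(width, height, bytes_per_block):
    """Round image dimensions (in elements) up to whole tiles."""
    tile_w, tile_h = large_tile_size(bytes_per_block)
    padded_w = -(-width // tile_w) * tile_w    # ceiling division
    padded_h = -(-height // tile_h) * tile_h
    return padded_w, padded_h


for bpb in (1, 2, 4, 8, 16):
    print(bpb, large_tile_size(bpb))
print(padded_size(300, 200, 4))
```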
| |
| That rounding would waste a great deal of memory for small images. If |
| an image is smaller than this tile size, a smaller tile size is used to reduce |
| the memory footprint. For small images, the tile size is :math:`m \times m` |
| where |
| |
| .. math:: |
| |
| m = 2^{\lceil \log_2( \min \{ \text{width}, \text{ height} \}) \rceil} |
| |
| In other words, small images use the smallest square power-of-two tile such that |
| the image's minor axis fits in one tile. |
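| |
| The formula for :math:`m` can be evaluated with integer arithmetic alone. The |
| following is a minimal sketch with a hypothetical function name. It takes the |
| dimensions in elements and does not check whether the image is actually |
| "small". |
| |
```python
def small_tile_size(width, height):
    """Smallest power of two m such that min(width, height) <= m."""
    minor = min(width, height)
    if minor < 1:
        raise ValueError("dimensions must be positive")
    # (minor - 1).bit_length() equals ceil(log2(minor)) for minor >= 1.
    return 1 << (minor - 1).bit_length()


print(small_tile_size(5, 20), small_tile_size(8, 8), small_tile_size(1, 3))
```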
| |
| For mipmapped images, tile sizes are determined independently for each level. |
| Typically, the first levels of an image are "large" and the remaining levels are |
| "small". This scheme reduces the memory footprint of mipmapping, compared to a |
| fixed tile size for the whole image. Each mip level is padded to fill at least |
| one cache line (128 bytes), ensuring no cache line contains multiple mip levels. |
| |
| There is a wrinkle: the dimensions of large mip levels in tiles are determined |
| by the dimensions of level 0. For power-of-two images, the two calculations are |
| equivalent. However, they differ subtly for non-power-of-two images. To |
| determine the number of tiles to allocate for level :math:`l`, the number of |
| tiles for level 0 should be right-shifted by :math:`2l`. That appears to divide |
| by :math:`2^l` in both width and height, matching the definition of mipmapping; |
| however, it rounds down incorrectly. To compensate, the level contains one extra |
| row, column, or both (with the corner) as required if any of the first :math:`l` |
| levels were rounded down. This hurts the memory footprint. However, it means |
| non-power-of-two integer multiplication is only required for level 0. |
| Calculating the sizes for subsequent levels requires only addition and bitwise |
| math. That simplifies the hardware (but complicates software). |
| |
| A 2D image consists of a full miptree (constructed as above) rounded up to the |
| page size (16 KiB). |
| |
| 3D images consist simply of an array of 2D layers (constructed as above). That |
| means cube maps, 2D arrays, cube map arrays, and 3D images all use the same |
| layout. The only difference is the number of layers. Notably, 3D images (like |
| ``GL_TEXTURE_3D``) reserve space even for mip levels that do not exist |
| logically. These extra levels pad out layers of 3D images to the size of the |
| first layer, simplifying layout calculations for both software and hardware. |
| Although the padding is logically unnecessary, it wastes little space compared |
| to the sizes of large mipmapped 3D textures. |
| |
| drm-shim (Linux only) |
| --------------------- |
| |
| Mesa includes a library that mocks out the DRM UABI used by the Asahi driver |
| stack, allowing the Mesa driver to run on non-M1 Linux hardware. This can be |
| useful for exercising the compiler. To build, use options: |
| |
| :: |
| |
| -Dgallium-drivers=asahi -Dtools=drm-shim |
| |
| Then run an OpenGL workload with environment variable: |
| |
| .. code-block:: sh |
| |
| LD_PRELOAD=~/mesa/build/src/asahi/drm-shim/libasahi_noop_drm_shim.so |
| |
| For example, to compile a shader with shaderdb and print some statistics along |
| with the IR: |
| |
| .. code-block:: sh |
| |
| ~/shader-db$ AGX_MESA_DEBUG=shaders,shaderdb ASAHI_MESA_DEBUG=precompile LD_PRELOAD=~/mesa/build/src/asahi/drm-shim/libasahi_noop_drm_shim.so ./run shaders/glmark/1-12.shader_test |
| |
| The drm-shim implementation for Asahi is located in ``src/asahi/drm-shim``. It |
| should be updated as new UABI is added. |
| |
| Hardware glossary |
| ----------------- |
| |
| AGX is a tiled renderer descended from the PowerVR architecture. Some hardware |
| concepts used in PowerVR GPUs appear in AGX. |
| |
| .. glossary:: :sorted: |
| |
| VDM |
| Vertex Data Master |
| Dispatches vertex shaders. |
| |
| PDM |
| Pixel Data Master |
| Dispatches pixel shaders. |
| |
| CDM |
| Compute Data Master |
| Dispatches compute kernels. |
| |
| USC |
| Unified Shader Cores |
| A unified shader core is a small CPU that runs shader code. The core is |
| unified because a single ISA is used for vertex, pixel and compute |
| shaders. This differs from older GPUs, where the vertex, fragment, and |
| compute shader stages have separate ISAs. |
| |
| PPP |
| Primitive Processing Pipeline |
| The Primitive Processing Pipeline is a hardware unit that does primitive |
| assembly. The PPP is between the :term:`VDM` and :term:`ISP`. |
| |
| ISP |
| Image Synthesis Processor |
| The Image Synthesis Processor is responsible for the rasterization stage |
| of the rendering pipeline. |
| |
| PBE |
| Pixel BackEnd |
| Hardware unit which writes to color attachments and images. Also the |
| name for a descriptor passed to :term:`PBE` instructions. |
| |
| UVS |
| Unified Vertex Store |
| Hardware unit which buffers the outputs of the vertex shader (varyings). |