| ANV |
| === |
| |
| Experimental features |
| --------------------- |
| |
| .. _`Bindless model`: |
| |
| Binding Model |
| ------------- |
| |
| Here is the ANV bindless binding model that was implemented for the |
| descriptor indexing feature of Vulkan 1.2: |
| |
| .. graphviz:: |
| |
| digraph G { |
| fontcolor="black"; |
| compound=true; |
| |
| subgraph cluster_1 { |
| label = "Binding Table (HW)"; |
| |
| bgcolor="cornflowerblue"; |
| |
| node [ style=filled,shape="record",fillcolor="white", |
| label="RT0" ] n0; |
| node [ label="RT1" ] n1; |
| node [ label="dynbuf0"] n2; |
| node [ label="set0" ] n3; |
| node [ label="set1" ] n4; |
| node [ label="set2" ] n5; |
| |
| n0 -> n1 -> n2 -> n3 -> n4 -> n5 [style=invis]; |
| } |
| subgraph cluster_2 { |
| label = "Descriptor Set 0"; |
| |
| bgcolor="burlywood3"; |
| fixedsize = true; |
| |
| node [ style=filled,shape="record",fillcolor="white", fixedsize = true, width=4, |
| label="binding 0 - STORAGE_IMAGE\n anv_storage_image_descriptor" ] n8; |
| node [ label="binding 1 - COMBINED_IMAGE_SAMPLER\n anv_sampled_image_descriptor" ] n9; |
| node [ label="binding 2 - UNIFORM_BUFFER\n anv_address_range_descriptor" ] n10; |
| node [ label="binding 3 - UNIFORM_TEXEL_BUFFER\n anv_storage_image_descriptor" ] n11; |
| |
| n8 -> n9 -> n10 -> n11 [style=invis]; |
| } |
| subgraph cluster_5 { |
| label = "Vulkan Objects" |
| |
| fontcolor="black"; |
| bgcolor="darkolivegreen4"; |
| |
| subgraph cluster_6 { |
| label = "VkImageView"; |
| |
| bgcolor=darkolivegreen3; |
| node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2, |
| label="surface_state" ] n12; |
| } |
| subgraph cluster_7 { |
| label = "VkSampler"; |
| |
| bgcolor=darkolivegreen3; |
| node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2, |
| label="sampler_state" ] n13; |
| } |
| subgraph cluster_8 { |
| label = "VkImageView"; |
| bgcolor="darkolivegreen3"; |
| |
| node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2, |
| label="surface_state" ] n14; |
| } |
| subgraph cluster_9 { |
| label = "VkBuffer"; |
| bgcolor=darkolivegreen3; |
| |
| node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2, |
| label="address" ] n15; |
| } |
| subgraph cluster_10 { |
| label = "VkBufferView"; |
| |
| bgcolor=darkolivegreen3; |
| node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2, |
| label="surface_state" ] n16; |
| } |
| |
| n12 -> n13 -> n14 -> n15 -> n16 [style=invis]; |
| } |
| |
| subgraph cluster_11 { |
| subgraph cluster_12 { |
| label = "CommandBuffer state stream"; |
| |
| bgcolor="gold3"; |
| node [ style=filled,shape="box",fillcolor="white", fixedsize = true, width=2, |
| label="surface_state" ] n17; |
| node [ label="surface_state" ] n18; |
| node [ label="surface_state" ] n19; |
| |
| n17 -> n18 -> n19 [style=invis]; |
| } |
| } |
| |
| n3 -> n8 [lhead=cluster_2]; |
| |
| n8 -> n12; |
| n9 -> n13; |
| n9 -> n14; |
| n10 -> n15; |
| n11 -> n16; |
| |
| n0 -> n17; |
| n1 -> n18; |
| n2 -> n19; |
| } |
| |
| |
| |
| The HW binding table is generated when the draw or dispatch commands |
| are emitted. Here are the types of entries one can find in the binding |
| table: |
| |
| - The currently bound descriptor sets, one entry per descriptor set |
| (our limit is 8). |
| |
| - For dynamic buffers, one entry per dynamic buffer. |
| |
| - For draw commands, render target entries if needed. |
| |
| The entries of the HW binding table for descriptor sets are |
| RENDER_SURFACE_STATE similar to what you would have for a normal |
| uniform buffer. The shader first emits reads from this buffer to get |
| the information it needs to access a surface, sampler, etc., and then |
| emits the appropriate message using the information gathered from the |
| descriptor set buffer. |
| |
| Each binding type entry gets an associated structure in memory |
| (``anv_storage_image_descriptor``, ``anv_sampled_image_descriptor``, |
| ``anv_address_range_descriptor``, ``anv_storage_image_descriptor``). |
| This is the information read by the shader. |
| |
| |
| .. _`Binding tables`: |
| |
| Binding Tables |
| -------------- |
| |
| Binding tables are arrays of 32-bit offset entries referencing surface |
| states. This is how shaders can refer to a binding table entry to read |
| or write a surface. For example, fragment shaders will often refer to |
| entry 0 as the first render target. |
| |
| The way binding tables are managed is fairly awkward. |
| |
| Each shader stage must have its binding table programmed through |
| a corresponding instruction |
| ``3DSTATE_BINDING_TABLE_POINTERS_*`` (each stage has its own). |
| |
| .. graphviz:: |
| |
| digraph structs { |
| node [shape=record]; |
| struct3 [label="{ binding tables\n area | { <bt4> BT4 | <bt3> BT3 | ... | <bt0> BT0 } }|{ surface state\n area |{<ss0> ss0|<ss1> ss1|<ss2> ss2|...}}"]; |
| struct3:bt0 -> struct3:ss0; |
| struct3:bt0 -> struct3:ss1; |
| } |
| |
| |
| The value programmed in the ``3DSTATE_BINDING_TABLE_POINTERS_*`` |
| instructions is not a 64-bit pointer but an offset from the address |
| programmed in ``STATE_BASE_ADDRESS::Surface State Base Address`` or |
| ``3DSTATE_BINDING_TABLE_POOL_ALLOC::Binding Table Pool Base Address`` |
| (available on Gfx11+). The offset value in |
| ``3DSTATE_BINDING_TABLE_POINTERS_*`` is also limited to a few bits |
| (not a full 32-bit value), meaning that as we use more and more binding |
| tables we need to reposition ``STATE_BASE_ADDRESS::Surface State Base |
| Address`` to make space for new binding table arrays. |
| |
| To make things even more awkward, the binding table entries are also |
| relative to ``STATE_BASE_ADDRESS::Surface State Base Address``, so as |
| we change ``STATE_BASE_ADDRESS::Surface State Base Address`` we need |
| to add that offset to the binding table entries. |
| |
| The way we deal with this is that we allocate 4GB of address space |
| (since the binding table entries can address 4GB of surface state |
| elements). We reserve the first gigabyte exclusively for binding |
| tables, so that wherever we position our binding table in that first |
| gigabyte, it can always refer to the surface states in the next 3GB. |
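| |
| The rebasing arithmetic described above can be sketched as follows. |
| This is only an illustration under stated assumptions: the constants |
| and the ``binding_table_entry`` helper are hypothetical names, not ANV |
| code, and the sketch assumes the Surface State Base Address has been |
| moved to the start of the binding table block in use. |
| |
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout constants for the 4GB region described above. */
#define BT_AREA_SIZE (UINT64_C(1) << 30) /* first 1GB: binding tables */
#define REGION_SIZE  (UINT64_C(4) << 30) /* whole 4GB region */

/*
 * Compute the 32-bit value stored in a binding table entry.
 *
 * bt_block_offset: offset in the region where Surface State Base Address
 *                  currently points (must be in the first gigabyte).
 * surface_offset:  offset in the region of the surface state (must be in
 *                  the last three gigabytes).
 */
static uint32_t
binding_table_entry(uint64_t bt_block_offset, uint64_t surface_offset)
{
   assert(bt_block_offset < BT_AREA_SIZE);
   assert(surface_offset >= BT_AREA_SIZE && surface_offset < REGION_SIZE);

   /* Entries are relative to the base address, so rebase the surface. */
   uint64_t relative = surface_offset - bt_block_offset;

   /* Always positive and below 4GB, so it fits in 32 bits. */
   assert(relative > 0 && relative < REGION_SIZE);
   return (uint32_t)relative;
}
```
| |
| Because every surface state lives above the first gigabyte and every |
| binding table block lives inside it, the subtraction can never go |
| negative or exceed what a 32-bit entry can hold, which is the point of |
| the 1GB/3GB split. |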
| |
| |
| .. _`Descriptor Set Memory Layout`: |
| |
| Descriptor Set Memory Layout |
| ---------------------------- |
| |
| Here is a representation of how the descriptor set bindings, and the |
| elements in each binding, are mapped to the descriptor set memory: |
| |
| .. graphviz:: |
| |
| digraph structs { |
| node [shape=record]; |
| rankdir=LR; |
| |
| struct1 [label="Descriptor Set | \ |
| <b0> binding 0\n STORAGE_IMAGE \n (array_length=3) | \ |
| <b1> binding 1\n COMBINED_IMAGE_SAMPLER \n (array_length=2) | \ |
| <b2> binding 2\n UNIFORM_BUFFER \n (array_length=1) | \ |
| <b3> binding 3\n UNIFORM_TEXEL_BUFFER \n (array_length=1)"]; |
| struct2 [label="Descriptor Set Memory | \ |
| <b0e0> anv_storage_image_descriptor|\ |
| <b0e1> anv_storage_image_descriptor|\ |
| <b0e2> anv_storage_image_descriptor|\ |
| <b1e0> anv_sampled_image_descriptor|\ |
| <b1e1> anv_sampled_image_descriptor|\ |
| <b2e0> anv_address_range_descriptor|\ |
| <b3e0> anv_storage_image_descriptor"]; |
| |
| struct1:b0 -> struct2:b0e0; |
| struct1:b0 -> struct2:b0e1; |
| struct1:b0 -> struct2:b0e2; |
| struct1:b1 -> struct2:b1e0; |
| struct1:b1 -> struct2:b1e1; |
| struct1:b2 -> struct2:b2e0; |
| struct1:b3 -> struct2:b3e0; |
| } |
| |
| Each binding in the descriptor set is allocated an array of |
| ``anv_*_descriptor`` data structures. The type of ``anv_*_descriptor`` |
| used for a binding is selected based on the ``VkDescriptorType`` of |
| the binding. |
| |
| The value of ``anv_descriptor_set_binding_layout::descriptor_offset`` |
| is a byte offset from the start of the descriptor set memory to the |
| associated binding. ``anv_descriptor_set_binding_layout::array_size`` |
| is the number of ``anv_*_descriptor`` elements in the descriptor set |
| memory from that offset for the binding. |
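| |
| As a minimal sketch of the offset bookkeeping described above, the |
| following packs bindings back to back and computes a per-binding byte |
| offset. The ``binding_layout`` struct, both helpers and the descriptor |
| sizes used with them are assumptions made for illustration. The real |
| ``anv_descriptor_set_binding_layout`` has more fields and the driver |
| may apply alignment rules that are ignored here. |
| |
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-binding record, loosely modeled on the
 * descriptor_offset/array_size fields described above. */
struct binding_layout {
   uint32_t descriptor_size;   /* size of one anv_*_descriptor (assumed) */
   uint32_t array_size;        /* number of elements in the binding */
   uint32_t descriptor_offset; /* filled in: byte offset of element 0 */
};

/* Pack bindings back to back and return the total set size in bytes. */
static uint32_t
layout_descriptor_set(struct binding_layout *bindings, size_t count)
{
   uint32_t offset = 0;
   for (size_t i = 0; i < count; i++) {
      bindings[i].descriptor_offset = offset;
      offset += bindings[i].descriptor_size * bindings[i].array_size;
   }
   return offset;
}

/* Byte offset of one array element inside the descriptor set memory. */
static uint32_t
element_offset(const struct binding_layout *binding, uint32_t index)
{
   assert(index < binding->array_size);
   return binding->descriptor_offset + index * binding->descriptor_size;
}
```
| |
| With the four bindings from the diagram (array lengths 3, 2, 1 and 1) |
| and assumed descriptor sizes of 8, 8, 16 and 8 bytes, the bindings |
| would start at byte offsets 0, 24, 40 and 56. |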
| |
| |
| Pipeline state emission |
| ----------------------- |
| |
| Vulkan initially started by baking as much state as possible in |
| pipelines. But extension after extension, more and more state has |
| become potentially dynamic. |
| |
| ANV tries to limit the number of times an instruction has to be packed |
| to reprogram part of the 3D pipeline state. The packing happens in 2 |
| places: |
| |
| - ``genX_pipeline.c`` where the non-dynamic state is emitted in the |
| pipeline batch. Chunks of the batches are copied into the command |
| buffer as a result of calling ``vkCmdBindPipeline()``, depending on |
| what changes from the previously bound graphics pipeline. |
| |
| - ``genX_gfx_state.c`` where the dynamic state is added to already |
| packed instructions from ``genX_pipeline.c``. |
| |
| The rule to know where to emit an instruction programming the 3D |
| pipeline is as follows: |
| |
| - If any field of the instruction can be made dynamic, it should be |
| emitted in ``genX_gfx_state.c``. |
| |
| - Otherwise, the instruction can be emitted in ``genX_pipeline.c``. |
| |
| When a piece of state programming is dynamic, it should have a |
| corresponding field in ``anv_gfx_dynamic_state`` and the |
| ``genX(cmd_buffer_flush_gfx_runtime_state)`` function should be |
| updated to ensure we minimize the number of times an instruction is |
| emitted. Each instruction should have an associated |
| ``ANV_GFX_STATE_*`` mask so that the dynamic emission code can tell |
| when to re-emit an instruction. |
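| |
| The dirty-mask idea can be illustrated with a small sketch. All names |
| below (``gfx_state``, ``set_line_width``, ``flush_gfx_state`` and the |
| ``STATE_*`` bits) are hypothetical stand-ins, not the actual |
| ``anv_gfx_dynamic_state`` or ``ANV_GFX_STATE_*`` definitions, and the |
| emission step is reduced to a counter. |
| |
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical dirty bits, standing in for the ANV_GFX_STATE_* masks. */
enum {
   STATE_VIEWPORT = 1 << 0,
   STATE_SCISSOR  = 1 << 1,
   STATE_RASTER   = 1 << 2,
   STATE_ALL      = (1 << 3) - 1,
};

struct gfx_state {
   uint32_t dirty;      /* instructions that must be re-emitted */
   float    line_width; /* example dynamic value feeding STATE_RASTER */
   unsigned emitted;    /* counts emitted instructions, for illustration */
};

/* Record a new dynamic value; only mark dirty if it actually changed. */
static void
set_line_width(struct gfx_state *s, float width)
{
   if (s->line_width != width) {
      s->line_width = width;
      s->dirty |= STATE_RASTER;
   }
}

/* Re-emit only the dirty instructions, then clear the mask. */
static void
flush_gfx_state(struct gfx_state *s)
{
   assert((s->dirty & ~(uint32_t)STATE_ALL) == 0);

   for (uint32_t bit = 1; bit <= (uint32_t)STATE_RASTER; bit <<= 1) {
      if (s->dirty & bit)
         s->emitted++; /* a real driver would pack and emit here */
   }
   s->dirty = 0;
}
```
| |
| Setting a value that matches the current one leaves the mask untouched, |
| so a later flush emits nothing for that instruction. This is the |
| behavior the mask is meant to provide. |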
| |
| |
| Generated indirect draws optimization |
| ------------------------------------- |
| |
| Indirect draws have traditionally been implemented on Intel HW by |
| loading the indirect parameters from memory into HW registers using |
| the command streamer's ``MI_LOAD_REGISTER_MEM`` instruction before |
| dispatching a draw call to the 3D pipeline. |
| |
| On recent products, the command streamer was found to be a |
| performance bottleneck, because it cannot dispatch draw calls fast |
| enough to keep the 3D pipeline busy. |
| |
| The solution to this problem is to change the way we deal with |
| indirect draws. Instead of loading HW registers with values using the |
| command streamer, we generate an entire set of ``3DPRIMITIVE`` |
| instructions using a shader. The generated instructions contain the |
| entire draw call parameters. This way the command streamer executes |
| only ``3DPRIMITIVE`` instructions and doesn't do any data loading from |
| memory or touch HW registers, feeding the 3D pipeline as fast as it |
| can. |
| |
| In ANV this is implemented in 2 different ways: |
| |
| By generating instructions directly into the command stream using a |
| side batch buffer. When ANV encounters the first indirect draw, it |
| generates a jump into the side batch. The side batch contains a draw |
| call using a generation shader for each indirect draw. We keep adding |
| more generation draws into the batch until we have to stop due to |
| command buffer end, secondary command buffer calls or a barrier |
| containing the access flag ``VK_ACCESS_INDIRECT_COMMAND_READ_BIT``. |
| The side batch buffer jumps back right after the instruction where it |
| was called. Here is a high level diagram showing how the generation |
| batch buffer writes in the main command buffer: |
| |
| .. graphviz:: |
| |
| digraph commands_mode { |
| rankdir = "LR" |
| "main-command-buffer" [ |
| label = "main command buffer|...|draw indirect0 start|<f0>jump to\ngeneration batch|<f1>|<f2>empty instruction0|<f3>empty instruction1|...|draw indirect0 end|...|draw indirect1 start|<f4>empty instruction0|<f5>empty instruction1|...|<f6>draw indirect1 end|..." |
| shape = "record" |
| ]; |
| "generation-command-buffer" [ |
| label = "generation command buffer|<f0>|<f1>write draw indirect0|<f2>write draw indirect1|...|<f3>exit jump" |
| shape = "record" |
| ]; |
| "main-command-buffer":f0 -> "generation-command-buffer":f0; |
| "generation-command-buffer":f1 -> "main-command-buffer":f2 [color="#0000ff"]; |
| "generation-command-buffer":f1 -> "main-command-buffer":f3 [color="#0000ff"]; |
| "generation-command-buffer":f2 -> "main-command-buffer":f4 [color="#0000ff"]; |
| "generation-command-buffer":f2 -> "main-command-buffer":f5 [color="#0000ff"]; |
| "generation-command-buffer":f3 -> "main-command-buffer":f1; |
| } |
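| |
| As a rough CPU-side model of what the generation step in this first |
| mode does, the sketch below copies each set of indirect parameters |
| into a pre-reserved slot. The parameter layout follows |
| ``VkDrawIndirectCommand`` from the Vulkan specification. The |
| ``draw_slot`` struct and ``generate_draws`` function are hypothetical |
| and do not reproduce the real ``3DPRIMITIVE`` encoding or the actual |
| generation shader. |
| |
```c
#include <assert.h>
#include <stdint.h>

/* Layout of VkDrawIndirectCommand as defined by the Vulkan specification. */
struct draw_indirect_command {
   uint32_t vertex_count;
   uint32_t instance_count;
   uint32_t first_vertex;
   uint32_t first_instance;
};

/* Hypothetical stand-in for a pre-reserved 3DPRIMITIVE slot; the real
 * instruction encoding is not reproduced here. */
struct draw_slot {
   uint32_t valid;
   struct draw_indirect_command params;
};

/* CPU model of the generation step: one iteration per draw copies the
 * indirect parameters into the matching empty slot of the main batch. */
static void
generate_draws(const struct draw_indirect_command *indirect,
               struct draw_slot *slots, uint32_t draw_count)
{
   assert(indirect && slots);

   for (uint32_t i = 0; i < draw_count; i++) {
      slots[i].params = indirect[i];
      slots[i].valid = 1;
   }
}
```
| |
| Once every slot is filled, the command streamer only has to execute |
| the slots in order, with no register loads in between. |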
| |
| By generating instructions into a ring buffer of commands, when the |
| draw count is high. This solution allows smaller batches to be |
| emitted. Here is a high level diagram showing how things are |
| executed: |
| |
| .. graphviz:: |
| |
| digraph ring_mode { |
| rankdir=LR; |
| "main-command-buffer" [ |
| label = "main command buffer|...| draw indirect |<f1>generation shader|<f2> jump to ring|<f3> increment\ndraw_base|<f4>..." |
| shape = "record" |
| ]; |
| "ring-buffer" [ |
| label = "ring buffer|<f0>generated draw0|<f1>generated draw1|<f2>generated draw2|...|<f3>exit jump" |
| shape = "record" |
| ]; |
| "main-command-buffer":f2 -> "ring-buffer":f0; |
| "ring-buffer":f3 -> "main-command-buffer":f3; |
| "ring-buffer":f3 -> "main-command-buffer":f4; |
| "main-command-buffer":f3 -> "main-command-buffer":f1; |
| "main-command-buffer":f1 -> "ring-buffer":f1 [color="#0000ff"]; |
| "main-command-buffer":f1 -> "ring-buffer":f2 [color="#0000ff"]; |
| } |
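| |
| The loop in the ring mode diagram can be modeled on the CPU as shown |
| below. This is a simplified sketch: ``ring_stats`` and |
| ``run_ring_mode`` are hypothetical names, ``ring_size`` is an assumed |
| parameter for the number of draws the ring can hold, and the actual |
| generation and execution are reduced to counters. |
| |
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters recording what the loop did. */
struct ring_stats {
   uint32_t passes;         /* number of jumps into the ring buffer */
   uint32_t draws_executed; /* total generated draws executed */
};

/* CPU model of ring mode: generate at most ring_size draws, jump to the
 * ring to execute them, increment draw_base, and loop until done. */
static struct ring_stats
run_ring_mode(uint32_t draw_count, uint32_t ring_size)
{
   struct ring_stats stats = { 0, 0 };
   assert(ring_size > 0);

   for (uint32_t draw_base = 0; draw_base < draw_count; ) {
      uint32_t remaining = draw_count - draw_base;
      uint32_t batch = remaining < ring_size ? remaining : ring_size;

      /* The generation shader fills ring slots [0, batch); the ring then
       * runs them and jumps back to the main command buffer. */
      stats.passes++;
      stats.draws_executed += batch;

      draw_base += batch;
   }
   return stats;
}
```
| |
| For example, 10 draws with an assumed ring size of 4 take three passes |
| through the ring (4, 4, then 2 draws), while the main command buffer |
| only needs a fixed number of instructions regardless of the draw count. |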
| |
| Runtime dependencies |
| -------------------- |
| |
| Starting with Intel 12th generation/Alder Lake-P and Intel Arc Alchemist, the Intel 3D driver stack requires GuC firmware for proper operation. You have two options to install the firmware: |
| |
| - Distro package: Install the pre-packaged firmware included in your Linux distribution's repositories. |
| - Manual download: You can download the firmware from the official repository: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915. Place the downloaded files in the ``/lib/firmware/i915`` directory. |
| |
| Important: For optimal performance, we recommend updating the GuC firmware to version 70.6.3 or later. |