| .. SPDX-License-Identifier: GPL-2.0-only |
| |
| .. include:: <isonum.txt> |
| |
| ========= |
| AMD NPU |
| ========= |
| |
| :Copyright: |copy| 2024 Advanced Micro Devices, Inc. |
| :Author: Sonal Santan <sonal.santan@amd.com> |
| |
| Overview |
| ======== |
| |
AMD NPU (Neural Processing Unit) is a multi-user AI inference accelerator
integrated into AMD client APUs. The NPU enables efficient execution of
Machine Learning applications such as CNNs and LLMs. The NPU is based on the
`AMD XDNA Architecture`_ and is managed by the **amdxdna** driver.
| |
| |
| Hardware Description |
| ==================== |
| |
| AMD NPU consists of the following hardware components: |
| |
| AMD XDNA Array |
| -------------- |
| |
The AMD XDNA Array comprises a 2D array of compute and memory tiles built with
`AMD AI Engine Technology`_. Each column has four rows of compute tiles and
one row of memory tiles. Each compute tile contains a VLIW processor with its
own dedicated program and data memory. The memory tile acts as L2 memory. The
2D array can be partitioned at a column boundary, creating a spatially
isolated partition which can be bound to a workload context.
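
For illustration, a spatial partition can be modeled as a contiguous range of
columns plus its bound PASID. The structure and helper below are a minimal
sketch using hypothetical names, not the driver's actual types:

.. code-block:: c

   #include <linux/types.h>

   /*
    * Hypothetical descriptor for a spatial partition: a contiguous
    * range of columns carved out of the 2D array at a column boundary.
    */
   struct xdna_partition {
           u32 start_col; /* first column of the partition */
           u32 num_cols;  /* width of the partition in columns */
           u32 pasid;     /* PASID bound to this partition */
   };

   /*
    * Two partitions are spatially isolated iff their column ranges
    * do not overlap.
    */
   static bool xdna_partitions_disjoint(const struct xdna_partition *a,
                                        const struct xdna_partition *b)
   {
           return a->start_col + a->num_cols <= b->start_col ||
                  b->start_col + b->num_cols <= a->start_col;
   }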
| |
| Each column also has dedicated DMA engines to move data between host DDR and |
| memory tile. |
| |
AMD Phoenix and AMD Hawk Point client NPUs have a 4x5 topology, i.e., 4 rows of
compute tiles arranged into 5 columns. The AMD Strix Point client APU has a 4x8
topology, i.e., 4 rows of compute tiles arranged into 8 columns.
| |
| Shared L2 Memory |
| ---------------- |
| |
The single row of memory tiles creates a pool of software-managed on-chip L2
memory. DMA engines are used to move data between host DDR and the memory
tiles. AMD Phoenix and AMD Hawk Point NPUs have a total of 2560 KB of L2
memory, and the AMD Strix Point NPU has a total of 4096 KB; in both cases this
works out to 512 KB per column (2560 KB / 5 columns, 4096 KB / 8 columns).
| |
| Microcontroller |
| --------------- |
| |
A microcontroller runs the NPU Firmware, which is responsible for command
processing, XDNA Array partition setup, XDNA Array configuration, workload
context management, and workload orchestration.
| |
The NPU Firmware uses a dedicated instance of an isolated, non-privileged
context called ERT to service each workload context. ERT is also used to
execute the user-provided ``ctrlcode`` associated with the workload context.
| |
The NPU Firmware uses a single isolated, privileged context called MERT to
service management commands from the amdxdna driver.
| |
| Mailboxes |
| --------- |
| |
The microcontroller and the amdxdna driver use a privileged channel for
management tasks like setting up contexts, telemetry, queries, error handling,
setting up user channels, etc. As mentioned before, privileged channel
requests are serviced by MERT. The privileged channel is bound to a single
mailbox.
| |
The microcontroller and the amdxdna driver use a dedicated user channel per
workload context. The user channel is primarily used for submitting work to
the NPU. As mentioned before, user channel requests are serviced by an
instance of ERT. Each user channel is bound to its own dedicated mailbox.
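
As a rough sketch of how the driver might post a message on one of these
mailbox channels, assuming an illustrative ring layout in the SRAM BAR and a
tail register in the Mailbox BAR (names and layout are assumptions, not the
actual hardware interface):

.. code-block:: c

   #include <linux/io.h>
   #include <linux/types.h>

   /* Illustrative mailbox channel state; the real layout differs. */
   struct mb_chan {
           void __iomem *ring;     /* ring buffer within the SRAM BAR */
           void __iomem *tail_reg; /* tail register within the Mailbox BAR */
           u32 ring_size;
           u32 tail;               /* producer's cached tail offset */
   };

   /* Post one message; for simplicity, assumes it does not wrap the ring. */
   static void mb_post(struct mb_chan *ch, const void *msg, u32 len)
   {
           memcpy_toio(ch->ring + ch->tail, msg, len);
           ch->tail = (ch->tail + len) % ch->ring_size;
           /* Make the payload visible before publishing the new tail. */
           wmb();
           writel(ch->tail, ch->tail_reg);
   }

A consumer on the firmware side would similarly track a head offset; the head,
tail, and ISR registers mentioned below are exposed through the Mailbox BAR.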
| |
| PCIe EP |
| ------- |
| |
The NPU is visible to the x86 host CPU as a PCIe device with multiple BARs and
a number of MSI-X interrupt vectors. The NPU uses a dedicated high-bandwidth
SoC-level fabric for reading from and writing to host memory. Each instance of
ERT gets its own dedicated MSI-X interrupt; MERT gets a single MSI-X interrupt.
| |
| The number of PCIe BARs varies depending on the specific device. Based on their |
| functions, PCIe BARs can generally be categorized into the following types. |
| |
| * PSP BAR: Expose the AMD PSP (Platform Security Processor) function |
| * SMU BAR: Expose the AMD SMU (System Management Unit) function |
| * SRAM BAR: Expose ring buffers for the mailbox |
| * Mailbox BAR: Expose the mailbox control registers (head, tail and ISR |
| registers etc.) |
| * Public Register BAR: Expose public registers |
| |
On specific devices, several of the above BAR types might be combined into a
single physical PCIe BAR, or a module might require two physical PCIe BARs to
be fully functional. For example (see the mapping sketch after this list):
| |
* On the AMD Phoenix device, the PSP, SMU, and Public Register BARs are all on
  PCIe BAR index 0.
* On the AMD Strix Point device, the Mailbox and Public Register BARs are on
  PCIe BAR index 0, while the PSP has some registers in PCIe BAR index 0
  (Public Register BAR) and some in PCIe BAR index 4 (PSP BAR).
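
One hedged way to model the combined-BAR examples above in driver code is to
resolve each logical BAR to a (physical BAR index, offset) pair per device.
The names below are illustrative, not the driver's actual tables:

.. code-block:: c

   #include <linux/pci.h>

   /*
    * Illustrative mapping: on some devices several logical BARs share
    * one physical PCIe BAR at different offsets (e.g. on Phoenix the
    * PSP, SMU, and public registers all live in PCIe BAR index 0).
    */
   struct bar_map {
           int pci_bar;       /* physical PCIe BAR index */
           unsigned long off; /* offset of the block within that BAR */
   };

   static void __iomem *xdna_map_block(struct pci_dev *pdev,
                                       const struct bar_map *m,
                                       unsigned long size)
   {
           return pci_iomap_range(pdev, m->pci_bar, m->off, size);
   }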
| |
| Process Isolation Hardware |
| -------------------------- |
| |
As explained before, the XDNA Array can be dynamically divided into isolated
spatial partitions, each of which may have one or more columns. A spatial
partition is set up by the microcontroller programming the column isolation
registers. Each spatial partition is associated with a PASID, which is also
programmed by the microcontroller. Hence, multiple spatial partitions in the
NPU can access host memory concurrently, with each access protected by its
PASID.
| |
The NPU FW itself uses isolated contexts, enforced by the microcontroller's
MMU, for servicing user and privileged channel requests.
| |
| |
| Mixed Spatial and Temporal Scheduling |
| ===================================== |
| |
The AMD XDNA architecture supports mixed spatial and temporal (time-sharing)
scheduling of the 2D array. This means that spatial partitions may be set up
and torn down dynamically to accommodate various workloads. A *spatial*
partition may be *exclusively* bound to one workload context, while another
partition may be *temporally* bound to more than one workload context. The
microcontroller updates the PASID of a temporally shared partition to match
the context bound to the partition at any given moment.
| |
| Resource Solver |
| --------------- |
| |
The Resource Solver component of the amdxdna driver manages the allocation of
the 2D array among various workloads. Every workload describes in its metadata
the number of columns required to run its NPU binary. The Resource Solver uses
hints passed by the workload and its own heuristics to decide the 2D array
(re)partition strategy and the mapping of workloads for spatial and temporal
sharing of columns. The FW enforces the context-to-column(s) resource binding
decisions made by the Resource Solver.
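
A minimal sketch of one possible placement heuristic, first-fit over a bitmap
of columns, in the spirit of (but not identical to) the Resource Solver:

.. code-block:: c

   #include <linux/bitmap.h>
   #include <linux/errno.h>

   /*
    * Illustrative first-fit: find ncols contiguous free columns out of
    * total_cols and claim them. The real Resource Solver also weighs
    * workload hints and may time-share columns instead of failing.
    */
   static int solver_place(unsigned long *col_map, unsigned long total_cols,
                           unsigned int ncols)
   {
           unsigned long start;

           start = bitmap_find_next_zero_area(col_map, total_cols, 0,
                                              ncols, 0);
           if (start >= total_cols)
                   return -ENOSPC;

           bitmap_set(col_map, start, ncols);
           return start;
   }

Freeing a context's columns would be the matching ``bitmap_clear()`` call;
repartitioning amounts to clearing and re-placing the affected contexts.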
| |
AMD Phoenix and AMD Hawk Point client NPUs can support 6 concurrent workload
contexts. The AMD Strix Point NPU can support 16 concurrent workload contexts.
| |
| |
| Application Binaries |
| ==================== |
| |
An NPU application workload comprises two separate binaries, both generated by
the NPU compiler.
| |
1. AMD XDNA Array overlay, which is used to configure an NPU spatial partition.
   The overlay contains instructions for setting up the stream switch
   configuration and the ELFs for the compute tiles. The overlay is loaded on the
| spatial partition bound to the workload by the associated ERT instance. |
| Refer to the |
| `Versal Adaptive SoC AIE-ML Architecture Manual (AM020)`_ for more details. |
| |
| 2. ``ctrlcode``, used for orchestrating the overlay loaded on the spatial |
| partition. ``ctrlcode`` is executed by the ERT running in protected mode on |
| the microcontroller in the context of the workload. ``ctrlcode`` is made up |
| of a sequence of opcodes named ``XAie_TxnOpcode``. Refer to the |
| `AI Engine Run Time`_ for more details. |
| |
| |
| Special Host Buffers |
| ==================== |
| |
| Per-context Instruction Buffer |
| ------------------------------ |
| |
Every workload context uses a host-resident 64 MB buffer which is memory
mapped into the ERT instance created to service the workload. The ``ctrlcode``
used by the workload is copied into this special memory. This buffer is
protected by PASID, like all other input/output buffers used by that workload.
The instruction buffer is also mapped into the user space of the workload.
| |
| Global Privileged Buffer |
| ------------------------ |
| |
The driver also allocates a single global buffer for maintenance tasks like
recording errors reported by MERT. This buffer uses the global IOMMU domain
and is accessible only by MERT.
| |
| |
| High-level Use Flow |
| =================== |
| |
Here are the steps to run a workload on AMD NPU (a hedged userspace sketch
follows the list):
| |
| 1. Compile the workload into an overlay and a ``ctrlcode`` binary. |
| 2. Userspace opens a context in the driver and provides the overlay. |
3. The driver consults the Resource Solver to provision a set of columns for
   the workload.
| 4. The driver then asks MERT to create a context on the device with the desired |
| columns. |
| 5. MERT then creates an instance of ERT. MERT also maps the Instruction Buffer |
| into ERT memory. |
| 6. The userspace then copies the ``ctrlcode`` to the Instruction Buffer. |
7. Userspace then creates a command buffer with pointers to the input, output,
   and instruction buffers; it then submits the command buffer to the driver
   and goes to sleep waiting for completion.
| 8. The driver sends the command over the Mailbox to ERT. |
| 9. ERT *executes* the ``ctrlcode`` in the instruction buffer. |
10. Execution of the ``ctrlcode`` kicks off DMAs to and from host DDR while
    the AMD XDNA Array is running.
11. When ERT reaches the end of the ``ctrlcode``, it raises an MSI-X interrupt
    to signal completion to the driver, which then wakes up the waiting
    workload.
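
To make the flow concrete, here is a hedged userspace sketch of steps 2-8.
All ioctl numbers and structures below are invented placeholders, not the
driver's real uAPI; real applications go through the XRT shim listed under
Userspace components rather than raw ioctls:

.. code-block:: c

   #include <fcntl.h>
   #include <string.h>
   #include <sys/ioctl.h>
   #include <sys/mman.h>
   #include <unistd.h>

   /* Hypothetical uAPI stand-ins, so the sketch is self-contained. */
   struct hypo_create_ctx {
           unsigned int num_cols;     /* columns requested for the partition */
           unsigned int handle;       /* context handle returned by the driver */
           unsigned long ibuf_offset; /* mmap offset of the instruction buffer */
   };
   struct hypo_exec {
           unsigned int handle; /* plus pointers to input/output buffers
                                 * in a real command buffer */
   };
   #define HYPO_CREATE_CTX _IOWR('x', 1, struct hypo_create_ctx)
   #define HYPO_EXEC       _IOW('x', 2, struct hypo_exec)
   #define HYPO_WAIT       _IOW('x', 3, struct hypo_exec)

   static int run_workload(const void *ctrlcode, size_t len)
   {
           /* Step 2: open a context; the driver consults the Resource
            * Solver (step 3) and asks MERT to create the device context
            * and the ERT instance (steps 4-5). */
           int fd = open("/dev/accel/accel0", O_RDWR);
           struct hypo_create_ctx ctx = { .num_cols = 4 };

           if (fd < 0 || ioctl(fd, HYPO_CREATE_CTX, &ctx))
                   return -1;

           /* Step 6: map the per-context instruction buffer and copy
            * the ctrlcode into it. */
           void *ibuf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                             fd, ctx.ibuf_offset);
           if (ibuf == MAP_FAILED)
                   return -1;
           memcpy(ibuf, ctrlcode, len);

           /* Steps 7-8: submit the command and sleep until ERT signals
            * completion via MSI-X (steps 9-11). */
           struct hypo_exec cmd = { .handle = ctx.handle };
           if (ioctl(fd, HYPO_EXEC, &cmd) || ioctl(fd, HYPO_WAIT, &cmd))
                   return -1;

           munmap(ibuf, len);
           close(fd);
           return 0;
   }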
| |
| |
| Boot Flow |
| ========= |
| |
The amdxdna driver uses the PSP to securely load the signed NPU FW and kick
off the boot of the NPU microcontroller. The driver then waits for the alive
signal at a special location in BAR 0. The NPU is switched off during SoC
suspend and turned on after resume, at which point the NPU FW is reloaded and
the handshake is performed again.
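
A minimal kernel-side sketch of that handshake wait, assuming a hypothetical
offset and magic value for the alive word in BAR 0 (the real location and
encoding are device and firmware specific):

.. code-block:: c

   #include <linux/iopoll.h>
   #include <linux/types.h>

   #define NPU_ALIVE_OFF   0x0        /* hypothetical offset within BAR 0 */
   #define NPU_ALIVE_MAGIC 0xd00dfeed /* hypothetical value set by NPU FW */

   /* Poll the alive word every 100 us, giving up after one second. */
   static int npu_wait_alive(void __iomem *bar0)
   {
           u32 val;

           return readl_poll_timeout(bar0 + NPU_ALIVE_OFF, val,
                                     val == NPU_ALIVE_MAGIC,
                                     100, 1000000);
   }

The same wait would run again on resume, after the FW has been reloaded.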
| |
| |
| Userspace components |
| ==================== |
| |
| Compiler |
| -------- |
| |
Peano is an LLVM-based open-source compiler for the AMD XDNA Array compute
tiles, available at:
| https://github.com/Xilinx/llvm-aie |
| |
| The open-source IREE compiler supports graph compilation of ML models for AMD |
| NPU and uses Peano underneath. It is available at: |
| https://github.com/nod-ai/iree-amd-aie |
| |
| Usermode Driver (UMD) |
| --------------------- |
| |
The open-source XRT runtime stack interfaces with the amdxdna kernel driver.
XRT can be found at:
| https://github.com/Xilinx/XRT |
| |
The open-source XRT shim for the NPU can be found at:
| https://github.com/amd/xdna-driver |
| |
| |
| DMA Operation |
| ============= |
| |
DMA operation instructions are encoded in the ``ctrlcode`` with the
``XAIE_IO_BLOCKWRITE`` opcode. When ERT executes ``XAIE_IO_BLOCKWRITE``, it
triggers DMA operations between host DDR and L2 memory.
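
A hedged sketch of walking such an opcode stream (the header layout and the
opcode value below are illustrative assumptions; the authoritative
definitions live in the `AI Engine Run Time`_):

.. code-block:: c

   #include <linux/types.h>

   #define XAIE_IO_BLOCKWRITE 0 /* placeholder; real value in XAie_TxnOpcode */

   /* Illustrative per-op header; the real layout is defined by aie-rt. */
   struct txn_op_hdr {
           u32 opcode; /* an XAie_TxnOpcode value */
           u32 size;   /* total size of this op, header included */
   };

   static void walk_ctrlcode(const void *buf, size_t len)
   {
           const u8 *p = buf, *end = p + len;

           while (p + sizeof(struct txn_op_hdr) <= end) {
                   const struct txn_op_hdr *op = (const void *)p;

                   if (op->size < sizeof(*op))
                           break; /* malformed stream */
                   if (op->opcode == XAIE_IO_BLOCKWRITE)
                           ; /* payload programs a host DDR <-> L2 DMA */
                   p += op->size;
           }
   }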
| |
| |
| Error Handling |
| ============== |
| |
When MERT detects an error in the AMD XDNA Array, it pauses execution of the
faulting workload context and sends an asynchronous message to the driver over
the privileged channel. The driver then sends MERT a pointer to a buffer into
which MERT captures the register state of the partition bound to the faulting
workload context. The driver then decodes the error by reading the contents
of that buffer.
| |
| |
| Telemetry |
| ========= |
| |
MERT can report various kinds of telemetry information, such as:

* L1 interrupt counter
* DMA counter
* Deep Sleep counter
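
A hypothetical layout for such a telemetry payload, with fields mirroring the
counters above (the real message format belongs to the NPU FW interface and is
not described here):

.. code-block:: c

   #include <linux/types.h>

   /*
    * Hypothetical telemetry record returned by MERT over the
    * privileged channel.
    */
   struct npu_telemetry {
           u64 l1_interrupts;      /* L1 interrupt counter */
           u64 dma_ops;            /* DMA counter */
           u64 deep_sleep_entries; /* Deep Sleep counter */
   };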
| |
| |
| References |
| ========== |
| |
| - `AMD XDNA Architecture <https://www.amd.com/en/technologies/xdna.html>`_ |
| - `AMD AI Engine Technology <https://www.xilinx.com/products/technology/ai-engine.html>`_ |
| - `Peano <https://github.com/Xilinx/llvm-aie>`_ |
| - `Versal Adaptive SoC AIE-ML Architecture Manual (AM020) <https://docs.amd.com/r/en-US/am020-versal-aie-ml>`_ |
| - `AI Engine Run Time <https://github.com/Xilinx/aie-rt/tree/release/main_aig>`_ |