 # ExecuTorch Vulkan Delegate

 The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is
 built on top of the cross-platform Vulkan GPU API standard. It is primarily
 designed to leverage the GPU to accelerate model inference on Android devices,
 but can be used on any platform that supports an implementation of Vulkan:
 laptops, servers, and edge devices.

 ::::{note}
 The Vulkan delegate is currently under active development, and its components
 are subject to change.
 ::::

 ## What is Vulkan?

 Vulkan is a low-level GPU API specification developed as a successor to OpenGL.
 It is designed to offer developers more explicit control over GPUs compared to
 previous specifications in order to reduce overhead and make fuller use of
 modern graphics hardware.

 Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both
 desktop and mobile) in the market support Vulkan. Vulkan is also included in
 Android from Android 7.0 onwards.

 **Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say it
 provides a way to execute compute and graphics operations on a GPU, but does not
 come with a built-in library of performant compute kernels.

 ## The Vulkan Compute Library

 The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as
 the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to
 provide GPU implementations for PyTorch operators via GLSL compute shaders.

 The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html).
 The core components of the PyTorch Vulkan backend were forked into ExecuTorch
 and adapted for an AOT graph-mode style of model inference (as opposed to
 PyTorch, which uses an eager execution style of model inference).

 The components of the Vulkan Compute Library are contained in the
 `executorch/backends/vulkan/runtime/` directory. The core components are listed
 and described below:

 ```
 runtime/
 ├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects
 └── graph/ .................. ComputeGraph class which implements graph mode inference
     └── ops/ ................ Base directory for operator implementations
         ├── glsl/ ........... GLSL compute shaders
         │   ├── *.glsl
         │   └── conv2d.glsl
         └── impl/ ........... C++ code to dispatch GPU compute shaders
             ├── *.cpp
             └── Conv2d.cpp
 ```

 ## Features

 The Vulkan delegate currently supports the following features:

 * **Memory Planning**
   * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference.
 * **Capability-Based Partitioning**
   * A graph can be partially lowered to the Vulkan delegate via a partitioner, which identifies the nodes (i.e. operators) that the Vulkan delegate supports and lowers only the supported subgraphs.
 * **Support for upper-bound dynamic shapes**
   * Tensors can change shape between inferences as long as their current shapes stay within the upper bounds specified during lowering.
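
The memory planning idea above can be illustrated with a small, self-contained sketch. This is not the delegate's actual planner: the `TensorLifetime` class, the `plan_shared_buffers` function, and the greedy reuse strategy are all hypothetical and exist only to show how tensors with non-overlapping lifetimes can share one allocation.

```python
from dataclasses import dataclass


@dataclass
class TensorLifetime:
    # Hypothetical record of an intermediate tensor's size and live range.
    name: str
    nbytes: int
    first_use: int  # index of the node that produces the tensor
    last_use: int   # index of the last node that reads the tensor


def plan_shared_buffers(tensors):
    """Greedily assign tensors to buffers, reusing a buffer once its occupant is dead."""
    buffers = []      # each entry: {"size": bytes, "free_after": node index}
    assignment = {}   # tensor name -> buffer index
    for t in sorted(tensors, key=lambda t: t.first_use):
        for idx, buf in enumerate(buffers):
            # A buffer is reusable only if its previous tensor is no longer live.
            if buf["free_after"] < t.first_use:
                buf["size"] = max(buf["size"], t.nbytes)
                buf["free_after"] = t.last_use
                assignment[t.name] = idx
                break
        else:
            # No free buffer was found, so allocate a new one.
            buffers.append({"size": t.nbytes, "free_after": t.last_use})
            assignment[t.name] = len(buffers) - 1
    return assignment, [b["size"] for b in buffers]


tensors = [
    TensorLifetime("a", 1024, first_use=0, last_use=1),
    TensorLifetime("b", 2048, first_use=1, last_use=2),
    TensorLifetime("c", 1024, first_use=2, last_use=3),
]
assignment, buffer_sizes = plan_shared_buffers(tensors)
unshared_bytes = sum(t.nbytes for t in tensors)
shared_bytes = sum(buffer_sizes)
print(assignment, buffer_sizes, unshared_bytes, shared_bytes)
```

In this toy case, tensors `a` and `c` never live at the same time, so they end up in the same buffer and the total allocation drops from 4096 to 3072 bytes.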

 In addition to increasing operator coverage, the following features are
 currently in development:

 * **Quantization Support**
   * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future.
 * **Memory Layout Management**
   * Memory layout is an important factor in optimizing performance. We plan to add graph passes that insert memory layout transitions throughout a graph to optimize memory-layout-sensitive operators such as Convolution and Matrix Multiplication.
 * **Selective Build**
   * We plan to make it possible to control build size by selecting which operators/shaders you want to build with.

 ## End to End Example

 To further understand the features of the Vulkan Delegate and how to use it,
 consider the following end-to-end example with a simple single-operator model.

 ### Compile and lower a model to the Vulkan Delegate

 Once ExecuTorch has been set up and installed, the following script can be used
 to generate a simple single-operator model, lower it to the Vulkan delegate, and
 save it as `vk_add.pte`.

 ```python
 # Note: this script is the same as the script from the "Setting up ExecuTorch"
 # page, with one minor addition to lower to the Vulkan backend.
 import torch
 from torch.export import export
 from executorch.exir import to_edge

 from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner

 # Start with a PyTorch model that adds two input tensors (matrices)
 class Add(torch.nn.Module):
     def __init__(self):
         super().__init__()

     def forward(self, x: torch.Tensor, y: torch.Tensor):
         return x + y

 # 1. torch.export: Defines the program with the ATen operator set.
 aten_dialect = export(Add(), (torch.ones(1), torch.ones(1)))

 # 2. to_edge: Make optimizations for Edge devices
 edge_program = to_edge(aten_dialect)
 # 2.1 Lower to the Vulkan backend
 edge_program = edge_program.to_backend(VulkanPartitioner())

 # 3. to_executorch: Convert the graph to an ExecuTorch program
 executorch_program = edge_program.to_executorch()

 # 4. Save the compiled .pte program
 with open("vk_add.pte", "wb") as file:
     file.write(executorch_program.buffer)
 ```

 Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate
 using the `to_backend()` API. The Vulkan Delegate implements the
 `VulkanPartitioner` class which identifies nodes (i.e. operators) in the graph
 that are supported by the Vulkan delegate, and separates compatible sections of
 the model to be executed on the GPU.

 This means that a model can be lowered to the Vulkan delegate even if it contains
 some unsupported operators. In that case, only the supported parts of the graph
 will be executed on the GPU.
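
As a rough illustration of capability-based partitioning, the sketch below splits a linear sequence of operator names into delegated and fallback segments. It is a simplification under stated assumptions: the `SUPPORTED_OPS` set and the `partition` function are hypothetical, and the real `VulkanPartitioner` works on an exported graph rather than a flat list.

```python
# Hypothetical set of supported operator names, for illustration only.
SUPPORTED_OPS = {"add", "sub", "mul"}


def partition(ops):
    """Group consecutive ops into (target, [ops]) segments.

    Runs of supported ops become one "vulkan" segment; everything else
    falls back to a "portable" segment.
    """
    segments = []
    for op in ops:
        target = "vulkan" if op in SUPPORTED_OPS else "portable"
        if segments and segments[-1][0] == target:
            # Extend the current segment while the target stays the same.
            segments[-1][1].append(op)
        else:
            segments.append((target, [op]))
    return segments


segments = partition(["add", "mul", "custom_op", "sub"])
for target, ops in segments:
    print(target, ops)
```

Here the unsupported `custom_op` splits the model into two delegated segments with one fallback segment between them, which mirrors how a partially lowered model runs only partly on the GPU.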


 ::::{note}
 The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/partitioner/supported_ops.py)
 in the Vulkan partitioner code can be inspected to see which ops are currently
 implemented in the Vulkan delegate.
 ::::

 ### Build Vulkan Delegate libraries

 The easiest way to build and test the Vulkan Delegate is to build for Android
 and test on a local Android device. Android devices have built-in support for
 Vulkan, and the Android NDK ships with a GLSL compiler which is needed to
 compile the Vulkan Compute Library's GLSL compute shaders.

 The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON`
 when building with CMake.

 First, make sure that you have the Android NDK installed; any NDK version past
 NDK r19c should work. Note that the examples in this doc have been validated with
 NDK r27b. The Android SDK should also be installed so that you have access to `adb`.

 The instructions on this page assume that the following environment variables
 are set.

 ```shell
 export ANDROID_NDK=<path_to_ndk>
 # Select the appropriate Android ABI for your device
 export ANDROID_ABI=arm64-v8a
 # All subsequent commands should be performed from ExecuTorch repo root
 cd <path_to_executorch_root>
 # Make sure adb works
 adb --version
 ```

 To build and install ExecuTorch libraries (for Android) with the Vulkan
 Delegate:

 ```shell
 # From executorch root directory
 (rm -rf cmake-android-out && \
   cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
     -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
     -DANDROID_ABI=$ANDROID_ABI \
     -DEXECUTORCH_BUILD_VULKAN=ON \
     -DPYTHON_EXECUTABLE=python \
     -Bcmake-android-out && \
   cmake --build cmake-android-out -j16 --target install)
 ```

 ### Run the Vulkan model on device

 ::::{note}
 Since operator support is currently limited, only binary arithmetic operators
 will run on the GPU. Expect inference to be slow as the majority of operators
 are being executed via Portable operators.
 ::::

 Now, the partially delegated model can be executed (partially) on your device's
 GPU!

 ```shell
 # Build a model runner binary linked with the Vulkan delegate libs
 cmake --build cmake-android-out --target vulkan_executor_runner -j32

 # Push model to device
 adb push vk_add.pte /data/local/tmp/vk_add.pte
 # Push binary to device
 adb push cmake-android-out/backends/vulkan/vulkan_executor_runner /data/local/tmp/runner_bin

 # Run the model
 adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte
 ```