| .. _libc_gpu_usage: |
| |
| =================== |
| Using libc for GPUs |
| =================== |
| |
| .. contents:: Table of Contents |
| :depth: 4 |
| :local: |
| |
| Using the GPU C library |
| ======================= |
| |
| Once you have finished :ref:`building<libc_gpu_building>` the GPU C library it |
| can be used to run libc or libm functions directly on the GPU. Currently, not |
| all C standard functions are supported on the GPU. Consult the :ref:`list of |
| supported functions<libc_gpu_support>` for a comprehensive list. |
| |
| The GPU C library supports two main usage modes. The first is as a supplementary |
| library for offloading languages such as OpenMP, CUDA, or HIP. These aim to |
| provide standard system utilities similarly to existing vendor libraries. The |
| second method treats the GPU as a hosted target by compiling C or C++ for it |
| directly. This is more similar to targeting OpenCL and is primarily used for |
| exported functions on the GPU and testing. |
| |
| Offloading usage |
| ---------------- |
| |
| Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source |
| file for both the host target and a list of offloading devices. In order to |
| support standard compilation flows, the ``clang`` driver uses fat binaries, |
| described in the `clang documentation |
| <https://clang.llvm.org/docs/OffloadingDesign.html>`_. This linking mode is used |
| by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP toolchains |
through the ``--offload-new-driver`` and ``-fgpu-rdc`` flags.
| |
In order to link the GPU runtime, we simply pass this library to the embedded
device linker job. This can be done using the ``-Xoffload-linker`` option, which
forwards an argument to the ``clang`` job used to create the final GPU
executable. In most cases the toolchain will pick up the C library
automatically, so passing it explicitly shouldn't be necessary.
| |
| .. code-block:: sh |
| |
| $> clang openmp.c -fopenmp --offload-arch=gfx90a -Xoffload-linker -lc |
| $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc |
| $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc |
| |
This will automatically link in the needed function definitions if they were
required by the user's application. Note that using the ``-fgpu-rdc`` option
alone normally results in sub-par performance, because device code is linked
through the standard ABI rather than optimized as a whole. However, the
offloading toolchain supports the ``-foffload-lto`` option to enable LTO on the
target device.
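
As an illustration, the CUDA invocation from above could enable device-side LTO
by adding this flag. The offload architecture here is only an example.

.. code-block:: sh

  $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -foffload-lto -Xoffload-linker -lc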
| |
| Offloading languages require that functions present on the device be declared as |
| such. This is done with the ``__device__`` keyword in CUDA and HIP or the |
``declare target`` pragma in OpenMP. This requires the LLVM C library to expose
its implemented functions to the compiler when building the application. We
| support this by providing wrapper headers in the compiler's resource directory. |
| These are located in ``<clang-resource-dir>/include/llvm-libc-wrappers`` in your |
| installation. |
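
To locate these headers on a given installation, you can query the compiler's
resource directory. This is only a convenience for inspection, not a required
step.

.. code-block:: sh

  $> ls "$(clang -print-resource-dir)/include/llvm-libc-wrappers"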
| |
| The support for HIP and CUDA is more experimental, requiring manual intervention |
| to link and use the facilities. An example of this is shown in the :ref:`CUDA |
server example<libc_gpu_cuda_server>`. The OpenMP Offloading toolchain, however,
is completely integrated with the LLVM C library. It will automatically handle
including the necessary libraries, defining the device-side interfaces, and
running the RPC server.
| |
| OpenMP Offloading example |
| ^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| This section provides a simple example of compiling an OpenMP program with the |
| GPU C library. |
| |
| .. code-block:: c++ |
| |
| #include <stdio.h> |
| |
| int main() { |
| FILE *file = stderr; |
| #pragma omp target teams num_teams(2) thread_limit(2) |
| #pragma omp parallel num_threads(2) |
| { fputs("Hello from OpenMP!\n", file); } |
| } |
| |
| This can simply be compiled like any other OpenMP application to print from two |
| threads and two blocks. |
| |
| .. code-block:: sh |
| |
| $> clang openmp.c -fopenmp --offload-arch=gfx90a |
| $> ./a.out |
| Hello from OpenMP! |
| Hello from OpenMP! |
| Hello from OpenMP! |
| Hello from OpenMP! |
| |
| Including the wrapper headers, linking the C library, and running the :ref:`RPC |
| server<libc_gpu_rpc>` are all handled automatically by the compiler and runtime. |
| |
| Binary format |
| ^^^^^^^^^^^^^ |
| |
| The ``libcgpu.a`` static archive is a fat-binary containing LLVM-IR for each |
| supported target device. The supported architectures can be seen using LLVM's |
| ``llvm-objdump`` with the ``--offloading`` flag: |
| |
| .. code-block:: sh |
| |
| $> llvm-objdump --offloading libcgpu-amdgpu.a |
| libcgpu-amdgpu.a(strcmp.cpp.o): file format elf64-x86-64 |
| |
| OFFLOADING IMAGE [0]: |
| kind llvm ir |
| arch generic |
| triple amdgcn-amd-amdhsa |
| producer none |
| ... |
| |
| Because the device code is stored inside a fat binary, it can be difficult to |
| inspect the resulting code. This can be done using the following utilities: |
| |
.. code-block:: sh

  $> llvm-ar x libcgpu.a strcmp.cpp.o
  $> clang-offload-packager strcmp.cpp.o --image=arch=generic,file=strcmp.bc
  $> opt -S strcmp.bc
  ...
| |
| Please note that this fat binary format is provided for compatibility with |
| existing offloading toolchains. The implementation in ``libc`` does not depend |
| on any existing offloading languages and is completely freestanding. |
| |
| Direct compilation |
| ------------------ |
| |
Instead of using standard offloading languages, we can also target the GPU
directly using C and C++ to create a GPU executable, similarly to OpenCL. This is
| done by targeting the GPU architecture using `clang's cross compilation |
| support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the |
| method that the GPU C library uses both to build the library and to run tests. |
| |
This allows us to easily define GPU-specific libraries and programs that fit
well into existing tools. In order to target the GPU effectively we rely heavily
on the compiler's intrinsic and built-in functions. For example, the following
function gets the thread identifier in the 'x' dimension on both supported GPU
targets.
| |
.. code-block:: c++

  #include <stdint.h>

  // Return the thread identifier in the 'x' dimension using the
  // target-specific compiler built-in.
  uint32_t get_thread_id_x() {
  #if defined(__AMDGPU__)
    return __builtin_amdgcn_workitem_id_x();
  #elif defined(__NVPTX__)
    return __nvvm_read_ptx_sreg_tid_x();
  #else
  #error "Unsupported platform"
  #endif
  }
| |
| We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the |
following commands. This will yield valid LLVM-IR for the given target, just as
if we were using CUDA, OpenCL, or OpenMP.
| |
| .. code-block:: sh |
| |
| $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c |
| $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c |
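
Because ``-flto -c`` emits LLVM bitcode rather than a machine-code object, the
resulting file can be inspected directly with ``llvm-dis`` as an optional sanity
check.

.. code-block:: sh

  $> llvm-dis id.o -o -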
| |
| We can also use this support to treat the GPU as a hosted environment by |
| providing a C library and startup object just like a standard C library running |
| on the host machine. Then, in order to execute these programs, we provide a |
loader utility to launch the executable on the GPU, similarly to a
cross-compiling emulator. This is how we run :ref:`unit tests <libc_gpu_testing>`
targeting the
| GPU. This is clearly not the most efficient way to use a GPU, but it provides a |
| simple method to test execution on a GPU for debugging or development. |
| |
| Building for AMDGPU targets |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The AMDGPU target supports several features natively by virtue of using ``lld`` |
| as its linker. The installation will include the ``include/amdgcn-amd-amdhsa`` |
and ``lib/amdgcn-amd-amdhsa`` directories that contain the necessary code to use
| the library. We can directly link against ``libc.a`` and use LTO to generate the |
| final executable. |
| |
| .. code-block:: c++ |
| |
| #include <stdio.h> |
| |
| int main() { fputs("Hello from AMDGPU!\n", stdout); } |
| |
This program can then be compiled using the ``clang`` compiler. Note that
``-flto`` and ``-mcpu=`` should be specified. This is because the GPU
sub-architectures do not have strict backwards compatibility. Use ``-mcpu=help``
| for accepted arguments or ``-mcpu=native`` to target the system's installed GPUs |
| if present. Additionally, the AMDGPU target always uses ``-flto`` because we |
| currently do not fully support ELF linking in ``lld``. Once built, we use the |
| ``amdhsa-loader`` utility to launch execution on the GPU. This will be built if |
| the ``hsa_runtime64`` library was found during build time. |
| |
| .. code-block:: sh |
| |
| $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o |
| $> amdhsa-loader --threads 2 --blocks 2 a.out |
| Hello from AMDGPU! |
| Hello from AMDGPU! |
| Hello from AMDGPU! |
| Hello from AMDGPU! |
| |
| This will include the ``stdio.h`` header, which is found in the |
``include/amdgcn-amd-amdhsa`` directory. We define our ``main`` function like a
| standard application. The startup utility in ``lib/amdgcn-amd-amdhsa/crt1.o`` |
| will handle the necessary steps to execute the ``main`` function along with |
| global initializers and command line arguments. Finally, we link in the |
| ``libc.a`` library stored in ``lib/amdgcn-amd-amdhsa`` to define the standard C |
| functions. |
| |
| The search paths for the include directories and libraries are automatically |
| handled by the compiler. We use this support internally to run unit tests on the |
| GPU directly. See :ref:`libc_gpu_testing` for more information. The installation |
| also provides ``libc.bc`` which is a single LLVM-IR bitcode blob that can be |
| used instead of the static library. |
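
The following is only a sketch of how that bitcode could be linked in place of
``-lc``, assuming the ``-x ir`` pattern from the NVPTX example below also
applies to the AMDGPU target.

.. code-block:: sh

  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto \
       -x ir <install>/lib/amdgcn-amd-amdhsa/libc.bc \
       <install>/lib/amdgcn-amd-amdhsa/crt1.o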
| |
| Building for NVPTX targets |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The infrastructure is the same as the AMDGPU example. However, the NVPTX binary |
utilities are very limited and must be targeted directly. There is no linker
support for static libraries, so we need to link in the ``libc.bc`` bitcode and
| inform the compiler driver of the file's contents. |
| |
| .. code-block:: c++ |
| |
| #include <stdio.h> |
| |
| int main(int argc, char **argv, char **envp) { |
| fputs("Hello from NVPTX!\n", stdout); |
| } |
| |
Additionally, the NVPTX ABI requires that every function's signature matches its
call sites exactly, so we must declare ``main`` with its full prototype as shown
above. The installation will
| contain the ``nvptx-loader`` utility if the CUDA driver was found during |
| compilation. |
| |
| .. code-block:: sh |
| |
| $> clang hello.c --target=nvptx64-nvidia-cuda -march=native \ |
| -x ir <install>/lib/nvptx64-nvidia-cuda/libc.bc \ |
| -x ir <install>/lib/nvptx64-nvidia-cuda/crt1.o |
| $> nvptx-loader --threads 2 --blocks 2 a.out |
| Hello from NVPTX! |
| Hello from NVPTX! |
| Hello from NVPTX! |
| Hello from NVPTX! |