| .. _libc_gpu_usage: |
| |
| =================== |
| Using libc for GPUs |
| =================== |
| |
| .. contents:: Table of Contents |
| :depth: 4 |
| :local: |
| |
| Using the GPU C library |
| ======================= |
| |
| Once you have finished :ref:`building<libc_gpu_building>` the GPU C library it |
| can be used to run libc or libm functions directly on the GPU. Currently, not |
| all C standard functions are supported on the GPU. Consult the :ref:`list of |
| supported functions<libc_gpu_support>` for a comprehensive list. |
| |
| The GPU C library supports two main usage modes. The first is as a supplementary |
| library for offloading languages such as OpenMP, CUDA, or HIP. These aim to |
| provide standard system utilities similarly to existing vendor libraries. The |
| second method treats the GPU as a hosted target by compiling C or C++ for it |
| directly. This is more similar to targeting OpenCL and is primarily used for |
| exported functions on the GPU and testing. |
| |
| Offloading usage |
| ---------------- |
| |
| Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source |
| file for both the host target and a list of offloading devices. In order to |
| support standard compilation flows, the ``clang`` driver uses fat binaries, |
| described in the `clang documentation |
| <https://clang.llvm.org/docs/OffloadingDesign.html>`_. This linking mode is used |
| by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP toolchains |
through the ``--offload-new-driver`` and ``-fgpu-rdc`` flags.
| |
In order to link the GPU runtime, we simply pass this library to the embedded
device linker job. This can be done using the ``-Xoffload-linker`` option, which
forwards an argument to the ``clang`` job used to create the final GPU
executable. In most cases the toolchain will pick up the C library
automatically, so passing it explicitly shouldn't be necessary.
| |
| .. code-block:: sh |
| |
| $> clang openmp.c -fopenmp --offload-arch=gfx90a -Xoffload-linker -lc |
| $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc |
| $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc |
| |
This will automatically link in the needed function definitions if they were
required by the user's application. Note that using the ``-fgpu-rdc`` option
alone normally results in sub-par performance, because device code is linked
through the standard ABI rather than optimized as a whole. However, the
offloading toolchain supports the ``-foffload-lto`` option to enable LTO on the
target device.
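
As an illustration, the CUDA invocation from above could enable device-side LTO
by adding this flag. The offload architecture here is only an example.

.. code-block:: sh

  $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -foffload-lto -Xoffload-linker -lc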
| |
| Offloading languages require that functions present on the device be declared as |
| such. This is done with the ``__device__`` keyword in CUDA and HIP or the |
``declare target`` pragma in OpenMP. This requires the LLVM C library to expose
its implemented functions to the compiler when building the application. We
| support this by providing wrapper headers in the compiler's resource directory. |
| These are located in ``<clang-resource-dir>/include/llvm-libc-wrappers`` in your |
| installation. |
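
To locate these headers on a given installation, you can query the compiler's
resource directory. This is only a convenience for inspection, not a required
step.

.. code-block:: sh

  $> ls "$(clang -print-resource-dir)/include/llvm-libc-wrappers"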
| |
| The support for HIP and CUDA is more experimental, requiring manual intervention |
| to link and use the facilities. An example of this is shown in the :ref:`CUDA |
server example<libc_gpu_cuda_server>`. The OpenMP Offloading toolchain, however,
is completely integrated with the LLVM C library. It will automatically handle
including the necessary libraries, defining the device-side interfaces, and
running the RPC server.
| |
| OpenMP Offloading example |
| ^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| This section provides a simple example of compiling an OpenMP program with the |
| GPU C library. |
| |
| .. code-block:: c++ |
| |
| #include <stdio.h> |
| |
| int main() { |
| FILE *file = stderr; |
| #pragma omp target teams num_teams(2) thread_limit(2) |
| #pragma omp parallel num_threads(2) |
| { fputs("Hello from OpenMP!\n", file); } |
| } |
| |
| This can simply be compiled like any other OpenMP application to print from two |
| threads and two blocks. |
| |
| .. code-block:: sh |
| |
| $> clang openmp.c -fopenmp --offload-arch=gfx90a |
| $> ./a.out |
| Hello from OpenMP! |
| Hello from OpenMP! |
| Hello from OpenMP! |
| Hello from OpenMP! |
| |
| Including the wrapper headers, linking the C library, and running the :ref:`RPC |
| server<libc_gpu_rpc>` are all handled automatically by the compiler and runtime. |
| |
| Binary format |
| ^^^^^^^^^^^^^ |
| |
| The ``libcgpu.a`` static archive is a fat-binary containing LLVM-IR for each |
| supported target device. The supported architectures can be seen using LLVM's |
| ``llvm-objdump`` with the ``--offloading`` flag: |
| |
| .. code-block:: sh |
| |
| $> llvm-objdump --offloading libcgpu-amdgpu.a |
| libcgpu-amdgpu.a(strcmp.cpp.o): file format elf64-x86-64 |
| |
| OFFLOADING IMAGE [0]: |
| kind llvm ir |
| arch generic |
| triple amdgcn-amd-amdhsa |
| producer none |
| ... |
| |
| Because the device code is stored inside a fat binary, it can be difficult to |
| inspect the resulting code. This can be done using the following utilities: |
| |
.. code-block:: sh

  $> llvm-ar x libcgpu.a strcmp.cpp.o
  $> clang-offload-packager strcmp.cpp.o --image=arch=generic,file=strcmp.bc
  $> opt -S strcmp.bc
  ...
| |
| Please note that this fat binary format is provided for compatibility with |
| existing offloading toolchains. The implementation in ``libc`` does not depend |
| on any existing offloading languages and is completely freestanding. |
| |
| Direct compilation |
| ------------------ |
| |
Instead of using standard offloading languages, we can also target the GPU
directly using C and C++ to create a GPU executable, similarly to OpenCL. This is
| done by targeting the GPU architecture using `clang's cross compilation |
| support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the |
| method that the GPU C library uses both to build the library and to run tests. |
| |
This allows us to easily define GPU-specific libraries and programs that fit
well into existing tools. In order to target the GPU effectively we rely heavily
on the compiler's intrinsic and built-in functions. For example, the following
function gets the thread identifier in the 'x' dimension on both supported GPU
targets.
| |
.. code-block:: c++

  #include <stdint.h>

  // Return the thread identifier in the 'x' dimension using the
  // target-specific compiler built-in.
  uint32_t get_thread_id_x() {
  #if defined(__AMDGPU__)
    return __builtin_amdgcn_workitem_id_x();
  #elif defined(__NVPTX__)
    return __nvvm_read_ptx_sreg_tid_x();
  #else
  #error "Unsupported platform"
  #endif
  }
| |
| We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the |
following commands. This will yield valid LLVM-IR for the given target, just as
if we were using CUDA, OpenCL, or OpenMP.
| |
| .. code-block:: sh |
| |
| $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c |
| $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c |
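
Because ``-flto -c`` emits LLVM bitcode rather than a machine-code object, the
resulting file can be inspected directly with ``llvm-dis`` as an optional sanity
check.

.. code-block:: sh

  $> llvm-dis id.o -o -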
| |
| We can also use this support to treat the GPU as a hosted environment by |
| providing a C library and startup object just like a standard C library running |
| on the host machine. Then, in order to execute these programs, we provide a |
loader utility to launch the executable on the GPU, similarly to a
cross-compiling emulator. This is how we run :ref:`unit tests <libc_gpu_testing>`
targeting the
| GPU. This is clearly not the most efficient way to use a GPU, but it provides a |
| simple method to test execution on a GPU for debugging or development. |
| |
| Building for AMDGPU targets |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The AMDGPU target supports several features natively by virtue of using ``lld`` |
| as its linker. The installation will include the ``include/amdgcn-amd-amdhsa`` |
and ``lib/amdgcn-amd-amdhsa`` directories that contain the necessary code to use
| the library. We can directly link against ``libc.a`` and use LTO to generate the |
| final executable. |
| |
| .. code-block:: c++ |
| |
| #include <stdio.h> |
| |
| int main() { fputs("Hello from AMDGPU!\n", stdout); } |
| |
This program can then be compiled using the ``clang`` compiler. Note that
``-flto`` and ``-mcpu=`` should be specified. This is because the GPU
sub-architectures do not have strict backwards compatibility. Use ``-mcpu=help``
| for accepted arguments or ``-mcpu=native`` to target the system's installed GPUs |
| if present. Additionally, the AMDGPU target always uses ``-flto`` because we |
| currently do not fully support ELF linking in ``lld``. Once built, we use the |
| ``amdhsa-loader`` utility to launch execution on the GPU. This will be built if |
| the ``hsa_runtime64`` library was found during build time. |
| |
| .. code-block:: sh |
| |
| $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o |
| $> amdhsa-loader --threads 2 --blocks 2 a.out |
| Hello from AMDGPU! |
| Hello from AMDGPU! |
| Hello from AMDGPU! |
| Hello from AMDGPU! |
| |
| This will include the ``stdio.h`` header, which is found in the |
``include/amdgcn-amd-amdhsa`` directory. We define our ``main`` function like a
| standard application. The startup utility in ``lib/amdgcn-amd-amdhsa/crt1.o`` |
| will handle the necessary steps to execute the ``main`` function along with |
| global initializers and command line arguments. Finally, we link in the |
| ``libc.a`` library stored in ``lib/amdgcn-amd-amdhsa`` to define the standard C |
| functions. |
| |
| The search paths for the include directories and libraries are automatically |
| handled by the compiler. We use this support internally to run unit tests on the |
| GPU directly. See :ref:`libc_gpu_testing` for more information. The installation |
| also provides ``libc.bc`` which is a single LLVM-IR bitcode blob that can be |
| used instead of the static library. |
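
The following is only a sketch of how that bitcode could be linked in place of
``-lc``, assuming the ``-x ir`` pattern from the NVPTX example below also
applies to the AMDGPU target.

.. code-block:: sh

  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto \
       -x ir <install>/lib/amdgcn-amd-amdhsa/libc.bc \
       <install>/lib/amdgcn-amd-amdhsa/crt1.o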
| |
| Building for NVPTX targets |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The infrastructure is the same as the AMDGPU example. However, the NVPTX binary |
utilities are very limited and must be targeted directly. There is no linker
support for static libraries, so we need to link in the ``libc.bc`` bitcode and
| inform the compiler driver of the file's contents. |
| |
| .. code-block:: c++ |
| |
| #include <stdio.h> |
| |
| int main(int argc, char **argv, char **envp) { |
| fputs("Hello from NVPTX!\n", stdout); |
| } |
| |
Additionally, the NVPTX ABI requires that every function's signature matches its
call sites exactly, so we must declare ``main`` with its full prototype as shown
above. The installation will
| contain the ``nvptx-loader`` utility if the CUDA driver was found during |
| compilation. |
| |
| .. code-block:: sh |
| |
| $> clang hello.c --target=nvptx64-nvidia-cuda -march=native \ |
| -x ir <install>/lib/nvptx64-nvidia-cuda/libc.bc \ |
| -x ir <install>/lib/nvptx64-nvidia-cuda/crt1.o |
| $> nvptx-loader --threads 2 --blocks 2 a.out |
| Hello from NVPTX! |
| Hello from NVPTX! |
| Hello from NVPTX! |
| Hello from NVPTX! |