| ============================= |
| Offloading Design & Internals |
| ============================= |
| |
| .. contents:: |
| :local: |
| |
| Introduction |
| ============ |
| |
| This document describes the Clang driver and code generation steps for creating |
| offloading applications. Clang supports offloading to various architectures |
| using programming models like CUDA, HIP, and OpenMP. The purpose of this |
| document is to illustrate the steps necessary to create an offloading |
| application using Clang. |
| |
| OpenMP Offloading |
| ================= |
| |
| Clang supports OpenMP target offloading to several different architectures such |
| as NVPTX, AMDGPU, X86_64, Arm, and PowerPC. Offloading code is generated by |
| Clang and then executed using the ``libomptarget`` runtime and the associated |
| plugin for the target architecture, e.g. ``libomptarget.rtl.cuda``. This section |
| describes the steps necessary to create a functioning device image that can be |
| loaded by the OpenMP runtime. More information on the OpenMP runtimes can be |
| found at the `OpenMP documentation page <https://openmp.llvm.org>`__. |
| |
| .. _Offloading Overview: |
| |
| Offloading Overview |
| ------------------- |
| |
| The goal of offloading compilation is to create an executable device image that |
| can be run on the target device. OpenMP offloading creates executable images by |
| compiling the input file for both the host and the target device. The output |
| from the device phase then needs to be embedded into the host to create a fat |
| object. A special tool then needs to extract the device code from the fat |
| objects, run the device linking step, and embed the final image in a symbol the |
| host runtime library can use to register the library and access the symbols on |
| the device. |
| |
| Compilation Process |
| ^^^^^^^^^^^^^^^^^^^ |
| |
| The compiler performs the following high-level actions to generate OpenMP |
| offloading code: |
| |
| * Compile the input file for the host to produce a bitcode file. Lower ``#pragma |
| omp target`` declarations to :ref:`offloading entries <Generating Offloading |
| Entries>` and create metadata to indicate which entries are on the device. |
| * Compile the input file for the target :ref:`device <Device Compilation>` using |
| the :ref:`offloading entry <Generating Offloading Entries>` metadata created |
| by the host. |
| * Link the OpenMP device runtime library and run the backend to create a device |
| object file. |
| * Run the backend on the host bitcode file and create a :ref:`fat object file |
| <Creating Fat Objects>` using the device object file. |
| * Pass the fat object file to the :ref:`linker wrapper tool <Device Linking>` |
| and extract the device objects. Run the device linking action on the extracted |
| objects. |
| * :ref:`Wrap <Device Binary Wrapping>` the :ref:`device images <Device linking>` |
| and :ref:`offload entries <Generating Offloading Entries>` in a symbol that |
| can be accessed by the host. |
| * Add the :ref:`wrapped binary <Device Binary Wrapping>` to the linker input and |
| run the host linking action. Link with ``libomptarget`` to register and |
| execute the images. |
| |
| .. _Generating Offloading Entries: |
| |
| Generating Offloading Entries |
| ----------------------------- |
| |
| The first step in compilation is to generate offloading entries for the host. |
| This information is used to identify function kernels or global values that will |
| be provided by the device. Blocks contained in a ``#pragma omp target`` or |
| symbols inside a ``#pragma omp declare target`` directive will have offloading |
| entries generated. The following table shows the :ref:`offload entry structure |
| <table-tgt_offload_entry_structure>`. |
| |
| .. table:: __tgt_offload_entry Structure |
| :name: table-tgt_offload_entry_structure |
| |
| +---------+------------+------------------------------------------------------------------------+ |
| | Type | Identifier | Description | |
| +=========+============+========================================================================+ |
| | void* | addr | Address of global symbol within device image (function or global) | |
| +---------+------------+------------------------------------------------------------------------+ |
| | char* | name | Name of the symbol | |
| +---------+------------+------------------------------------------------------------------------+ |
| | size_t | size | Size of the entry info (0 if it is a function) | |
| +---------+------------+------------------------------------------------------------------------+ |
| | int32_t | flags | Flags associated with the entry (see :ref:`table-offload_entry_flags`) | |
| +---------+------------+------------------------------------------------------------------------+ |
| | int32_t | reserved | Reserved, to be used by the runtime library. | |
| +---------+------------+------------------------------------------------------------------------+ |
| |
| The address of the global symbol will be set to the device pointer value by the |
| runtime once the device image is loaded. The flags are set to indicate the |
| handling required for the offloading entry. If the offloading entry is an entry |
| to a target region it can have one of the following :ref:`entry flags |
| <table-offload_entry_flags>`. |
| |
| .. table:: Target Region Entry Flags |
| :name: table-offload_entry_flags |
| |
| +----------------------------------+-------+-----------------------------------------+ |
| | Name | Value | Description | |
| +==================================+=======+=========================================+ |
| | OMPTargetRegionEntryTargetRegion | 0x00 | Mark the entry as generic target region | |
| +----------------------------------+-------+-----------------------------------------+ |
| | OMPTargetRegionEntryCtor | 0x02 | Mark the entry as a global constructor | |
| +----------------------------------+-------+-----------------------------------------+ |
| | OMPTargetRegionEntryDtor | 0x04 | Mark the entry as a global destructor | |
| +----------------------------------+-------+-----------------------------------------+ |
| |
| If the offloading entry is a global variable, indicated by a non-zero size, it |
| will instead have one of the following :ref:`global |
| <table-offload_global_flags>` flags. |
| |
| .. table:: Target Region Global |
| :name: table-offload_global_flags |
| |
| +-----------------------------+-------+---------------------------------------------------------------+ |
| | Name | Value | Description | |
| +=============================+=======+===============================================================+ |
| | OMPTargetGlobalVarEntryTo | 0x00 | Mark the entry as a 'to' attribute (w.r.t. the to clause) | |
| +-----------------------------+-------+---------------------------------------------------------------+ |
| | OMPTargetGlobalVarEntryLink | 0x01 | Mark the entry as a 'link' attribute (w.r.t. the link clause) | |
| +-----------------------------+-------+---------------------------------------------------------------+ |
| |
| The target offload entries are used by the runtime to access the device kernels |
| and globals that will be provided by the final device image. Each offloading |
| entry is set to use the ``omp_offloading_entries`` section. When the final |
| application is created the linker will provide the |
| ``__start_omp_offloading_entries`` and ``__stop_omp_offloading_entries`` symbols |
| which are used to create the :ref:`final image <Device Binary Wrapping>`. |
| |
| This information is used by the device compilation stage to determine which |
| symbols need to be exported from the device. We use the ``omp_offload.info`` |
| metadata node to pass this information device compilation stage. |
| |
| Accessing Entries on the Device |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Accessing the entries in the device is done using the address field in the |
| :ref:`offload entry<table-tgt_offload_entry_structure>`. The runtime will set |
| the address to the pointer associated with the device image during runtime |
| initialization. This is used to call the corresponding kernel function when |
| entering a ``#pragma omp target`` region. For variables, the runtime maintains a |
| table mapping host pointers to device pointers. Global variables inside a |
| ``#pragma omp target declare`` directive are first initialized to the host's |
| address. Once the device address is initialized we insert it into the table to |
| map the host address to the device address. |
| |
| Debugging Information |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
| We generate structures to hold debugging information that is passed to |
| ``libomptarget``. This allows the front-end to generate information the runtime |
| library uses for more informative error messages. This is done using the |
| standard :ref:`identifier structure <table-ident_t_structure>` used in |
| ``libomp`` and ``libomptarget``. This is used to pass information and source |
| locations to the runtime. |
| |
| .. table:: ident_t Structure |
| :name: table-ident_t_structure |
| |
| +---------+------------+-----------------------------------------------------------------------------+ |
| | Type | Identifier | Description | |
| +=========+============+=============================================================================+ |
| | int32_t | reserved | Reserved, to be used by the runtime library. | |
| +---------+------------+-----------------------------------------------------------------------------+ |
| | int32_t | flags | Flags used to indicate some features, mostly unused. | |
| +---------+------------+-----------------------------------------------------------------------------+ |
| | int32_t | reserved | Reserved, to be used by the runtime library. | |
| +---------+------------+-----------------------------------------------------------------------------+ |
| | int32_t | reserved | Reserved, to be used by the runtime library. | |
| +---------+------------+-----------------------------------------------------------------------------+ |
| | char* | psource | Program source information, stored as ";filename;function;line;column;;\\0" | |
| +---------+------------+-----------------------------------------------------------------------------+ |
| |
| If debugging information is enabled, we will also create strings to indicate the |
| names and declarations of variables mapped in target regions. These have the |
| same format as the source location in the :ref:`identifier structure |
| <table-ident_t_structure>`, but the function name is replaced with the variable |
| name. |
| |
| .. _Device Compilation: |
| |
| Offload Device Compilation |
| -------------------------- |
| |
| The input file is compiled for each active device toolchain. The device |
| compilation stage is performed differently from the host stage. Namely, we do |
| not generate any offloading entries. This is set by passing the |
| ``-fopenmp-is-target-device`` flag to the front-end. We use the host bitcode to |
| determine which symbols to export from the device. The bitcode file is passed in |
| from the previous stage using the ``-fopenmp-host-ir-file-path`` flag. |
| Compilation is otherwise performed as it would be for any other target triple. |
| |
| When compiling for the OpenMP device, we set the visibility of all device |
| symbols to be ``protected`` by default. This improves performance and prevents a |
| class of errors where a symbol in the target device could preempt a host |
| library. |
| |
| The OpenMP runtime library is linked in during compilation to provide the |
| implementations for standard OpenMP functionality. For GPU targets this is done |
| by linking in a special bitcode library during compilation, (e.g. |
| ``libomptarget-nvptx64-sm_70.bc``) using the ``-mlink-builtin-bitcode`` flag. |
| Other device libraries, such as CUDA's libdevice, are also linked this way. If |
| the target is a standard architecture with an existing ``libomp`` |
| implementation, that will be linked instead. Finally, device tools are used to |
| create a relocatable device object file that can be embedded in the host. |
| |
| .. _Creating Fat Objects: |
| |
| Creating Fat Objects |
| -------------------- |
| |
| A fat binary is a binary file that contains information intended for another |
| device. We create a fat object by embedding the output of the device compilation |
| stage into the host as a named section. The output from the device compilation |
| is passed to the host backend using the ``-fembed-offload-object`` flag. This |
| embeds the device image into the ``.llvm.offloading`` section using a special |
| binary format that behaves like a string map. This binary format is used to |
| bundle metadata about the image so the linker can associate the proper device |
| linking action with the image. Each device image will start with the magic bytes |
| ``0x10FF10AD``. |
| |
| .. code-block:: llvm |
| |
| @llvm.embedded.object = private constant [1 x i8] c"\00", section ".llvm.offloading" |
| |
| The device code will then be placed in the corresponding section one the backend |
| is run on the host, creating a fat object. Using fat objects allows us to treat |
| offloading objects as standard host objects. The final object file should |
| contain the following :ref:`offloading sections <table-offloading_sections>`. We |
| will use this information when :ref:`Device Linking`. |
| |
| .. table:: Offloading Sections |
| :name: table-offloading_sections |
| |
| +----------------------------------+------------------------------------------------------------------------------+ |
| | Section | Description | |
| +==================================+==============================================================================+ |
| | omp_offloading_entries | Offloading entry information (see :ref:`table-tgt_offload_entry_structure`) | |
| +----------------------------------+------------------------------------------------------------------------------+ |
| | .llvm.offloading | Embedded device object file for the target device and architecture | |
| +----------------------------------+------------------------------------------------------------------------------+ |
| |
| .. _Device Linking: |
| |
| Linking Target Device Code |
| -------------------------- |
| |
| Objects containing :ref:`table-offloading_sections` require special handling to |
| create an executable device image. This is done using a Clang tool, see |
| :doc:`ClangLinkerWrapper` for more information. This tool works as a wrapper |
| over the host linking job. It scans the input object files for the offloading |
| section ``.llvm.offloading``. The device files stored in this section are then |
| extracted and passed to the appropriate linking job. The linked device image is |
| then :ref:`wrapped <Device Binary Wrapping>` to create the symbols used to load |
| the device image and link it with the host. |
| |
| The linker wrapper tool supports linking bitcode files through link time |
| optimization (LTO). This is used whenever the object files embedded in the host |
| contain LLVM bitcode. Bitcode will be embedded for architectures that do not |
| support a relocatable object format, such as AMDGPU or SPIR-V, or if the user |
| requested it using the ``-foffload-lto`` flag. |
| |
| .. _Device Binary Wrapping: |
| |
| Device Binary Wrapping |
| ---------------------- |
| |
| Various structures and functions are used to create the information necessary to |
| offload code on the device. We use the :ref:`linked device executable <Device |
| Linking>` with the corresponding offloading entries to create the symbols |
| necessary to load and execute the device image. |
| |
| Structure Types |
| ^^^^^^^^^^^^^^^ |
| |
| Several different structures are used to store offloading information. The |
| :ref:`device image structure <table-device_image_structure>` stores a single |
| linked device image and its associated offloading entries. The offloading |
| entries are stored using the ``__start_omp_offloading_entries`` and |
| ``__stop_omp_offloading_entries`` symbols generated by the linker using the |
| :ref:`table-tgt_offload_entry_structure`. |
| |
| .. table:: __tgt_device_image Structure |
| :name: table-device_image_structure |
| |
| +----------------------+--------------+----------------------------------------+ |
| | Type | Identifier | Description | |
| +======================+==============+========================================+ |
| | void* | ImageStart | Pointer to the target code start | |
| +----------------------+--------------+----------------------------------------+ |
| | void* | ImageEnd | Pointer to the target code end | |
| +----------------------+--------------+----------------------------------------+ |
| | __tgt_offload_entry* | EntriesBegin | Begin of table with all target entries | |
| +----------------------+--------------+----------------------------------------+ |
| | __tgt_offload_entry* | EntriesEnd | End of table (non inclusive) | |
| +----------------------+--------------+----------------------------------------+ |
| |
| The target :ref:`target binary descriptor <table-target_binary_descriptor>` is |
| used to store all binary images and offloading entries in an array. |
| |
| .. table:: __tgt_bin_desc Structure |
| :name: table-target_binary_descriptor |
| |
| +----------------------+------------------+------------------------------------------+ |
| | Type | Identifier | Description | |
| +======================+==================+==========================================+ |
| | int32_t | NumDeviceImages | Number of device types supported | |
| +----------------------+------------------+------------------------------------------+ |
| | __tgt_device_image* | DeviceImages | Array of device images (1 per dev. type) | |
| +----------------------+------------------+------------------------------------------+ |
| | __tgt_offload_entry* | HostEntriesBegin | Begin of table with all host entries | |
| +----------------------+------------------+------------------------------------------+ |
| | __tgt_offload_entry* | HostEntriesEnd | End of table (non inclusive) | |
| +----------------------+------------------+------------------------------------------+ |
| |
| Global Variables |
| ---------------- |
| |
| :ref:`table-global_variables` lists various global variables, along with their |
| type and their explicit ELF sections, which are used to store device images and |
| related symbols. |
| |
| .. table:: Global Variables |
| :name: table-global_variables |
| |
| +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| | Variable | Type | ELF Section | Description | |
| +================================+=====================+=========================+=========================================================+ |
| | __start_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | Begin symbol for the offload entries table. | |
| +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| | __stop_omp_offloading_entries | __tgt_offload_entry | .omp_offloading_entries | End symbol for the offload entries table. | |
| +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| | __dummy.omp_offloading.entry | __tgt_offload_entry | .omp_offloading_entries | Dummy zero-sized object in the offload entries | |
| | | | | section to force linker to define begin/end | |
| | | | | symbols defined above. | |
| +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| | .omp_offloading.device_image | __tgt_device_image | .omp_offloading_entries | ELF device code object of the first image. | |
| +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| | .omp_offloading.device_image.N | __tgt_device_image | .omp_offloading_entries | ELF device code object of the (N+1)th image. | |
| +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| | .omp_offloading.device_images | __tgt_device_image | .omp_offloading_entries | Array of images. | |
| +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| | .omp_offloading.descriptor | __tgt_bin_desc | .omp_offloading_entries | Binary descriptor object (see :ref:`binary_descriptor`) | |
| +--------------------------------+---------------------+-------------------------+---------------------------------------------------------+ |
| |
| .. _binary_descriptor: |
| |
| Binary Descriptor for Device Images |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| This object is passed to the offloading runtime at program startup and it |
| describes all device images available in the executable or shared library. It |
| is defined as follows: |
| |
| .. code-block:: c |
| |
| __attribute__((visibility("hidden"))) |
| extern __tgt_offload_entry *__start_omp_offloading_entries; |
| __attribute__((visibility("hidden"))) |
| extern __tgt_offload_entry *__stop_omp_offloading_entries; |
| static const char Image0[] = { <Bufs.front() contents> }; |
| ... |
| static const char ImageN[] = { <Bufs.back() contents> }; |
| static const __tgt_device_image Images[] = { |
| { |
| Image0, /*ImageStart*/ |
| Image0 + sizeof(Image0), /*ImageEnd*/ |
| __start_omp_offloading_entries, /*EntriesBegin*/ |
| __stop_omp_offloading_entries /*EntriesEnd*/ |
| }, |
| ... |
| { |
| ImageN, /*ImageStart*/ |
| ImageN + sizeof(ImageN), /*ImageEnd*/ |
| __start_omp_offloading_entries, /*EntriesBegin*/ |
| __stop_omp_offloading_entries /*EntriesEnd*/ |
| } |
| }; |
| static const __tgt_bin_desc BinDesc = { |
| sizeof(Images) / sizeof(Images[0]), /*NumDeviceImages*/ |
| Images, /*DeviceImages*/ |
| __start_omp_offloading_entries, /*HostEntriesBegin*/ |
| __stop_omp_offloading_entries /*HostEntriesEnd*/ |
| }; |
| |
| |
| Global Constructor and Destructor |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The global constructor (``.omp_offloading.descriptor_reg()``) registers the |
| device images with the runtime by calling the ``__tgt_register_lib()`` runtime |
| function. The constructor is explicitly defined in ``.text.startup`` section and |
| is run once when the program starts. Similarly, the global destructor |
| (``.omp_offloading.descriptor_unreg()``) calls ``__tgt_unregister_lib()`` for |
| the destructor and is also defined in ``.text.startup`` section and run when the |
| program exits. |
| |
| Offloading Example |
| ------------------ |
| |
| This section contains a simple example of generating offloading code using |
| OpenMP offloading. We will use a simple ``ZAXPY`` BLAS routine. |
| |
| .. code-block:: c++ |
| |
| #include <complex> |
| |
| using complex = std::complex<double>; |
| |
| void zaxpy(complex *X, complex *Y, complex D, std::size_t N) { |
| #pragma omp target teams distribute parallel for |
| for (std::size_t i = 0; i < N; ++i) |
| Y[i] = D * X[i] + Y[i]; |
| } |
| |
| int main() { |
| const std::size_t N = 1024; |
| complex X[N], Y[N], D; |
| #pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N]) |
| zaxpy(X, Y, D, N); |
| } |
| |
| This code is compiled using the following Clang flags. |
| |
| .. code-block:: console |
| |
| $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 zaxpy.cpp -c |
| |
| The output section in the object file can be seen using the ``readelf`` utility. |
| The ``.llvm.offloading`` section has the ``SHF_EXCLUDE`` flag so it will be |
| removed from the final executable or shared library by the linker. |
| |
| .. code-block:: text |
| |
| $ llvm-readelf -WS zaxpy.o |
| Section Headers: |
| [Nr] Name Type Address Off Size ES Flg Lk Inf Al |
| [11] omp_offloading_entries PROGBITS 0000000000000000 0001f0 000040 00 A 0 0 1 |
| [12] .llvm.offloading PROGBITS 0000000000000000 000260 030950 00 E 0 0 8 |
| |
| |
| Compiling this file again will invoke the ``clang-linker-wrapper`` utility to |
| extract and link the device code stored at the section named |
| ``.llvm.offloading`` and then use entries stored in |
| the section named ``omp_offloading_entries`` to create the symbols necessary for |
| ``libomptarget`` to register the device image and call the entry function. |
| |
| .. code-block:: console |
| |
| $ clang++ -fopenmp -fopenmp-targets=nvptx64 zaxpy.o -o zaxpy |
| $ ./zaxpy |
| |
| We can see the steps created by clang to generate the offloading code using the |
| ``-ccc-print-phases`` option in Clang. This matches the description in |
| :ref:`Offloading Overview`. |
| |
| .. code-block:: console |
| |
| $ clang++ -fopenmp -fopenmp-targets=nvptx64 -ccc-print-phases zaxpy.cpp |
| # "x86_64-unknown-linux-gnu" - "clang", inputs: ["zaxpy.cpp"], output: "/tmp/zaxpy-host.bc" |
| # "nvptx64-nvidia-cuda" - "clang", inputs: ["zaxpy.cpp", "/tmp/zaxpy-e6a41b.bc"], output: "/tmp/zaxpy-07f434.s" |
| # "nvptx64-nvidia-cuda" - "NVPTX::Assembler", inputs: ["/tmp/zaxpy-07f434.s"], output: "/tmp/zaxpy-0af7b7.o" |
| # "x86_64-unknown-linux-gnu" - "clang", inputs: ["/tmp/zaxpy-e6a41b.bc", "/tmp/zaxpy-0af7b7.o"], output: "/tmp/zaxpy-416cad.o" |
| # "x86_64-unknown-linux-gnu" - "Offload::Linker", inputs: ["/tmp/zaxpy-416cad.o"], output: "a.out" |
| |
| Relocatable Linking |
| ------------------- |
| |
| The offloading compilation pipeline normally will defer the final device linking |
| and runtime registration until the ``clang-linker-wrapper`` is run to create the |
| executable. This is the standard behaviour when compiling for OpenMP offloading |
| or CUDA and HIP in ``-fgpu-rdc`` mode. However, there are some cases where the |
| user may wish to perform this device handling prematurely. This is described in |
| the :doc:`linker wrapper documentation<ClangLinkerWrapper>`. |
| |
| Effectively, this allows the user to handle offloading specific linking ahead of |
| time when shipping objects or static libraries. This can be thought of as |
| performing a standard ``-fno-gpu-rdc`` compilation on a subset of object files. |
| This can be useful to reduce link time, prevent users from interacting with the |
| library's device code, or for shipping libraries to incompatible compilers. |
| |
| Normally, if a relocatable link is done using ``clang -r`` it will simply merge |
| the ``.llvm.offloading`` sections which will then be linked later when the |
| executable is created. However, if the ``-r`` flag is used with the offloading |
| toolchain, it will perform the device linking and registration phases and then |
| merge the registration code into the final relocatable object file. |
| |
| The following example shows how using the relocatable link with the offloading |
| pipeline can create a static library with offloading code that can be |
| redistributed without requiring any additional handling. |
| |
| .. code-block:: console |
| |
| $ clang++ -fopenmp -fopenmp-targets=nvptx64 foo.cpp -c |
| $ clang++ -lomptarget.devicertl --offload-link -r foo.o -o merged.o |
| $ llvm-ar rcs libfoo.a merged.o |
| # g++ app.cpp -L. -lfoo |
| |