Dynamo Overview
===============

Before you read this section, read :ref:`torch.compiler_overview`.

TorchDynamo (or simply Dynamo) is a Python-level Just-In-Time (JIT) compiler designed to make
unmodified PyTorch programs faster. Dynamo hooks into the frame evaluation
API in CPython (`PEP 523 <https://peps.python.org/pep-0523/>`__) to
dynamically modify Python bytecode right before it is executed. It
rewrites Python bytecode to extract sequences of PyTorch
operations into an `FX Graph <https://pytorch.org/docs/stable/fx.html>`__,
which is then compiled with a customizable backend.
It creates this FX Graph through bytecode analysis and is designed to
mix Python execution with compiled backends to get the best of both
worlds: usability and performance.

Dynamo makes it easy to experiment with different compiler backends to
make PyTorch code faster with a one-line decorator,
``torch._dynamo.optimize()``, which is wrapped for convenience by
``torch.compile()``.

The following diagram demonstrates how PyTorch works with ``torch.compile``
and without it:

.. image:: _static/img/dynamo/TorchDynamo.png

`TorchInductor` is one of the backends supported by Dynamo. It compiles
the captured `FX Graph <https://pytorch.org/docs/stable/fx.html>`__
into `Triton <https://github.com/openai/triton>`__ kernels for GPUs or
`C++/OpenMP <https://www.openmp.org/>`__ code for CPUs. We have a
`training performance dashboard <https://github.com/pytorch/torchdynamo/issues/681#issuecomment-1233828468>`__
that compares the performance of different training backends. You can read
more in the `TorchInductor post on PyTorch
dev-discuss <https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747>`__.

For an in-depth overview, read the sections below, watch the deep-dive video,
and check out the dev-discuss topics.

* `Dynamo deep-dive video <https://www.youtube.com/watch?v=egZB5Uxki0I>`__
* `dev-discuss topics <https://dev-discuss.pytorch.org/search?q=TorchDynamo%20order%3Alatest>`__

Dynamo Internals
~~~~~~~~~~~~~~~~
**Authors**: `Jason Ansel <https://github.com/jansel>`_ and `Kaichao You <https://github.com/youkaichao>`_

This section goes over some of the Dynamo internals and
demonstrates how Dynamo works under the hood.

What is a guard?
----------------

Dynamo operates just-in-time and specializes graphs based on
dynamic properties. Below is a basic example of how to use Dynamo.
One can decorate a function or a method using ``torchdynamo.optimize`` to enable
Dynamo optimization:

.. code-block:: python

    from typing import List

    import torch
    from torch import _dynamo as torchdynamo


    def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
        print("my_compiler() called with FX graph:")
        gm.graph.print_tabular()
        return gm.forward  # return a python callable


    @torchdynamo.optimize(my_compiler)
    def toy_example(a, b):
        x = a / (torch.abs(a) + 1)
        if b.sum() < 0:
            b = b * -1
        return x * b


    for _ in range(100):
        toy_example(torch.randn(10), torch.randn(10))

For example, the first graph captured from ``toy_example`` has the following
guards:

::

    GUARDS:
    hasattr(L['a'], '_dynamo_dynamic_indices') == False
    hasattr(L['b'], '_dynamo_dynamic_indices') == False
    utils_device.CURRENT_DEVICE == None
    ___skip_backend_check() or ___current_backend() == ___lookup_backend(140355900538256)
    check_tensor(L['a'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[10], stride=[1])
    check_tensor(L['b'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[10], stride=[1])

If any of those guards fail, the graph will be recaptured and
recompiled. The interesting guard there is ``check_tensor``, which
checks the following ``torch.Tensor`` properties:

- Python class of the tensor (tensor subclassing, etc.)
- dtype
- device
- requires_grad
- dispatch_key (with thread-local includes/excludes applied)
- ndim
- sizes\*
- strides\*

The full specialization mode allows the backend compiler to assume an
entirely static graph. Unfortunately, most backends require this.
Operators which return dynamic shapes will trigger a graph break when
not in dynamic shape mode.
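
This kind of shape specialization can be illustrated with a torch-free
sketch: a toy cache compiles one artifact per distinct input shape, just
as a full-specialization guard would enforce. All names here
(``compile_for``, ``run``) are illustrative, not Dynamo's real API:

```python
# Torch-free sketch: one "compiled" artifact per distinct input shape.
compile_count = 0
cache = {}

def compile_for(shape):
    """Stand-in for a backend compile; counts invocations."""
    global compile_count
    compile_count += 1
    return lambda xs: [x * 2 for x in xs]  # trivial stand-in kernel

def run(xs):
    shape = (len(xs),)       # the property the guard specializes on
    if shape not in cache:   # guard miss -> "recompile"
        cache[shape] = compile_for(shape)
    return cache[shape](xs)

run([1, 2, 3])
run([4, 5, 6])  # same shape: reuses the cached artifact
run([1, 2])     # new shape: guard fails, compiles again
print(compile_count)  # 2
```

Calling ``run`` with a new shape misses the guard and triggers another
"compilation", mirroring how a failing ``check_tensor`` guard causes
Dynamo to recapture and recompile the graph.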

What is Dynamo doing?
---------------------

If you want to understand better what Dynamo is doing, you can run your code with:

::

    TORCH_LOGS="+dynamo,guards,bytecode"

If you are not familiar with Python bytecode, you can add a decompiler hook
to decompile the bytecode into human-readable source code. One available
tool is `depyf <https://github.com/youkaichao/depyf>`__. If you don't have
``depyf`` already installed, run ``pip install depyf``. Then, add the
following code to install decompilation hooks before you run any code:

.. code-block:: python

    import depyf
    depyf.install()

This code triggers useful (but spammy) printouts.

For example, the printouts for the first graph of ``toy_example``
are:

::

    __compiled_fn_0 <eval_with_key>.1
    opcode         name     target                                                  args              kwargs
    -------------  -------  ------------------------------------------------------  ----------------  --------
    placeholder    a        a                                                       ()                {}
    placeholder    b        b                                                       ()                {}
    call_function  abs_1    <built-in method abs of type object at 0x7f9ca082f8a0>  (a,)              {}
    call_function  add      <built-in function add>                                 (abs_1, 1)        {}
    call_function  truediv  <built-in function truediv>                             (a, add)          {}
    call_method    sum_1    sum                                                     (b,)              {}
    call_function  lt       <built-in function lt>                                  (sum_1, 0)        {}
    output         output   output                                                  ((truediv, lt),)  {}

    ORIGINAL BYTECODE toy_example example.py line 12
     14           0 LOAD_FAST                0 (a)
                  2 LOAD_GLOBAL              0 (torch)
                  4 LOAD_METHOD              1 (abs)
                  6 LOAD_FAST                0 (a)
                  8 CALL_METHOD              1
                 10 LOAD_CONST               1 (1)
                 12 BINARY_ADD
                 14 BINARY_TRUE_DIVIDE
                 16 STORE_FAST               2 (x)

     15          18 LOAD_FAST                1 (b)
                 20 LOAD_METHOD              2 (sum)
                 22 CALL_METHOD              0
                 24 LOAD_CONST               2 (0)
                 26 COMPARE_OP               0 (<)
                 28 POP_JUMP_IF_FALSE       19 (to 38)

     16          30 LOAD_FAST                1 (b)
                 32 LOAD_CONST               3 (-1)
                 34 BINARY_MULTIPLY
                 36 STORE_FAST               1 (b)

     17      >>  38 LOAD_FAST                2 (x)
                 40 LOAD_FAST                1 (b)
                 42 BINARY_MULTIPLY
                 44 RETURN_VALUE


    MODIFIED BYTECODE toy_example example.py line 12
     12           0 LOAD_GLOBAL              3 (__compiled_fn_0)
                  2 LOAD_FAST                0 (a)
                  4 LOAD_FAST                1 (b)
                  6 CALL_FUNCTION            2
                  8 UNPACK_SEQUENCE          2
                 10 STORE_FAST               2 (x)
                 12 POP_JUMP_IF_FALSE       12 (to 24)
                 14 LOAD_GLOBAL              4 (__resume_at_30_1)
                 16 LOAD_FAST                1 (b)
                 18 LOAD_FAST                2 (x)
                 20 CALL_FUNCTION            2
                 22 RETURN_VALUE
             >>  24 LOAD_GLOBAL              5 (__resume_at_38_2)
                 26 LOAD_FAST                1 (b)
                 28 LOAD_FAST                2 (x)
                 30 CALL_FUNCTION            2
                 32 RETURN_VALUE


    possible source code:
    def toy_example(a, b):
        __temp_1 = __compiled_fn_0(a, b)
        x = __temp_1[0]
        if __temp_1[1]:
            return __resume_at_30_1(b, x)
        return __resume_at_38_2(b, x)

    If you find the decompiled code is wrong, please submit an issue at https://github.com/youkaichao/depyf/issues.

At the top, you can see the FX graph.
Next, you see the original bytecode of the function, followed by the
modified bytecode generated by Dynamo and the decompiled source
code for reference. Finally, you see the guards, which we covered above.

In the modified bytecode, ``__compiled_fn_0`` is the return value of
``my_compiler()`` (the compiled graph). ``__resume_at_30_1`` and
``__resume_at_38_2`` are both generated continuation functions that pick
up execution after a graph break (at bytecode offsets 30 and 38). Each
of these functions takes the form:

::

    __resume_at_<offset>:
        ... restore stack state if needed ...
        JUMP_ABSOLUTE <offset> into toy_example
        ... original bytecode of toy_example ...

By generating this ``resume_at`` function, we force the remainder of the
function to be executed in a new Python frame, which recursively
triggers Dynamo to restart its capture once execution reaches that
point for the first time.
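
In source-level terms (Dynamo actually performs this split at the
bytecode level, so this is only an analogy), the rewrite looks like the
sketch below. Scalars stand in for tensors, and all names are
hypothetical:

```python
def compiled_fn_0(a, b):
    # Stand-in for the captured graph: computes x and the branch test.
    x = a / (abs(a) + 1)
    return x, (b < 0)  # scalar stand-in for b.sum() < 0

def resume_at_30(b, x):
    # Continuation when the branch is taken.
    b = b * -1
    return x * b

def resume_at_38(b, x):
    # Continuation when the branch is not taken.
    return x * b

def toy_example_rewritten(a, b):
    x, cond = compiled_fn_0(a, b)
    if cond:  # the graph break: plain Python picks the continuation
        return resume_at_30(b, x)
    return resume_at_38(b, x)

print(toy_example_rewritten(3.0, -2.0))  # 1.5
```

Each continuation runs in a fresh frame, so when Dynamo's frame hook
sees it for the first time, graph capture restarts from there.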

How to inspect artifacts generated by Dynamo?
---------------------------------------------

To inspect the artifacts generated by Dynamo, there is an API
``torch._dynamo.eval_frame._debug_get_cache_entry_list`` that retrieves
compiled code and guards from a function's ``__code__`` object. A compiled
function can have several cache entries, and each cache entry consists of a
generated function to check guards and a ``types.CodeType`` object that holds
the code to be executed if the guarding conditions are satisfied.

.. code-block:: python

    from torch._dynamo.eval_frame import _debug_get_cache_entry_list, innermost_fn

    cache_entries = _debug_get_cache_entry_list(innermost_fn(toy_example))
    cache_entry = cache_entries[0]
    guard, code = cache_entry.check_fn, cache_entry.code
    # the guard takes the local variables of an input frame and tells
    # whether a re-compilation should be triggered
    import dis
    dis.dis(guard)
    dis.dis(code)

If you know Python bytecode, you can understand the above output.

For the guard function, there is no need to inspect the bytecode. We can
directly access its guarding conditions:

.. code-block:: python

    for code_part in guard.code_parts:
        print(code_part)

The output is:

::

    ___guarded_code.valid
    ___check_global_state()
    hasattr(L['a'], '_dynamo_dynamic_indices') == False
    hasattr(L['b'], '_dynamo_dynamic_indices') == False
    utils_device.CURRENT_DEVICE == None
    ___skip_backend_check() or ___current_backend() == ___lookup_backend(140215810860528)
    ___check_tensors(L['a'], L['b'], tensor_check_names=tensor_check_names)

The guard function returns true only when all of the conditions are
satisfied; only then is the compiled code executed.
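
Conceptually, the guard is a conjunction of these ``code_parts``
evaluated against the frame's locals. A torch-free sketch (with
simplified, hypothetical ``code_parts``) might look like:

```python
# Simplified, hypothetical guard conditions over the frame's locals ``L``.
code_parts = [
    "L['a'].requires_grad == False",
    "L['a'].size == (10,)",
]

class FakeTensor:
    """Minimal stand-in holding only the guarded properties."""
    def __init__(self, size, requires_grad=False):
        self.size = size
        self.requires_grad = requires_grad

def guard(L):
    # All parts must hold for the cached compiled code to be reused.
    return all(eval(part, {"L": L}) for part in code_parts)

print(guard({"a": FakeTensor((10,))}))  # True  -> run the cached code
print(guard({"a": FakeTensor((20,))}))  # False -> recompile
```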

For the compiled code, we cannot directly access its source, so we have to
decompile it:

.. code-block:: python

    from depyf import decompile

    print(decompile(code))

The output is:

::

    def toy_example(a, b):
        __temp_1 = __compiled_fn_0(a, b)
        x = __temp_1[0]
        if __temp_1[1]:
            return __resume_at_30_1(b, x)
        return __resume_at_38_2(b, x)

Some names referenced in the code are:

- Compiled functions, stored in the global namespace of the module containing
  the original function ``toy_example``. These include names like
  ``__compiled_fn_0``, ``__resume_at_30_1``, and ``__resume_at_38_2``.

- Closure variables used for checking guards. Their names can be accessed from
  ``guard.__code__.co_freevars``, and their values are stored in
  ``guard.__closure__``. These include names like ``___guarded_code``,
  ``___is_grad_enabled``, ``___are_deterministic_algorithms_enabled``,
  ``___is_torch_function_enabled``, ``utils_device``, ``___check_tensors``,
  and ``tensor_check_names``.

- The argument ``L`` of the ``guard`` function. This is a dict mapping the
  names of the arguments of ``toy_example`` to their values. It is only
  available when the function is called, which is where the frame evaluation
  API comes into play. In short, ``L`` is a ``dict`` with the structure
  ``{'a': value_a, 'b': value_b}``, which is why the code uses ``L['a']`` to
  refer to the input variable ``a``.

The graph break is visible in the code of the compiled ``toy_example``, where
the Python interpreter has to select the next graph to execute.

Note that we pass a simple ``my_compiler`` function as the backend compiler,
so the subgraph code ``__resume_at_38_2``, ``__resume_at_30_1``, and
``__compiled_fn_0`` remains Python code. This can also be inspected (please
ignore the function names, and use only the function signatures and bodies):

.. code-block:: python

    print("source code of __compiled_fn_0:")
    print(innermost_fn(__compiled_fn_0).__self__.code)
    print("=" * 60)
    print("source code of __resume_at_30_1:")
    print(decompile(__resume_at_30_1))
    print("=" * 60)
    print("source code of __resume_at_38_2:")
    print(decompile(__resume_at_38_2))

::

    source code of __compiled_fn_0:

    def forward(self, L_a_ : torch.Tensor, L_b_ : torch.Tensor):
        l_a_ = L_a_
        l_b_ = L_b_
        abs_1 = torch.abs(l_a_)
        add = abs_1 + 1; abs_1 = None
        truediv = l_a_ / add; l_a_ = add = None
        sum_1 = l_b_.sum(); l_b_ = None
        lt = sum_1 < 0; sum_1 = None
        return (truediv, lt)

    # To see more debug info, please use ``graph_module.print_readable()``
    ============================================================
    source code of __resume_at_30_1:
    def <resume in toy_example>(b, x):
        b = b * -1
        return x * b

    ============================================================
    source code of __resume_at_38_2:
    def <resume in toy_example>(b, x):
        return x * b

However, if we use other backends like the built-in ``inductor``, the subgraph
code will be compiled into CUDA kernels for GPUs or C++ code for CPUs.

To summarize, the compiled code is conceptually equivalent to the code below:

.. code-block:: python

    def compiled_example(a, b):
        L = {'a': a, 'b': b}
        for guard, code in get_cache_entries():
            if guard(L):
                return code(a, b)
        recompile_and_add_another_cache_entry()
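
A runnable, torch-free toy version of that dispatch loop (all names
hypothetical; scalars stand in for tensors) shows how new cache entries
accumulate:

```python
cache_entries = []  # list of (guard, compiled_code) pairs

def compiled_example(a, b):
    L = {"a": a, "b": b}
    for guard, code in cache_entries:
        if guard(L):           # cache hit: reuse the compiled code
            return code(a, b)
    # Cache miss: "recompile", specializing on the branch direction,
    # and add another cache entry.
    taken = b < 0
    def guard(L, taken=taken):
        return (L["b"] < 0) == taken
    def code(a, b, taken=taken):
        x = a / (abs(a) + 1)
        return x * (b * -1 if taken else b)
    cache_entries.append((guard, code))
    return code(a, b)

print(compiled_example(3.0, -2.0))  # 1.5 (compiles the branch-taken entry)
print(compiled_example(3.0, 2.0))   # 1.5 (compiles the branch-not-taken entry)
print(len(cache_entries))           # 2
```

This mirrors the real structure: each cache entry pairs a guard with
specialized code, and a miss triggers recompilation.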

The following diagram demonstrates how ``torch.compile`` transforms and
optimizes user-written code: it first extracts computation graphs from the
user-written function, compiles these graphs into optimized functions, and
then assembles them into a new function that is functionally equivalent to
the user-written code but optimized for speed.

.. image:: _static/img/dynamo/flowchart.jpg

To learn more about how all this is implemented internally, see :ref:`torch.compiler_dynamo_deepdive`.