| PyTorch 2.0 Troubleshooting |
| =========================== |
| |
| **Author**: `Michael Lazos <https://github.com/mlazos>`_ |
| |
| |
| .. note:: This document is currently outdated and requires revision. For the interim period, please refer to |
| the `comprehensive manual for torch.compile <https://docs.google.com/document/d/1y5CRfMLdwEoF1nTk9q8qEu1mgMUuUtvhklPKJ2emLU8/edit#heading=h.ivdr7fmrbeab>`__ |
| as the primary resource for troubleshooting guidance. |
| |
| |
We are actively developing debugging tools and profilers, and improving our
error and warning messages. Below is a table of the available
tools and their typical usage. For additional help see
`Diagnosing Runtime Errors <#diagnosing-runtime-errors>`__.
| |
.. list-table:: Debugging tools and their usage
| :widths: 25 25 50 |
| :header-rows: 1 |
| |
| * - Tool |
| - Purpose |
| - Usage |
| * - Info logging |
| - View summarized steps of compilation |
| - ``torch._logging.set_logs(dynamo = logging.INFO)`` or ``TORCH_LOGS="dynamo"`` |
| * - Debug logging |
| - View detailed steps of compilation (print every instruction traced) |
| - ``torch._logging.set_logs(dynamo = logging.DEBUG)`` and |
| ``torch._dynamo.config.verbose = True``, or ``TORCH_LOGS="+dynamo" TORCHDYNAMO_VERBOSE=1`` |
| * - Minifier for any backend |
| - Find smallest subgraph which reproduces errors for any backend |
     - Set the environment variable ``TORCHDYNAMO_REPRO_AFTER="dynamo"``
| * - Minifier for ``TorchInductor`` |
     - If the error is known to occur after ``AOTAutograd``, find the
       smallest subgraph which reproduces errors during ``TorchInductor`` lowering
     - Set the environment variable ``TORCHDYNAMO_REPRO_AFTER="aot"``
| * - Dynamo accuracy minifier |
| - Finds the smallest subgraph which reproduces an accuracy issue |
| between an eager mode model and optimized model, when you |
| suspect the problem is in ``AOTAutograd`` |
| - ``TORCHDYNAMO_REPRO_AFTER="dynamo" TORCHDYNAMO_REPRO_LEVEL=4`` |
| * - Inductor accuracy minifier |
| - Finds the smallest subgraph which reproduces an accuracy issue |
| between an eager mode model and optimized model, when you |
| suspect the problem is in the backend (e.g., inductor). |
| If this doesn't work, try the Dynamo accuracy minifier |
| instead. |
| - ``TORCHDYNAMO_REPRO_AFTER="aot" TORCHDYNAMO_REPRO_LEVEL=4`` |
| * - ``torch._dynamo.explain`` |
| - Find graph breaks and display reasoning for them |
| - ``torch._dynamo.explain(fn)(*inputs)`` |
| * - Record/Replay |
     - Record and replay frames to reproduce errors during graph capture
| - ``torch._dynamo.config.replay_record_enabled = True`` |
| * - TorchDynamo function name filtering |
| - Only compile functions with the given name to reduce noise when |
| debugging an issue |
     - Set the environment variable ``TORCHDYNAMO_DEBUG_FUNCTION=<name>``
| * - TorchInductor Debug logging |
| - Print general TorchInductor debug info and generated Triton/C++ code |
| - ``torch._inductor.config.debug = True`` |
| * - TorchInductor Tracing |
| - Show time taken in each TorchInductor stage + output code and graph |
| visualization |
     - Set the environment variable ``TORCH_COMPILE_DEBUG=1`` or
       ``torch._inductor.config.trace.enabled = True``
| |
| In addition to info and debug logging, |
| you can use `torch._logging <https://pytorch.org/docs/main/logging.html>`__ |
| for more fine-grained logging. |
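
For example, here is a minimal sketch that combines component-level logging
with individual log artifacts (the ``graph_breaks``, ``guards``, and
``recompiles`` artifact flags shown here are accepted by
``torch._logging.set_logs`` in recent PyTorch releases):

.. code-block:: python

    import logging

    import torch

    # INFO-level logs for the TorchDynamo component, plus specific artifacts:
    # log every graph break, every guard, and each recompilation reason.
    torch._logging.set_logs(
        dynamo=logging.INFO,
        graph_breaks=True,
        guards=True,
        recompiles=True,
    )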
| |
| Diagnosing Runtime Errors |
| ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| At a high level, the TorchDynamo stack consists of a graph capture from |
| Python code (TorchDynamo) and a backend compiler. For example, a |
| backend compiler may consist of backward graph tracing (AOTAutograd) and |
graph lowering (TorchInductor). Errors can occur in any component of
| the stack and will provide full stack traces. |
| |
| To determine in which component an error occurred, |
| you may use info-level logging |
| ``torch._logging.set_logs(dynamo = logging.INFO)`` or ``TORCH_LOGS="dynamo"`` |
| and look for ``Step #: ...`` outputs. Logs are made at the beginning and end of |
| each step, so the step that an error should correspond to is the most recently |
| logged step whose end has not yet been logged. The steps correspond to the |
| following parts of the stack: |
| |
| ==== ================ |
| Step Component |
| ==== ================ |
| 1 TorchDynamo |
| 2 Compiler Backend |
| 3 TorchInductor |
| ==== ================ |
| |
| If info logging is insufficient, you can use available backend |
| options. These options include: |
| |
| - ``"eager"``: only runs TorchDynamo forward graph capture and then |
| runs the captured graph with PyTorch. This provides an indication as |
| to whether TorchDynamo is raising the error. |
| |
| - ``"aot_eager"``: runs TorchDynamo to capture a forward graph, and |
| then AOTAutograd to trace the backward graph without any additional |
| backend compiler steps. PyTorch eager will then be used to run the |
| forward and backward graphs. This is useful to narrow down the issue |
| to AOTAutograd. |
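
As a sketch, you can run the same function under each backend to see where a
failure first appears (``fn`` here is a hypothetical stand-in for your model):

.. code-block:: python

    import torch
    import torch._dynamo

    def fn(x):
        return torch.relu(x) + 1  # stand-in for your model

    # Progressively enable more of the stack: TorchDynamo alone ("eager"),
    # then AOTAutograd ("aot_eager"), then the full backend ("inductor").
    for backend in ("eager", "aot_eager", "inductor"):
        torch._dynamo.reset()  # clear compile caches between attempts
        try:
            torch.compile(fn, backend=backend)(torch.ones(8))
            print(f"{backend}: OK")
        except Exception as e:
            print(f"{backend}: failed with {type(e).__name__}")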
| |
| The general procedure to narrow down an issue is the following: |
| |
| 1. Run your program with the ``"eager"`` backend. If the error no longer |
| occurs, the issue is in the backend compiler that is being used (if |
| using TorchInductor, proceed to step 2. If not, see `this |
| section <#minifying-backend-compiler-errors>`__). If the error still |
| occurs with the ``"eager"`` backend, it is an `error while running |
| torchdynamo <#torchdynamo-errors>`__. |
| |
| 2. This step is only necessary if ``TorchInductor`` is used as the backend |
| compiler. Run the model with the ``"aot_eager"`` backend. If this |
| backend raises an error then the error is occurring during |
| AOTAutograd tracing. If the error no longer occurs with this backend, |
then `the error is in
TorchInductor <#minifying-torchinductor-errors>`__.
| |
Each of these cases is analyzed in the following sections.
| |
| .. note:: The TorchInductor backend consists of |
| both AOTAutograd tracing and the TorchInductor compiler itself. We will |
| disambiguate by referring to ``TorchInductor`` as the backend, and |
| TorchInductor lowering as the phase which lowers the graph traced by |
| AOTAutograd. |
| |
TorchDynamo Errors
| ------------------ |
| |
If the error that is generated occurs with the ``"eager"`` backend, then
TorchDynamo is most likely the source of the error. Here is sample code
that will generate an error.
| |
| .. code-block:: py |
| |
| import torch |
| |
| import torch._dynamo as dynamo |
| |
| |
| def test_assertion_error(): |
| y = torch.ones(200, 200) |
| z = {y: 5} |
| return z |
| |
| compiled_test_assertion_error = torch.compile(test_assertion_error, backend="eager") |
| |
| compiled_test_assertion_error() |
| |
| The code above generates the following error: |
| |
| :: |
| |
| torch._dynamo.convert_frame: [ERROR] WON'T CONVERT test_assertion_error /scratch/mlazos/torchdynamo/../test/errors.py line 26 |
| due to: |
| Traceback (most recent call last): |
| File "/scratch/mlazos/torchdynamo/torchdynamo/symbolic_convert.py", line 837, in BUILD_MAP |
| assert isinstance(k, ConstantVariable) or ( |
| AssertionError |
| |
| from user code: |
| File "/scratch/mlazos/torchdynamo/../test/errors.py", line 34, in test_assertion_error |
| z = {y: 5} |
| |
| Set torch._dynamo.config.verbose=True for more information |
| ========== |
| |
As the message suggests, you can set
``torch._dynamo.config.verbose=True`` to get a full stack trace to both
the error in TorchDynamo and the user code. In addition to this flag,
you can also set the log level of TorchDynamo through
``torch._logging.set_logs(dynamo = logging.INFO)`` or ``TORCH_LOGS="dynamo"``. These levels include:
| |
| - ``logging.DEBUG`` or ``TORCH_LOGS="+dynamo"``: Print every instruction that is |
| encountered in addition to all the log levels listed below. |
| - ``logging.INFO``: |
| Print each function that is compiled (original and modified bytecode) |
| and the graph that is captured in addition to all the log levels listed below. |
| - ``logging.WARNING`` (default): Print graph breaks in addition to all |
| the log levels listed below. |
| - ``logging.ERROR``: Print errors only. |
| |
| If a model is very large, the logs can become overwhelming. If |
| an error occurs deep within a model's Python code, it can be useful to |
| execute only the frame in which the error occurs to enable easier |
| debugging. There are two tools available to enable this: |
| |
- Setting the environment variable ``TORCHDYNAMO_DEBUG_FUNCTION``
  to the desired function name will only run TorchDynamo on functions with that
  name.
| |
| - Enabling the record/replay tool (set ``torch._dynamo.config.replay_record_enabled = True``) |
| which dumps an execution record when an error is encountered. This record can |
| then be replayed to run only the frame where an error occurred. |
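
For example, here is a sketch that applies both tools to the
``test_assertion_error`` sample from above (setting the environment variable
before importing ``torch`` ensures it is picked up):

.. code-block:: python

    import os

    # Only run TorchDynamo on functions with this exact name.
    os.environ["TORCHDYNAMO_DEBUG_FUNCTION"] = "test_assertion_error"

    import torch
    import torch._dynamo

    # Dump an execution record if an error is encountered during graph capture.
    torch._dynamo.config.replay_record_enabled = True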
| |
| Diagnosing TorchInductor Errors |
| ------------------------------- |
| |
| If the error does not occur with the ``"eager"`` backend, then the |
backend compiler is the source of the error (`example
error <https://gist.github.com/mlazos/2f13681e3cc6c43b3911f336327032de>`__).
| There are `different choices <./torch.compiler.rst>`__ |
| for backend compilers for TorchDynamo, with TorchInductor |
| fitting the needs of most users. This section focuses on TorchInductor |
| as the motivating example, but some tools can also be used with other |
| backend compilers. |
| |
With TorchInductor as the chosen backend, AOTAutograd is used to
generate the backward graph from the forward graph captured by
TorchDynamo. It is important to note that errors can occur during this
| tracing and also while TorchInductor lowers the forward and backward |
| graphs to GPU code or C++. A model can often consist of hundreds or |
| thousands of FX nodes, so narrowing the exact nodes where this problem |
| occurred can be very difficult. Fortunately, there are tools available to |
| automatically minify these input graphs to the nodes which are causing |
| the issue. The first step is to determine whether the error occurs |
| during tracing of the backward graph with AOTAutograd or during |
| TorchInductor lowering. As mentioned above in step 2, the |
| ``"aot_eager"`` backend can be used to run only AOTAutograd in isolation |
| without lowering. If the error still occurs with this backend, this |
| indicates that the error is occurring during AOTAutograd tracing. |
| |
| Here is an example: |
| |
| .. code-block:: py |
| |
| import torch |
| |
| import torch._dynamo as dynamo |
| |
| model = torch.nn.Sequential(*[torch.nn.Linear(200, 200) for _ in range(5)]) |
| |
| def test_backend_error(): |
| |
| y = torch.ones(200, 200) |
| x = torch.ones(200, 200) |
| z = x + y |
| a = torch.ops.aten._foobar(z) # dummy function which errors |
| return model(a) |
| |
| |
| compiled_test_backend_error = torch.compile(test_backend_error, backend="inductor") |
| compiled_test_backend_error() |
| |
| Running this should give you this error with a longer stack trace below |
| it: |
| |
| :: |
| |
| Traceback (most recent call last): |
| File "/scratch/mlazos/torchdynamo/torchinductor/graph.py", line 246, in call_function |
| return lowerings[target](*args, **kwargs) |
| File "/scratch/mlazos/torchdynamo/torchinductor/lowering.py", line 185, in wrapped |
| return decomp_fn(*args, **kwargs) |
| File "/scratch/mlazos/torchdynamo/torchinductor/lowering.py", line 810, in _foobar |
| assert False |
| AssertionError |
| ... |
| |
See the `error with full stack
trace <https://gist.github.com/mlazos/d6947854aa56d686800259a164c62100>`__.
| |
| If you then change ``torch.compile(backend="inductor")`` to |
| ``torch.compile(backend="aot_eager")``, it will run without error, because |
| `the |
| issue <https://github.com/pytorch/torchdynamo/blob/d09e50fbee388d466b5252a63045643166006f77/torchinductor/lowering.py#:~:text=%23%20This%20shouldn%27t%20be,assert%20False>`__ |
| is in the TorchInductor lowering process, not in AOTAutograd. |
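
Concretely, the swap looks like this:

.. code-block:: python

    # Stop the stack after AOTAutograd; TorchInductor lowering never runs.
    compiled_test_backend_error = torch.compile(test_backend_error, backend="aot_eager")
    compiled_test_backend_error()  # runs without error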
| |
| Minifying TorchInductor Errors |
| ------------------------------ |
| |
| From here, let’s run the minifier to get a minimal repro. Setting the |
| environment variable ``TORCHDYNAMO_REPRO_AFTER="aot"`` (or setting |
| ``torch._dynamo.config.repro_after="aot"`` directly) will generate a |
| Python program which reduces the graph produced by AOTAutograd to the |
smallest subgraph which reproduces the error. (See below for an example
where we minify the graph produced by TorchDynamo.) Running the program
| with this environment variable should show nearly `identical |
| output <https://gist.github.com/mlazos/0458ab828aa403c779fe73c012aa5982>`__, |
| with an additional line indicating where ``minifier_launcher.py`` has |
| been written to. The output directory is configurable by setting |
| ``torch._dynamo.config.base_dir`` to a valid directory name. The final |
| step is to run the minifier and check that it runs successfully. A |
| successful run looks like |
| `this <https://gist.github.com/mlazos/e6ea41ccce68a7b1b8a7a09acb1b206a>`__. |
If the minifier runs successfully, it generates runnable Python code
which reproduces the exact error. For our example this is the following
code:
| |
| .. code-block:: python |
| |
| import torch |
| from torch import tensor, device |
| import torch.fx as fx |
| from torch._dynamo.testing import rand_strided |
| from math import inf |
| from torch.fx.experimental.proxy_tensor import make_fx |
| |
| # torch version: 1.13.0a0+gitfddfc44 |
| # torch cuda version: 11.6 |
| # torch git version: fddfc4488afb207971c54ad4bf58130fdc8a4dc5 |
| |
| |
| # CUDA Info: |
| # nvcc: NVIDIA (R) Cuda compiler driver |
| # Copyright (c) 2005-2022 NVIDIA Corporation |
| # Built on Thu_Feb_10_18:23:41_PST_2022 |
| # Cuda compilation tools, release 11.6, V11.6.112 |
| # Build cuda_11.6.r11.6/compiler.30978841_0 |
| |
| # GPU Hardware Info: |
| # NVIDIA A100-SXM4-40GB : 8 |
| |
| from torch.nn import * |
| |
| class Repro(torch.nn.Module): |
| def __init__(self): |
| super().__init__() |
| |
| def forward(self, add): |
| _foobar = torch.ops.aten._foobar.default(add); add = None |
| return (_foobar,) |
| |
| args = [((200, 200), (200, 1), torch.float32, 'cpu')] |
| args = [rand_strided(shape, stride, dtype, device) for shape, stride, dtype, device in args] |
| mod = make_fx(Repro())(*args) |
| from torch._inductor.compile_fx import compile_fx_inner |
| |
| compiled = compile_fx_inner(mod, args) |
| compiled(*args) |
| |
| The ``forward`` method of the ``Repro`` module contains the exact op |
| which causes the issue. When filing an issue, please include any |
| minified repros to aid in debugging. |
| |
| Minifying Backend Compiler Errors |
| --------------------------------- |
| |
| With backend compilers other than TorchInductor the process for finding |
| the subgraph causing the error is nearly identical to the procedure in |
`Diagnosing TorchInductor Errors <#diagnosing-torchinductor-errors>`__ with one important
| caveat. Namely, that the minifier will now be run on the graph that is |
| traced by TorchDynamo, not the output graph of AOTAutograd. Let’s walk |
| through an example. |
| |
| .. code-block:: py |
| |
| import torch |
| |
| import torch._dynamo as dynamo |
| |
| model = torch.nn.Sequential(*[torch.nn.Linear(200, 200) for _ in range(5)]) |
| # toy compiler which fails if graph contains relu |
| def toy_compiler(gm: torch.fx.GraphModule, _): |
| for node in gm.graph.nodes: |
| if node.target == torch.relu: |
| assert False |
| |
| return gm |
| |
| |
| def test_backend_error(): |
| y = torch.ones(200, 200) |
| x = torch.ones(200, 200) |
| z = x + y |
| a = torch.relu(z) |
| return model(a) |
| |
| |
| compiled_test_backend_error = torch.compile(test_backend_error, backend=toy_compiler) |
| compiled_test_backend_error() |
| |
| In order to run the code after TorchDynamo has traced the forward graph, |
| you can use the ``TORCHDYNAMO_REPRO_AFTER`` environment variable. Running |
| this program with ``TORCHDYNAMO_REPRO_AFTER="dynamo"`` (or |
| ``torch._dynamo.config.repro_after="dynamo"``) should produce `this |
output <https://gist.github.com/mlazos/244e3d5b53667e44078e194762c0c92b>`__ and
| the following code in ``{torch._dynamo.config.base_dir}/repro.py``. |
| |
.. note:: The other option for ``TORCHDYNAMO_REPRO_AFTER`` is ``"aot"``, which
   will run the minifier after the backward graph has been generated.
| |
| .. code-block:: python |
| |
| import torch |
| import torch._dynamo as dynamo |
| from torch import tensor, device |
| import torch.fx as fx |
| from torch._dynamo.testing import rand_strided |
| from math import inf |
| from torch._dynamo.debug_utils import run_fwd_maybe_bwd |
| |
| from torch.nn import * |
| |
| class Repro(torch.nn.Module): |
| def __init__(self): |
| super().__init__() |
| |
| def forward(self, add): |
| relu = torch.relu(add); add = None |
| return (relu,) |
| |
| |
| mod = Repro().cuda() |
| opt_mod = torch.compile(mod, backend="None") |
| |
| |
| args = [((200, 200), (200, 1), torch.float32, 'cpu', False)] |
| args = [rand_strided(sh, st, dt, dev).requires_grad_(rg) for (sh, st, dt, dev, rg) in args] |
| |
| |
| with torch.cuda.amp.autocast(enabled=False): |
| ref = run_fwd_maybe_bwd(mod, args) |
| res = run_fwd_maybe_bwd(opt_mod, args) |
| |
| The minifier successfully reduced the graph to the op that raises the |
| error in ``toy_compiler``. The other difference from the procedure in |
`Diagnosing TorchInductor Errors <#diagnosing-torchinductor-errors>`__ is that the minifier is
| automatically run after encountering a backend compiler error. After a |
| successful run, the minifier writes ``repro.py`` to |
| ``torch._dynamo.config.base_dir``. |
| |
| Performance Profiling |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| Accessing TorchDynamo Profiler |
| ------------------------------ |
| |
TorchDynamo has a built-in stats function for collecting and displaying
the time spent in each compilation phase. These stats can be accessed by
calling ``torch._dynamo.utils.compile_times()`` after executing code
compiled with TorchDynamo. By default, this returns a string representation
of the compile times spent in each TorchDynamo function by name.
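
A minimal sketch:

.. code-block:: python

    import torch
    import torch._dynamo

    @torch.compile
    def fn(x):
        return torch.relu(x) + 1

    fn(torch.ones(100, 100))  # first call triggers compilation

    # Print a per-function summary of the time spent compiling.
    print(torch._dynamo.utils.compile_times())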
| |
| TorchInductor Debugging using TORCH_COMPILE_DEBUG |
| ------------------------------------------------- |
| |
TorchInductor has a built-in stats and trace function for displaying the time
spent in each compilation phase, output code, output graph visualization,
and IR dump. This is a debugging tool designed to make it easier to
understand and troubleshoot the internals of TorchInductor.
| |
| Let's run an example with the following test program (``repro.py``): |
| |
| :: |
| |
| import torch |
| |
| @torch.compile() |
| def test_model(x): |
| model = torch.nn.Sequential( |
| torch.nn.Linear(10, 10), |
| torch.nn.LayerNorm(10), |
| torch.nn.ReLU(), |
| ) |
| return model(x) |
| |
| |
| y = test_model(torch.ones(10, 10)) |
| |
Setting the environment variable ``TORCH_COMPILE_DEBUG=1`` will cause a
debug trace directory to be created. By default, this directory is placed in
the current directory and named ``torch_compile_debug`` (this can be
overridden with the TorchDynamo configuration field ``debug_dir_root`` or the
environment variable ``TORCH_COMPILE_DEBUG_DIR``). Inside this directory,
each run will have a separate folder named with the timestamp and process ID of the run:
| |
| :: |
| |
| $ env TORCH_COMPILE_DEBUG=1 python repro.py |
| $ cd torch_compile_debug |
| $ ls |
| run_2023_03_01_08_20_52_143510-pid_180167 |
| |
In the run folder there will be a ``torchdynamo`` directory which contains
debug logs, and a ``torchinductor`` folder which contains a subfolder for each
compiled kernel with inductor debug artifacts.
| |
| :: |
| |
    $ cd run_2023_03_01_08_20_52_143510-pid_180167
| $ ls |
| torchinductor torchdynamo |
| |
Moving further into the ``torchinductor`` directory, the ``*.log`` files are
logs from the AOTAutograd phase of compilation, and ``model__0_forward_1.0`` contains
the inductor debug artifacts.
| |
| :: |
| |
| $ cd torchinductor |
| $ ls |
| aot_model___0_debug.log model__0_forward_1.0 |
| $ cd model__0_forward_1.0 |
| $ ls |
| debug.log fx_graph_readable.py fx_graph_runnable.py fx_graph_transformed.py ir_post_fusion.txt ir_pre_fusion.txt output_code.py |
| |
| Here is a summary of the contents: |
| |
| - ``fx_graph_readable.py`` and ``fx_graph_runnable.py`` are the readable and |
| runnable versions of the ``fx_graph`` received by inductor. |
| - ``fx_graph_transformed.py`` is the fx graph after inductor has run all fx passes. |
- ``ir_pre_fusion.txt`` and ``ir_post_fusion.txt`` are the inductor IR before
  and after kernel fusion.
- ``output_code.py`` is the compiled Triton kernel for the subgraph.
| |
| Here are `example debug directory contents |
| <https://gist.github.com/jansel/f4af078791ad681a0d4094adeb844396>`__ |
for the test program above.
| |
| |
| Each file in that debug trace can be enabled and disabled through |
| ``torch._inductor.config.trace.*``. The profile and the diagram are both |
| disabled by default since they are expensive to generate. |
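
For example, a sketch that enables tracing and additionally turns on the
off-by-default graph diagram (generating the diagram assumes ``graphviz`` is
installed):

.. code-block:: python

    import torch._inductor.config as inductor_config

    inductor_config.trace.enabled = True
    # Expensive to generate, hence disabled by default:
    inductor_config.trace.graph_diagram = True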
| |
| A single node in this new debug format looks like: |
| |
| :: |
| |
| buf1: SchedulerNode(ComputedBuffer) |
| buf1.writes = |
| { MemoryDep(name='buf1', index=0, size=()), |
| MemoryDep(name='buf1', index=0, size=(s0,))} |
| buf1.unmet_dependencies = {MemoryDep(name='buf0', index=c0, size=(s0,))} |
| buf1.met_dependencies = {MemoryDep(name='primals_2', index=c0, size=(s0,))} |
| buf1.group.device = cuda:0 |
| buf1.group.iteration = (1, s0) |
| buf1.sizes = ([], [s0]) |
| class buf1_loop_body: |
| var_ranges = {z0: s0} |
| index0 = z0 |
| index1 = 0 |
| def body(self, ops): |
| get_index = self.get_index('index0') |
| load = ops.load('buf0', get_index, False) |
| get_index_1 = self.get_index('index0') |
| load_1 = ops.load('primals_2', get_index_1, False) |
| add = ops.add(load, load_1) |
| get_index_2 = self.get_index('index1') |
| reduction = ops.reduction('buf1', torch.float32, torch.float32, 'sum', get_index_2, add) |
| return reduction |
| |
| See the `example debug directory |
| output <https://gist.github.com/jansel/f4af078791ad681a0d4094adeb844396>`__ |
| for more examples. |
| |
| |
| Graph Breaks |
| ------------ |
| |
| Given a program like this: |
| |
| .. code-block:: python |
| |
| def some_fun(x): |
| ... |
| |
| compiled_fun = torch.compile(some_fun, ...) |
| ... |
| |
TorchDynamo will attempt to compile all of the torch/tensor operations
within ``some_fun`` into a single FX graph, but it may fail to capture
everything into one graph.
| |
Some graph break reasons are insurmountable to TorchDynamo and cannot be
easily fixed. For example, calling into a C extension other than ``torch`` is
invisible to TorchDynamo: the extension could do arbitrary things without
TorchDynamo being able to introduce the necessary guards
(see :ref:`making-dynamo-sound-guards`) to ensure that the compiled program is
safe to reuse. Graph breaks can hinder performance if the resulting fragments
are small. To maximize performance, it's important to have as few graph breaks
as possible.
| |
| Identifying the Cause of a Graph Break |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| To identify all graph breaks in a program and the associated reasons for |
| the breaks, ``torch._dynamo.explain`` can be used. This tool runs |
| TorchDynamo on the supplied function and aggregates the graph breaks |
| that are encountered. Here is an example usage: |
| |
| .. code-block:: python |
| |
| import torch |
| import torch._dynamo as dynamo |
| def toy_example(a, b): |
| x = a / (torch.abs(a) + 1) |
| print("woo") |
| if b.sum() < 0: |
| b = b * -1 |
| return x * b |
    explanation = dynamo.explain(toy_example)(torch.randn(10), torch.randn(10))
    print(explanation)
| """ |
| Graph Count: 3 |
| Graph Break Count: 2 |
| Op Count: 5 |
| Break Reasons: |
| Break Reason 1: |
| Reason: builtin: print [<class 'torch._dynamo.variables.constant.ConstantVariable'>] False |
| User Stack: |
| <FrameSummary file foo.py, line 5 in toy_example> |
| Break Reason 2: |
| Reason: generic_jump TensorVariable() |
| User Stack: |
| <FrameSummary file foo.py, line 6 in torch_dynamo_resume_in_toy_example_at_5> |
| Ops per Graph: |
| ... |
| Out Guards: |
| ... |
| """ |
| |
| Outputs include: |
| |
| - ``out_guards`` - a list of lists where each sublist contains the guards that must pass to ensure the traced graphs are valid. |
| - ``graphs`` - a list of graph modules which were successfully traced. |
| - ``ops_per_graph`` - a list of lists where each sublist contains the ops that are run in the graph. |
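
These are available as attributes on the object returned by
``torch._dynamo.explain``. Continuing the example above, a sketch of
inspecting them programmatically (attribute names as found in recent
PyTorch releases):

.. code-block:: python

    explanation = dynamo.explain(toy_example)(torch.randn(10), torch.randn(10))

    print(explanation.graph_count)
    print(explanation.graph_break_count)
    for reason in explanation.break_reasons:
        print(reason)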
| |
| To throw an error on the first graph break encountered, use the ``fullgraph`` |
| mode. This mode disables TorchDynamo’s Python fallback, and only |
| succeeds if the entire program is convertible into a single graph. Example |
| usage: |
| |
| .. code-block:: python |
| |
| def toy_example(a, b): |
| ... |
| |
    compiled_toy = torch.compile(toy_example, fullgraph=True, backend=<compiler>)
    compiled_toy(a, b)
| |
| Excessive Recompilation |
| ----------------------- |
| |
| When TorchDynamo compiles a function (or part of one), it makes certain |
| assumptions about locals and globals in order to allow compiler |
| optimizations, and expresses these assumptions as guards that check |
| particular values at runtime. If any of these guards fail, Dynamo will |
| recompile that function (or part) up to |
| ``torch._dynamo.config.cache_size_limit`` times. If your program is |
| hitting the cache limit, you will first need to determine which guard is |
| failing and what part of your program is triggering it. |
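
To see which guard fails on each recompilation, you can enable the
``recompiles`` log artifact. A sketch (``fn`` is a hypothetical example):

.. code-block:: python

    import torch

    # Log the failing guard every time a recompilation happens.
    torch._logging.set_logs(recompiles=True)

    @torch.compile
    def fn(x):
        return x * 2

    fn(torch.ones(10))
    fn(torch.ones(10, 10))  # different rank fails a shape guard -> recompile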
| |
| If your program exhibits a bounded amount of dynamism, you may be able |
| to tune the TorchDynamo cache limit to allow for each variation to be |
| compiled and cached, but if the cache limit is too high you may find the |
| cost of recompilation outweighs any optimization benefits. |
| |
| :: |
| |
| torch._dynamo.config.cache_size_limit = <your desired cache limit> |
| |
| TorchDynamo plans to support many common cases of dynamic tensor shapes, |
| such as varying batch size or sequence length. It does not plan to |
| support rank-dynamism. In the meantime, setting a specific cache limit |
| can be used in coordination with bucketing techniques to achieve an |
| acceptable number of recompilations for some dynamic models. |
| |
| Accuracy Debugging |
| ~~~~~~~~~~~~~~~~~~ |
| |
Accuracy issues can also be minified if you set the environment variable
``TORCHDYNAMO_REPRO_LEVEL=4``; the minifier then operates with a model similar
to git bisect. A full repro might be something like
``TORCHDYNAMO_REPRO_AFTER="aot" TORCHDYNAMO_REPRO_LEVEL=4``. This is needed
because downstream compilers generate code, whether it's Triton code or the
C++ backend, and the numerics from those downstream compilers can differ in
subtle ways yet have a dramatic impact on your training stability. The
accuracy debugger is thus very useful for detecting bugs in our codegen or
with a backend compiler.
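
For example, assuming your script is saved as ``repro.py``:

::

    TORCHDYNAMO_REPRO_AFTER="aot" TORCHDYNAMO_REPRO_LEVEL=4 python repro.py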
| |
If you'd like to ensure that random number generation is the same across both
torch and Triton, you can enable ``torch._inductor.config.fallback_random = True``.
| |
| Extended Debugging |
| ~~~~~~~~~~~~~~~~~~ |
| |
Extended debugging can be enabled by using the following experimental flags:

- ``TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED`` - provides extended debug information if the
  string representation of a guard matches this flag value. For example, set it to
  ``"Ne(s0, 10)"`` to generate a full Python and C++ backtrace whenever this guard is issued.
- ``TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`` - provides extended debug information when
  a particular symbol is allocated. For example, set it to ``"u2"`` to generate a full Python
  and C++ backtrace whenever this symbol is created.
- ``TORCHDYNAMO_EXTENDED_DEBUG_CPP`` - provides extended debug information (C++ backtrace)
  for all extended debug settings as well as errors. For example, set it to ``"1"``. The C++
  backtrace is slow and very spammy, so it is not included by default with extended debugging.
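
For example, reusing the guard string from above (assuming your script is
saved as ``repro.py``):

::

    TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Ne(s0, 10)" TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 python repro.py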
| |
| Cold Start Timing and Cache Corruption Debugging |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
In order to measure cold start compilation time or to debug a cache corruption,
you can pass ``TORCHINDUCTOR_FORCE_DISABLE_CACHES=1`` or set
``torch._inductor.config.force_disable_caches = True``, which will override any
other caching config option and disable all compile-time caching.
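
A minimal sketch of the in-process variant:

.. code-block:: python

    import torch
    import torch._inductor.config

    # Overrides any other caching config option; every run compiles from scratch.
    torch._inductor.config.force_disable_caches = True

    @torch.compile
    def fn(x):
        return x + 1

    fn(torch.ones(8))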