Frequently Asked Questions
==========================

My model reports "cuda runtime error(2): out of memory"
---------------------------------------------------------

As the error message suggests, you have run out of memory on your
GPU. Since we often deal with large amounts of data in PyTorch,
small mistakes can rapidly cause your program to use up all of your
GPU memory; fortunately, the fixes in these cases are often simple.
Here are a few common things to check:

**Don't accumulate history across your training loop.**
By default, computations involving variables that require gradients
will keep history. This means that you should avoid using such
variables in computations which will live beyond your training loops,
e.g., when tracking statistics. Instead, you should detach the variable
or access its underlying data.

Sometimes, it can be non-obvious when differentiable variables can
occur. Consider the following training loop (abridged from `source
<https://discuss.pytorch.org/t/high-memory-usage-while-training/162>`_):

.. code-block:: python

    total_loss = 0
    for i in range(10000):
        optimizer.zero_grad()
        output = model(input)
        loss = criterion(output)
        loss.backward()
        optimizer.step()
        total_loss += loss

Here, ``total_loss`` is accumulating history across your training loop, since
``loss`` is a differentiable variable with autograd history. You can fix this by
writing ``total_loss += float(loss)`` instead.
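
The same applies to any other statistic you track across iterations: convert
it to a plain Python number, or detach it from the graph, before accumulating.
A minimal sketch in the same fragment style as above (``accuracy`` is a
hypothetical metric that returns a tensor):

.. code-block:: python

    total_loss += loss.item()                 # plain Python float, no autograd history
    running_acc += accuracy(output).detach()  # tensor detached from the graph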

Other instances of this problem:
`1 <https://discuss.pytorch.org/t/resolved-gpu-out-of-memory-error-with-batch-size-1/3719>`_.

**Don't hold onto tensors and variables you don't need.**
If you assign a Tensor or Variable to a local, Python will not
deallocate it until the local goes out of scope. You can free
this reference by using ``del x``. Similarly, if you assign
a Tensor or Variable to a member variable of an object, it will
not be deallocated until the object goes out of scope. You will
get the best memory usage if you don't hold onto temporaries
you don't need.

The scopes of locals can be larger than you expect. For example:

.. code-block:: python

    for i in range(5):
        intermediate = f(input[i])
        result += g(intermediate)
    output = h(result)
    return output

Here, ``intermediate`` remains live even while ``h`` is executing,
because its scope extends past the end of the loop. To free it
earlier, you should ``del intermediate`` when you are done with it.
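
For example, the same fragment with the reference dropped explicitly
(``f``, ``g``, ``h``, ``input`` and ``result`` are the same placeholders as above):

.. code-block:: python

    for i in range(5):
        intermediate = f(input[i])
        result += g(intermediate)
    del intermediate  # drop the last iteration's tensor before running h
    output = h(result)
    return output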

**Avoid running RNNs on sequences that are too large.**
The amount of memory required to backpropagate through an RNN scales
linearly with the length of the RNN input; thus, you will run out of memory
if you try to feed an RNN a sequence that is too long.

The technical term for this phenomenon is `backpropagation through time
<https://en.wikipedia.org/wiki/Backpropagation_through_time>`_,
and there are plenty of references for how to implement truncated
BPTT, including in the `word language model <https://github.com/pytorch/examples/tree/master/word_language_model>`_ example; truncation is handled by the
``repackage_hidden`` function as described in
`this forum post <https://discuss.pytorch.org/t/help-clarifying-repackage-hidden-in-word-language-model/226>`_.
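
The core of such a helper is just detaching the hidden state between chunks;
a minimal sketch along the lines of the ``repackage_hidden`` function used in
that example:

.. code-block:: python

    import torch

    def repackage_hidden(h):
        """Wrap hidden states in new tensors detached from their history,
        so backpropagation through time stops at the current chunk."""
        if isinstance(h, torch.Tensor):
            return h.detach()
        return tuple(repackage_hidden(v) for v in h)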

**Don't use linear layers that are too large.**
A linear layer ``nn.Linear(m, n)`` uses :math:`O(nm)` memory: that is to say,
the memory requirements of the weights scale quadratically with the number
of features. It is very easy to
`blow through your memory <https://github.com/pytorch/pytorch/issues/958>`_
this way (and remember that you will need at least twice the size of the
weights, since you also need to store the gradients).
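
As a back-of-the-envelope check (a sketch counting ``float32`` weights only,
before gradients and optimizer state):

.. code-block:: python

    import torch.nn as nn

    layer = nn.Linear(10000, 10000)
    n_params = sum(p.numel() for p in layer.parameters())
    # ~1e8 parameters * 4 bytes each is roughly 400 MB for the weights alone,
    # and about the same again for their gradients.
    print(n_params, n_params * 4 / 1024**2, "MiB")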

**Consider checkpointing.**
You can trade off memory for compute by using `checkpoint <https://pytorch.org/docs/stable/checkpoint.html>`_.
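
For instance, a minimal sketch that recomputes the activations of a
(hypothetical) ``self.block`` submodule during the backward pass instead of
storing them:

.. code-block:: python

    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                                       nn.Linear(1024, 1024), nn.ReLU())
            self.head = nn.Linear(1024, 10)

        def forward(self, x):
            # Activations inside self.block are not kept; they are recomputed
            # when backward() reaches this part of the graph.
            x = checkpoint(self.block, x)
            return self.head(x)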

My GPU memory isn't freed properly
----------------------------------
PyTorch uses a caching memory allocator to speed up memory allocations. As a
result, the values shown in ``nvidia-smi`` usually don't reflect the true
memory usage. See :ref:`cuda-memory-management` for more details about GPU
memory management.
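
To see what the caching allocator is actually holding, you can query it
directly; a minimal sketch (run on a machine with a GPU):

.. code-block:: python

    import torch

    x = torch.randn(1024, 1024, device="cuda")
    print(torch.cuda.memory_allocated())  # bytes currently occupied by live tensors
    print(torch.cuda.memory_reserved())   # bytes held in the caching allocator's pool
    del x
    torch.cuda.empty_cache()              # return unused cached blocks to the driver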

If your GPU memory isn't freed even after Python quits, it is very likely that
some Python subprocesses are still alive. You may find them via
``ps -elf | grep python`` and manually kill them with ``kill -9 [pid]``.

My out of memory exception handler can't allocate memory
----------------------------------------------------------
You may have some code that tries to recover from out of memory errors.

.. code-block:: python

    try:
        run_model(batch_size)
    except RuntimeError:  # Out of memory
        for _ in range(batch_size):
            run_model(1)

But when you do run out of memory, you may find that your recovery code
cannot allocate memory either. That is because the Python exception object
holds a reference to the stack frame where the error was raised, which
prevents the original tensor objects from being freed. The solution is to
move your OOM recovery code outside of the ``except`` clause.

.. code-block:: python

    oom = False
    try:
        run_model(batch_size)
    except RuntimeError:  # Out of memory
        oom = True

    if oom:
        for _ in range(batch_size):
            run_model(1)


.. _dataloader-workers-random-seed:

My data loader workers return identical random numbers
--------------------------------------------------------
You are likely using other libraries to generate random numbers in the dataset,
and worker subprocesses started via ``fork`` inherit the parent's random number
generator state. See :class:`torch.utils.data.DataLoader`'s documentation for
how to properly set up random seeds in workers with its :attr:`worker_init_fn`
option.
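
For example, a sketch that reseeds NumPy and ``random`` in each worker from the
per-worker PyTorch seed (``dataset`` stands in for your own dataset):

.. code-block:: python

    import random

    import numpy as np
    import torch
    from torch.utils.data import DataLoader

    def seed_worker(worker_id):
        # torch.initial_seed() is already different in every worker; reuse it
        # to seed the other libraries' generators.
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)
        random.seed(worker_seed)

    loader = DataLoader(dataset, batch_size=32, num_workers=4,
                        worker_init_fn=seed_worker)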

.. _pack-rnn-unpack-with-data-parallelism:

My recurrent network doesn't work with data parallelism
---------------------------------------------------------
There is a subtlety in using the
``pack sequence -> recurrent network -> unpack sequence`` pattern in a
:class:`~torch.nn.Module` with :class:`~torch.nn.DataParallel` or
:func:`~torch.nn.parallel.data_parallel`. The input to :meth:`forward` on each
device will only be a part of the entire input. Because the unpack operation
:func:`torch.nn.utils.rnn.pad_packed_sequence` by default only pads up to the
longest input it sees, i.e., the longest on that particular device, size
mismatches will happen when the results are gathered together. Therefore, you can
instead take advantage of the :attr:`total_length` argument of
:func:`~torch.nn.utils.rnn.pad_packed_sequence` to make sure that the
:meth:`forward` calls return sequences of the same length. For example, you can
write::

    from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

    class MyModule(nn.Module):
        # ... __init__, other methods, etc.

        # padded_input is of shape [B x T x *] (batch_first mode) and contains
        # the sequences sorted by lengths
        #   B is the batch size
        #   T is max sequence length
        def forward(self, padded_input, input_lengths):
            total_length = padded_input.size(1)  # get the max sequence length
            packed_input = pack_padded_sequence(padded_input, input_lengths,
                                                batch_first=True)
            packed_output, _ = self.my_lstm(packed_input)
            output, _ = pad_packed_sequence(packed_output, batch_first=True,
                                            total_length=total_length)
            return output


    m = MyModule().cuda()
    dp_m = nn.DataParallel(m)


Additionally, extra care needs to be taken when the batch dimension is dim ``1``
(i.e., ``batch_first=False``) with data parallelism. In this case, the first
argument of ``pack_padded_sequence``, ``padded_input``, will be of shape
``[T x B x *]`` and should be scattered along dim ``1``, but the second argument
``input_lengths`` will be of shape ``[B]`` and should be scattered along dim
``0``. Extra code to manipulate the tensor shapes will be needed.
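
One possible workaround (a sketch, not the only option) is to keep the module's
interface batch-first, so that :class:`~torch.nn.DataParallel` scatters both
arguments along dim ``0``, and transpose inside :meth:`forward`::

    def forward(self, padded_input, input_lengths):
        # padded_input arrives as [B x T x *] so that DataParallel scatters
        # both arguments along dim 0; convert to [T x B x *] internally.
        padded_input = padded_input.transpose(0, 1)
        total_length = padded_input.size(0)
        packed_input = pack_padded_sequence(padded_input, input_lengths,
                                            batch_first=False)
        packed_output, _ = self.my_lstm(packed_input)
        output, _ = pad_packed_sequence(packed_output, batch_first=False,
                                        total_length=total_length)
        return output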