| .. _pipeline-parallelism: |
| |
| Pipeline Parallelism |
| ==================== |
| |
Pipeline parallelism was originally introduced in the
`GPipe <https://arxiv.org/abs/1811.06965>`__ paper and is an efficient
technique for training large models on multiple GPUs.
| |
.. warning::
   Pipeline Parallelism is experimental and subject to change.
| |
| Model Parallelism using multiple GPUs |
| ------------------------------------- |
| |
Typically, for large models that don't fit on a single GPU, model parallelism
is employed, where certain parts of the model are placed on different GPUs.
However, if this is done naively for sequential models, the training process
suffers from GPU underutilization since only one GPU is active at a time, as
shown in the figure below:
| |
| .. figure:: _static/img/pipeline_parallelism/no_pipe.png |
| |
   The figure represents a model with 4 layers placed on 4 different GPUs
   (vertical axis). The horizontal axis represents training this model through
   time, demonstrating that only 1 GPU is utilized at a time
   (`image source <https://arxiv.org/abs/1811.06965>`__).
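
As a concrete illustration, naive model parallelism for a sequential model
amounts to placing each stage on its own device and moving activations between
them. The following is a minimal, hypothetical two-GPU sketch (not the API
described on this page)::

    import torch
    import torch.nn as nn

    class NaiveModelParallel(nn.Module):
        """A toy model split across two GPUs: while one stage computes,
        the other sits idle, producing the utilization gap shown above."""

        def __init__(self):
            super().__init__()
            self.stage1 = nn.Sequential(nn.Linear(16, 8), nn.ReLU()).to('cuda:0')
            self.stage2 = nn.Sequential(nn.Linear(8, 4), nn.ReLU()).to('cuda:1')

        def forward(self, x):
            x = self.stage1(x.to('cuda:0'))
            # The activation is copied from GPU 0 to GPU 1 here; GPU 0 then
            # idles until the next minibatch arrives.
            return self.stage2(x.to('cuda:1'))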
| |
| Pipelined Execution |
| ------------------- |
| |
| To alleviate this problem, pipeline parallelism splits the input minibatch into |
| multiple microbatches and pipelines the execution of these microbatches across |
| multiple GPUs. This is outlined in the figure below: |
| |
| .. figure:: _static/img/pipeline_parallelism/pipe.png |
| |
   The figure represents a model with 4 layers placed on 4 different GPUs
   (vertical axis). The horizontal axis represents training this model through
   time, demonstrating that the GPUs are utilized much more efficiently.
   However, there still exists a bubble (as demonstrated in the figure) where
   certain GPUs are not utilized.
   (`image source <https://arxiv.org/abs/1811.06965>`__).
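
Conceptually, splitting a minibatch into microbatches is just chunking the
input along the batch dimension; the ``chunks`` argument of
:class:`~torch.distributed.pipeline.sync.Pipe` below controls how many pieces
are used. A minimal sketch::

    import torch

    minibatch = torch.rand(64, 16)     # one minibatch of 64 samples
    microbatches = minibatch.chunk(8)  # 8 microbatches of 8 samples each

    # The pipeline feeds the microbatches through the stages one after
    # another, so stage k can process microbatch i + 1 while stage k + 1
    # processes microbatch i, which shrinks the idle bubble.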
| |
| Pipe APIs in PyTorch |
| -------------------- |
| .. autoclass:: torch.distributed.pipeline.sync.Pipe |
| :members: forward |
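
As a quick illustration of the API, the snippet below wraps a two-layer model
spanning two GPUs in ``Pipe``. Note that ``Pipe`` relies on the RPC framework
being initialized, even in a single-process setup; this sketch assumes two
available GPUs::

    import os
    import torch
    import torch.nn as nn
    import torch.distributed.rpc as rpc
    from torch.distributed.pipeline.sync import Pipe

    # Pipe requires the RPC framework to be initialized first.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    rpc.init_rpc('worker', rank=0, world_size=1)

    # Place each stage of the model on its own GPU.
    fc1 = nn.Linear(16, 8).cuda(0)
    fc2 = nn.Linear(8, 4).cuda(1)
    model = Pipe(nn.Sequential(fc1, fc2), chunks=8)

    # The input is split into 8 microbatches that are pipelined across the
    # GPUs; forward() returns an RRef to the output on the last device.
    input = torch.rand(16, 16).cuda(0)
    output = model(input).local_value()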
| |
| Skip connections |
| ^^^^^^^^^^^^^^^^ |
| |
Certain models like ResNeXt are not completely sequential and have skip
connections between layers. Naively implementing such models as part of
pipeline parallelism would imply that we need to copy the outputs of certain
layers through multiple GPUs until we eventually reach the GPU where the layer
for the skip connection resides. To avoid this copy overhead, we provide the
APIs below to stash and pop tensors in different layers of the model.
| |
| .. autofunction:: torch.distributed.pipeline.sync.skip.skippable.skippable |
| .. autoclass:: torch.distributed.pipeline.sync.skip.skippable.stash |
| .. autoclass:: torch.distributed.pipeline.sync.skip.skippable.pop |
| .. autofunction:: torch.distributed.pipeline.sync.skip.skippable.verify_skippables |
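
For example, a pair of skippable modules that stash a tensor on one GPU and
pop it on another could look like the following sketch (hypothetical module
names, following the documented ``@skippable`` generator pattern)::

    import torch
    from torch import nn
    from torch.distributed.pipeline.sync.skip import pop, skippable, stash

    @skippable(stash=['skip'])
    class StashLayer(nn.Module):
        def forward(self, input):
            # Stash the input so a later layer can consume it directly,
            # without relaying it through the intermediate GPUs.
            yield stash('skip', input)
            return input

    @skippable(pop=['skip'])
    class PopLayer(nn.Module):
        def forward(self, input):
            # Retrieve the stashed tensor and apply the skip connection.
            skip = yield pop('skip')
            return input + skip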
| |
| Tutorials |
| --------- |
| |
The following tutorials give a good overview of how to use the
:class:`~torch.distributed.pipeline.sync.Pipe` API to train your models
alongside the other components that PyTorch provides:
| |
| - `Training Transformer models using Pipeline Parallelism <https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html>`__ |
| - `Training Transformer models using Distributed Data Parallel and Pipeline Parallelism <https://pytorch.org/tutorials/advanced/ddp_pipeline.html>`__ |
| |
| Acknowledgements |
| ---------------- |
| |
| The implementation for pipeline parallelism is based on `fairscale's pipe implementation <https://github.com/facebookresearch/fairscale/tree/main/fairscale/nn/pipe>`__ and |
| `torchgpipe <https://github.com/kakaobrain/torchgpipe>`__. We would like to |
| thank both teams for their contributions and guidance towards bringing pipeline |
| parallelism into PyTorch. |