.. role:: hidden
    :class: hidden-section

Tensor Parallelism - torch.distributed.tensor.parallel
=======================================================

Tensor Parallelism (TP) is built on top of the PyTorch DistributedTensor
(`DTensor <https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md>`__)
and provides different parallelism styles: Colwise, Rowwise, and Sequence Parallelism.

.. warning::
    Tensor Parallelism APIs are experimental and subject to change.

The entrypoint to parallelize your ``nn.Module`` using Tensor Parallelism is:

.. automodule:: torch.distributed.tensor.parallel

.. currentmodule:: torch.distributed.tensor.parallel

.. autofunction:: parallelize_module
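
As a minimal sketch of the entrypoint, assume the script is launched with
``torchrun`` so that a default process group is available; the two-layer MLP,
mesh size, and submodule names below are illustrative only:

.. code-block:: python

    import torch
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    class MLP(nn.Module):
        def __init__(self, dim: int = 16):
            super().__init__()
            self.w1 = nn.Linear(dim, 4 * dim)
            self.w2 = nn.Linear(4 * dim, dim)

        def forward(self, x):
            return self.w2(torch.relu(self.w1(x)))

    # Build a 1-D device mesh over 8 ranks; TP shards parameters across it.
    # (Each rank is assumed to use its own CUDA device, e.g. via torch.cuda.set_device.)
    tp_mesh = init_device_mesh("cuda", (8,))

    # Shard w1 column-wise and w2 row-wise; the keys are submodule FQNs of MLP.
    model = parallelize_module(
        MLP(),
        tp_mesh,
        {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
    )

    out = model(torch.randn(4, 16, device="cuda"))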

Tensor Parallelism supports the following parallel styles:

.. autoclass:: torch.distributed.tensor.parallel.ColwiseParallel
    :members:
    :undoc-members:

.. autoclass:: torch.distributed.tensor.parallel.RowwiseParallel
    :members:
    :undoc-members:

.. autoclass:: torch.distributed.tensor.parallel.SequenceParallel
    :members:
    :undoc-members:
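
The styles above can be combined in a single plan. For example, a sketch of a
plan fragment (with hypothetical submodule names) that applies
``SequenceParallel`` to a norm layer and transitions back to tensor
parallelism for the attention projections; the dict would be passed as the
``parallelize_plan`` to ``parallelize_module`` together with a device mesh:

.. code-block:: python

    from torch.distributed._tensor import Shard
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        SequenceParallel,
    )

    plan = {
        # Replicate the norm's parameters and run it on activations that are
        # sharded on the sequence dimension (Shard(1)).
        "attention_norm": SequenceParallel(),
        # The projection input arrives sharded on the sequence dimension, so
        # declare that layout; ColwiseParallel redistributes it as needed.
        "attention.wq": ColwiseParallel(input_layouts=Shard(1)),
        # Produce output sharded on the sequence dimension again so the next
        # sequence-parallel layer can consume it directly.
        "attention.wo": RowwiseParallel(output_layouts=Shard(1)),
    }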

To simply configure the ``nn.Module``'s inputs and outputs with DTensor layouts
and perform necessary layout redistributions, without distributing the module
parameters to DTensors, the following ``ParallelStyle``\ s can be used in
the ``parallelize_plan`` when calling ``parallelize_module``:

.. autoclass:: torch.distributed.tensor.parallel.PrepareModuleInput
    :members:
    :undoc-members:

.. autoclass:: torch.distributed.tensor.parallel.PrepareModuleOutput
    :members:
    :undoc-members:
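
For instance, a sketch of a plan (with hypothetical submodule names) that only
annotates and redistributes activations at module boundaries, leaving the
parameters of those modules as regular tensors:

.. code-block:: python

    from torch.distributed._tensor import Replicate, Shard
    from torch.distributed.tensor.parallel import (
        PrepareModuleInput,
        PrepareModuleOutput,
    )

    plan = {
        # Declare that the incoming activation is sharded on the sequence
        # dimension, convert it to a DTensor with that layout, and
        # redistribute it to be replicated before the module runs.
        "attention": PrepareModuleInput(
            input_layouts=(Shard(1),),
            desired_input_layouts=(Replicate(),),
        ),
        # Re-shard the (replicated) module output on the sequence dimension
        # for the following layers.
        "mlp": PrepareModuleOutput(
            output_layouts=(Replicate(),),
            desired_output_layouts=(Shard(1),),
        ),
    }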

.. note:: When using ``Shard(dim)`` as the input/output layouts for the above
    ``ParallelStyle``\ s, we assume the input/output activation tensors are evenly sharded on
    the tensor dimension ``dim`` on the ``DeviceMesh`` that TP operates on. For instance,
    since ``RowwiseParallel`` accepts input that is sharded on the last dimension, it assumes
    the input tensor has already been evenly sharded on the last dimension. For unevenly
    sharded activation tensors, one can pass DTensor directly to the partitioned modules
    and use ``use_local_output=False`` so that each ``ParallelStyle`` returns a DTensor,
    which can track the uneven sharding information.
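
For example, a plan entry that opts out of converting the output back to a
local ``torch.Tensor`` (the submodule name is hypothetical):

.. code-block:: python

    from torch.distributed.tensor.parallel import RowwiseParallel

    plan = {
        # Keep the output as a DTensor so that (possibly uneven) sharding
        # information is preserved for downstream modules.
        "feed_forward.w2": RowwiseParallel(use_local_output=False),
    }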

For models like the Transformer, we recommend using ``ColwiseParallel``
and ``RowwiseParallel`` together in the ``parallelize_plan`` to achieve the desired
sharding for the entire model (i.e. the Attention and MLP layers), as sketched below.
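
A sketch of such a plan for one Transformer block, with hypothetical submodule
names for the attention and feed-forward projections:

.. code-block:: python

    from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

    # Column-wise projections feed into a final row-wise projection, so each
    # sub-block needs only a single all-reduce in the forward pass.
    plan = {
        "attention.wq": ColwiseParallel(),
        "attention.wk": ColwiseParallel(),
        "attention.wv": ColwiseParallel(),
        "attention.wo": RowwiseParallel(),
        "feed_forward.w1": ColwiseParallel(),
        "feed_forward.w2": RowwiseParallel(),
        "feed_forward.w3": ColwiseParallel(),
    }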

Parallelized cross-entropy loss computation (loss parallelism) is supported via the following context manager:

.. autofunction:: torch.distributed.tensor.parallel.loss_parallel

.. warning::
    The ``loss_parallel`` API is experimental and subject to change.
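
A minimal sketch of using ``loss_parallel``, assuming the logits are already a
DTensor sharded on the class (last) dimension, e.g. as produced by applying
``ColwiseParallel(output_layouts=Shard(-1), use_local_output=False)`` to the
final output projection; the shapes and mesh size below are illustrative:

.. code-block:: python

    import torch
    import torch.nn.functional as F
    from torch.distributed._tensor import Shard, distribute_tensor
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import loss_parallel

    mesh = init_device_mesh("cuda", (8,))

    # Illustrative stand-in for model output: logits as a DTensor sharded on
    # the class (last) dimension, plus replicated integer class labels.
    logits = distribute_tensor(
        torch.randn(4, 1024, requires_grad=True), mesh, placements=[Shard(-1)]
    )
    labels = torch.randint(0, 1024, (4,), device="cuda")

    with loss_parallel():
        # Both the loss computation and its backward pass must run under the
        # context manager so the sharded cross-entropy implementation is used.
        loss = F.cross_entropy(logits, labels)
        loss.backward()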