.. role:: hidden
:class: hidden-section
Distributed Checkpoint - torch.distributed.checkpoint
=====================================================
Distributed Checkpoint (DCP) supports loading and saving models from multiple ranks in parallel.
It handles load-time resharding, which enables saving in one cluster topology and loading into another.
DCP is different from `torch.save` and `torch.load` in a few significant ways:
* It produces multiple files per checkpoint, with at least one per rank.
* It operates in place, meaning that the model should allocate its data first and DCP uses that storage instead.
The entrypoints to load and save a checkpoint are the following (a minimal usage sketch appears after the listing):
.. automodule:: torch.distributed.checkpoint
.. currentmodule:: torch.distributed.checkpoint
.. autofunction:: load
.. autofunction:: save
.. autofunction:: load_state_dict
.. autofunction:: save_state_dict
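
For example, a minimal save/load round trip might look like the sketch below. The checkpoint path is illustrative, and in a real job every rank of the process group would make the same calls:

.. code-block:: python

    import torch
    import torch.distributed.checkpoint as dcp

    model = torch.nn.Linear(8, 8)

    # Save: each rank contributes its portion of the state_dict.
    # ``checkpoint_id`` is a filesystem path here (illustrative).
    dcp.save({"model": model.state_dict()}, checkpoint_id="/tmp/checkpoint")

    # Load operates in place: allocate the state_dict first, and DCP
    # reads the checkpoint directly into that storage.
    state_dict = {"model": model.state_dict()}
    dcp.load(state_dict, checkpoint_id="/tmp/checkpoint")
    model.load_state_dict(state_dict["model"])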
In addition to the above entrypoints, `Stateful` objects, as described below, provide additional customization during saving/loading.
.. automodule:: torch.distributed.checkpoint.stateful
.. autoclass:: torch.distributed.checkpoint.stateful.Stateful
:members:
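
Any object exposing ``state_dict``/``load_state_dict`` methods satisfies this protocol, which lets application state beyond the model take part in checkpointing. The class below is a hypothetical sketch, not a library type:

.. code-block:: python

    # Hypothetical application state conforming to the Stateful protocol.
    # DCP calls state_dict() during save and load_state_dict() during
    # load, so the object controls its own (de)serialization.
    class TrainState:
        def __init__(self) -> None:
            self.step = 0

        def state_dict(self) -> dict:
            return {"step": self.step}

        def load_state_dict(self, state_dict: dict) -> None:
            self.step = state_dict["step"]

Such an object can be placed in the dictionary passed to `save` and `load` alongside the model state, and DCP will invoke its methods at the appropriate time.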
This `example <https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py>`_ shows how to use PyTorch Distributed Checkpoint to save an FSDP model.
The following types define the IO interface used during checkpointing:
.. autoclass:: torch.distributed.checkpoint.StorageReader
:members:
.. autoclass:: torch.distributed.checkpoint.StorageWriter
:members:
The following types define the planner interface used during checkpointing:
.. autoclass:: torch.distributed.checkpoint.LoadPlanner
:members:
.. autoclass:: torch.distributed.checkpoint.LoadPlan
:members:
.. autoclass:: torch.distributed.checkpoint.ReadItem
:members:
.. autoclass:: torch.distributed.checkpoint.SavePlanner
:members:
.. autoclass:: torch.distributed.checkpoint.SavePlan
:members:
.. autoclass:: torch.distributed.checkpoint.planner.WriteItem
:members:
We provide a filesystem-based storage layer:
.. autoclass:: torch.distributed.checkpoint.FileSystemReader
:members:
.. autoclass:: torch.distributed.checkpoint.FileSystemWriter
:members:
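
The storage layer can also be selected explicitly instead of being inferred from a checkpoint path; the directory below is illustrative:

.. code-block:: python

    import torch
    import torch.distributed.checkpoint as dcp

    model = torch.nn.Linear(8, 8)
    state_dict = {"model": model.state_dict()}

    # Write the checkpoint through the filesystem storage layer ...
    dcp.save(state_dict, storage_writer=dcp.FileSystemWriter("/tmp/checkpoint"))

    # ... and read it back in place through the matching reader.
    dcp.load(state_dict, storage_reader=dcp.FileSystemReader("/tmp/checkpoint"))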
We provide default implementations of `LoadPlanner` and `SavePlanner` that
can handle all torch.distributed constructs such as FSDP, DDP, ShardedTensor, and DistributedTensor.
.. autoclass:: torch.distributed.checkpoint.DefaultSavePlanner
:members:
.. autoclass:: torch.distributed.checkpoint.DefaultLoadPlanner
:members:
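
The defaults can also serve as a base for light customization. As a hypothetical sketch (assuming only the documented `LoadPlanner` interface), the subclass below adds logging around local plan creation and delegates all real planning to the default:

.. code-block:: python

    import torch.distributed.checkpoint as dcp

    class VerboseLoadPlanner(dcp.DefaultLoadPlanner):
        def create_local_plan(self):
            # Let the default planner do the actual planning, then
            # report how many read items this rank will execute.
            plan = super().create_local_plan()
            print(f"local load plan has {len(plan.items)} read item(s)")
            return plan

An instance would then be passed as the ``planner`` argument to `load`.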
We provide a set of APIs to help users get and set state_dicts easily. This is
an experimental feature and is subject to change.
.. autofunction:: torch.distributed.checkpoint.state_dict.get_state_dict
.. autofunction:: torch.distributed.checkpoint.state_dict.get_model_state_dict
.. autofunction:: torch.distributed.checkpoint.state_dict.get_optimizer_state_dict
.. autofunction:: torch.distributed.checkpoint.state_dict.set_state_dict
.. autofunction:: torch.distributed.checkpoint.state_dict.set_model_state_dict
.. autofunction:: torch.distributed.checkpoint.state_dict.set_optimizer_state_dict
.. autoclass:: torch.distributed.checkpoint.state_dict.StateDictOptions
:members:
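
A minimal sketch of these helpers on a toy model (in a distributed job, each rank would run the same code; `StateDictOptions` can additionally request, for example, a full unsharded state_dict or CPU offloading):

.. code-block:: python

    import torch
    from torch.distributed.checkpoint.state_dict import (
        get_state_dict,
        set_state_dict,
    )

    model = torch.nn.Linear(4, 4)
    optim = torch.optim.AdamW(model.parameters())

    # Fetch model and optimizer state in one consistent call.
    model_sd, optim_sd = get_state_dict(model, optim)

    # Restore both in one call.
    set_state_dict(
        model,
        optim,
        model_state_dict=model_sd,
        optim_state_dict=optim_sd,
    )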