blob: 57400070be3b820cbc12adff33546873662cc68d [file] [log] [blame]
.. _rendezvous-api:
Rendezvous
==========
.. automodule:: torch.distributed.elastic.rendezvous
Below is a state diagram describing how rendezvous works.
.. image:: etcd_rdzv_diagram.png
Registry
--------
.. autoclass:: RendezvousParameters
:members:
.. autoclass:: RendezvousHandlerRegistry
.. automodule:: torch.distributed.elastic.rendezvous.registry
Handler
-------
.. currentmodule:: torch.distributed.elastic.rendezvous
.. autoclass:: RendezvousHandler
:members:
Dataclasses
-----------
.. autoclass:: RendezvousInfo
.. currentmodule:: torch.distributed.elastic.rendezvous.api
.. autoclass:: RendezvousStoreInfo
.. automethod:: build(rank, store)
Exceptions
----------
.. autoclass:: RendezvousError
.. autoclass:: RendezvousClosedError
.. autoclass:: RendezvousTimeoutError
.. autoclass:: RendezvousConnectionError
.. autoclass:: RendezvousStateError
.. autoclass:: RendezvousGracefulExitError
Implementations
---------------
Dynamic Rendezvous
******************
.. currentmodule:: torch.distributed.elastic.rendezvous.dynamic_rendezvous
.. autofunction:: create_handler
.. autoclass:: DynamicRendezvousHandler()
:members: from_backend
.. autoclass:: RendezvousBackend
:members:
.. autoclass:: RendezvousTimeout
:members:
C10d Backend
^^^^^^^^^^^^
.. currentmodule:: torch.distributed.elastic.rendezvous.c10d_rendezvous_backend
.. autofunction:: create_backend
.. autoclass:: C10dRendezvousBackend
:members:
Etcd Backend
^^^^^^^^^^^^
.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous_backend
.. autofunction:: create_backend
.. autoclass:: EtcdRendezvousBackend
:members:
Etcd Rendezvous (Legacy)
************************
.. warning::
The ``DynamicRendezvousHandler`` class supersedes the ``EtcdRendezvousHandler``
class, and is recommended for most users. ``EtcdRendezvousHandler`` is in
maintenance mode and will be deprecated in the future.
.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_rendezvous
.. autoclass:: EtcdRendezvousHandler
Etcd Store
**********
The ``EtcdStore`` is the C10d ``Store`` instance type returned by
``next_rendezvous()`` when etcd is used as the rendezvous backend.
.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_store
.. autoclass:: EtcdStore
:members:
Etcd Server
***********
The ``EtcdServer`` is a convenience class that makes it easy for you to
start and stop an etcd server on a subprocess. This is useful for testing
or single-node (multi-worker) deployments where manually setting up an
etcd server on the side is cumbersome.
.. warning:: For production and multi-node deployments please consider
properly deploying a highly available etcd server as this is
the single point of failure for your distributed jobs.
.. currentmodule:: torch.distributed.elastic.rendezvous.etcd_server
.. autoclass:: EtcdServer