blob: db1465d71e682fb72270c812da9477cadccd3db9 [file] [log] [blame]
Elastic Agent
==============
.. automodule:: torch.distributed.elastic.agent
.. currentmodule:: torch.distributed.elastic.agent
Server
--------
.. automodule:: torch.distributed.elastic.agent.server
Below is a diagram of an agent that manages a local group of workers.
.. image:: agent_diagram.jpg
Concepts
--------
This section describes the high-level classes and concepts that
are relevant to understanding the role of the ``agent`` in torchelastic.
.. currentmodule:: torch.distributed.elastic.agent.server
.. autoclass:: ElasticAgent
:members:
.. autoclass:: WorkerSpec
:members:
.. autoclass:: WorkerState
:members:
.. autoclass:: Worker
:members:
.. autoclass:: WorkerGroup
:members:
Implementations
-------------------
Below are the agent implementations provided by torchelastic.
.. currentmodule:: torch.distributed.elastic.agent.server.local_elastic_agent
.. autoclass:: LocalElasticAgent
Extending the Agent
---------------------
To extend the agent you can implement ```ElasticAgent`` directly, however
we recommend you extend ``SimpleElasticAgent`` instead, which provides
most of the scaffolding and leaves you with a few specific abstract methods
to implement.
.. currentmodule:: torch.distributed.elastic.agent.server
.. autoclass:: SimpleElasticAgent
:members:
:private-members:
.. autoclass:: torch.distributed.elastic.agent.server.api.RunResult
Watchdog in the Agent
---------------------
A named pipe based watchdog can be enabled in ```LocalElasticAgent``` if an
environment variable ``TORCHELASTIC_ENABLE_FILE_TIMER`` with value 1 has
been defined in the ```LocalElasticAgent``` process.
Optionally, another environment variable ```TORCHELASTIC_TIMER_FILE```
can be set with a unique file name for the named pipe. If the environment
variable ```TORCHELASTIC_TIMER_FILE``` is not set, ```LocalElasticAgent```
will internally create a unique file name and set it to the environment
variable ```TORCHELASTIC_TIMER_FILE```, and this environment variable will
be propagated to the worker processes to allow them to connect to the same
named pipe that ```LocalElasticAgent``` uses.