 torch.optim
 ===================================

 .. automodule:: torch.optim

 How to use an optimizer
 -----------------------

 To use :mod:`torch.optim` you have to construct an optimizer object that will hold
 the current state and will update the parameters based on the computed gradients.

 Constructing it
 ^^^^^^^^^^^^^^^

 To construct an :class:`Optimizer` you have to give it an iterable containing the
 parameters (all should be :class:`~torch.nn.Parameter` s) to optimize. Then,
 you can specify optimizer-specific options such as the learning rate, weight decay, etc.

 Example::

     optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
     optimizer = optim.Adam([var1, var2], lr=0.0001)

 Per-parameter options
 ^^^^^^^^^^^^^^^^^^^^^

 :class:`Optimizer` s also support specifying per-parameter options. To do this, instead
 of passing an iterable of :class:`~torch.nn.Parameter` s, pass in an iterable of
 :class:`dict` s. Each of them will define a separate parameter group, and should contain
 a ``params`` key, containing a list of parameters belonging to it. Other keys
 should match the keyword arguments accepted by the optimizers, and will be used
 as optimization options for this group.

 For example, this is very useful when one wants to specify per-layer learning rates::

     optim.SGD([
                     {'params': model.base.parameters(), 'lr': 1e-2},
                     {'params': model.classifier.parameters()}
                 ], lr=1e-3, momentum=0.9)

 This means that ``model.base``'s parameters will use a learning rate of ``1e-2``, whereas
 ``model.classifier``'s parameters will stick to the default learning rate of ``1e-3``.
 Finally, a momentum of ``0.9`` will be used for all parameters.

 .. note::

     You can still pass options as keyword arguments. They will be used as
     defaults, in the groups that didn't override them. This is useful when you
     only want to vary a single option, while keeping all others consistent
     between parameter groups.
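
 As a quick check of how group options and keyword defaults combine, the sketch below builds a small stand-in model and reads the resolved options back from ``optimizer.param_groups``. The ``TinyNet`` class and its layer sizes are hypothetical, chosen only to mirror the ``model.base`` and ``model.classifier`` attributes used above:

```python
import torch
from torch import nn, optim


class TinyNet(nn.Module):
    """Hypothetical model exposing the two submodules used in the example above."""

    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 4)
        self.classifier = nn.Linear(4, 2)

    def forward(self, x):
        return self.classifier(torch.relu(self.base(x)))


model = TinyNet()
optimizer = optim.SGD([
    {'params': model.base.parameters(), 'lr': 1e-2},  # overrides the default lr
    {'params': model.classifier.parameters()},        # falls back to the defaults
], lr=1e-3, momentum=0.9)

# Each parameter group stores its fully resolved options.
resolved = [(group['lr'], group['momentum']) for group in optimizer.param_groups]
for lr, momentum in resolved:
    print(f"lr={lr}, momentum={momentum}")
```

 The first group keeps its own learning rate, while both groups share the momentum passed as a keyword argument.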

 Also consider the following example related to the distinct penalization of parameters.
 Remember that :func:`~torch.nn.Module.parameters` returns an iterable that
 contains all learnable parameters, including biases and other
 parameters that may prefer distinct penalization. To address this, one can specify
 individual penalization weights for each parameter group::

     bias_params = [p for name, p in model.named_parameters() if 'bias' in name]
     others = [p for name, p in model.named_parameters() if 'bias' not in name]

     optim.SGD([
                     {'params': others},
                     {'params': bias_params, 'weight_decay': 0}
                 ], weight_decay=1e-2, lr=1e-2)

 In this manner, bias terms are isolated from non-bias terms, and a ``weight_decay``
 of ``0`` is set specifically for the bias terms, so as to avoid any penalization for
 this group.


 Taking an optimization step
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^

 All optimizers implement a :func:`~Optimizer.step` method that updates the
 parameters. It can be used in two ways:

 ``optimizer.step()``
 ~~~~~~~~~~~~~~~~~~~~

 This is a simplified version supported by most optimizers. The function can be
 called once the gradients are computed using e.g.
 :func:`~torch.Tensor.backward`.

 Example::

     for input, target in dataset:
         optimizer.zero_grad()
         output = model(input)
         loss = loss_fn(output, target)
         loss.backward()
         optimizer.step()

 ``optimizer.step(closure)``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Some optimization algorithms such as Conjugate Gradient and LBFGS need to
 reevaluate the function multiple times, so you have to pass in a closure that
 allows them to recompute your model. The closure should clear the gradients,
 compute the loss, and return it.

 Example::

     for input, target in dataset:
         def closure():
             optimizer.zero_grad()
             output = model(input)
             loss = loss_fn(output, target)
             loss.backward()
             return loss
         optimizer.step(closure)
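
 To see the closure being re-evaluated, here is a minimal self-contained sketch that minimizes a toy quadratic with :class:`LBFGS`. The objective, the starting point, and the ``calls`` counter are illustrative assumptions and not part of the API:

```python
import torch

# Toy problem (an assumption for illustration): minimize sum(x ** 2).
x = torch.nn.Parameter(torch.tensor([3.0, -2.0]))
optimizer = torch.optim.LBFGS([x], lr=1.0)

calls = []  # records every time the optimizer invokes the closure


def closure():
    calls.append(1)
    optimizer.zero_grad()    # clear stale gradients
    loss = (x ** 2).sum()    # recompute the objective
    loss.backward()          # recompute the gradients
    return loss


# A single step() may evaluate the closure several times internally.
optimizer.step(closure)
print(f"closure evaluations: {len(calls)}, x = {x.detach().tolist()}")
```

 Because the optimizer decides how often to call the closure, any side effects inside it (logging, counters) should tolerate repeated execution.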

 .. _optimizer-algorithms:

 Base class
 ----------

 .. autoclass:: Optimizer

 .. autosummary::
     :toctree: generated
     :nosignatures:

     Optimizer.add_param_group
     Optimizer.load_state_dict
     Optimizer.register_load_state_dict_pre_hook
     Optimizer.register_load_state_dict_post_hook
     Optimizer.state_dict
     Optimizer.register_state_dict_pre_hook
     Optimizer.register_state_dict_post_hook
     Optimizer.step
     Optimizer.register_step_pre_hook
     Optimizer.register_step_post_hook
     Optimizer.zero_grad

 Algorithms
 ----------

 .. autosummary::
     :toctree: generated
     :nosignatures:

     Adadelta
     Adafactor
     Adagrad
     Adam
     AdamW
     SparseAdam
     Adamax
     ASGD
     LBFGS
     NAdam
     RAdam
     RMSprop
     Rprop
     SGD

 Many of our algorithms have various implementations optimized for performance,
 readability and/or generality, so we attempt to default to the generally fastest
 implementation for the current device if no particular implementation has been
 specified by the user.

 We have 3 major categories of implementations: for-loop, foreach (multi-tensor), and
 fused. The most straightforward implementations are for-loops over the parameters with
 big chunks of computation. For-looping is usually slower than our foreach
 implementations, which combine parameters into a multi-tensor and run the big chunks
 of computation all at once, thereby saving many sequential kernel calls. A few of our
 optimizers have even faster fused implementations, which fuse the big chunks of
 computation into one kernel. We can think of foreach implementations as fusing
 horizontally and fused implementations as fusing vertically on top of that.

 In general, the performance ordering of the 3 implementations is fused > foreach > for-loop.
 So when applicable, we default to foreach over for-loop. Applicable means the foreach
 implementation is available, the user has not specified any implementation-specific kwargs
 (e.g., fused, foreach, differentiable), and all tensors are native. Note that while fused
 should be even faster than foreach, the implementations are newer and we would like to give
 them more bake-in time before flipping the switch everywhere. The second table below summarizes
 the stability status of each fused implementation; you are welcome to try them out in the meantime!
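
 As a sketch of how an implementation can be requested explicitly, the example below forces the for-loop and foreach variants of :class:`Adam` through the ``foreach`` keyword argument and checks that both produce essentially the same update. The model, data, and loss are made up for illustration; ``fused=True`` can be passed in the same way on devices where the tables below list a fused implementation:

```python
import copy

import torch

torch.manual_seed(0)

# Two identical copies of a small model (made up for illustration).
model_loop = torch.nn.Linear(4, 2)
model_foreach = copy.deepcopy(model_loop)

# foreach=False forces the for-loop path, foreach=True the multi-tensor path.
opt_loop = torch.optim.Adam(model_loop.parameters(), lr=1e-3, foreach=False)
opt_foreach = torch.optim.Adam(model_foreach.parameters(), lr=1e-3, foreach=True)

x = torch.randn(16, 4)
for model, opt in ((model_loop, opt_loop), (model_foreach, opt_foreach)):
    opt.zero_grad()
    model(x).pow(2).mean().backward()
    opt.step()

# The implementations differ in speed, not in the update they compute.
max_diff = max(
    (p - q).abs().max().item()
    for p, q in zip(model_loop.parameters(), model_foreach.parameters())
)
print(f"largest parameter difference: {max_diff:.2e}")
```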

 Below is a table showing the available and default implementations of each algorithm:

 .. csv-table::
     :header: "Algorithm", "Default", "Has foreach?", "Has fused?"
     :widths: 25, 25, 25, 25
     :delim: ;

     :class:`Adadelta`;foreach;yes;no
     :class:`Adafactor`;for-loop;no;no
     :class:`Adagrad`;foreach;yes;yes (cpu only)
     :class:`Adam`;foreach;yes;yes
     :class:`AdamW`;foreach;yes;yes
     :class:`SparseAdam`;for-loop;no;no
     :class:`Adamax`;foreach;yes;no
     :class:`ASGD`;foreach;yes;no
     :class:`LBFGS`;for-loop;no;no
     :class:`NAdam`;foreach;yes;no
     :class:`RAdam`;foreach;yes;no
     :class:`RMSprop`;foreach;yes;no
     :class:`Rprop`;foreach;yes;no
     :class:`SGD`;foreach;yes;yes

 The table below shows the stability status of the fused implementations:

 .. csv-table::
     :header: "Algorithm", "CPU", "CUDA", "MPS"
     :widths: 25, 25, 25, 25
     :delim: ;

     :class:`Adadelta`;unsupported;unsupported;unsupported
     :class:`Adafactor`;unsupported;unsupported;unsupported
     :class:`Adagrad`;beta;unsupported;unsupported
     :class:`Adam`;beta;stable;beta
     :class:`AdamW`;beta;stable;beta
     :class:`SparseAdam`;unsupported;unsupported;unsupported
     :class:`Adamax`;unsupported;unsupported;unsupported
     :class:`ASGD`;unsupported;unsupported;unsupported
     :class:`LBFGS`;unsupported;unsupported;unsupported
     :class:`NAdam`;unsupported;unsupported;unsupported
     :class:`RAdam`;unsupported;unsupported;unsupported
     :class:`RMSprop`;unsupported;unsupported;unsupported
     :class:`Rprop`;unsupported;unsupported;unsupported
     :class:`SGD`;beta;beta;beta

 How to adjust learning rate
 ---------------------------

 :class:`torch.optim.lr_scheduler.LRScheduler` provides several methods to adjust the learning
 rate based on the number of epochs. :class:`torch.optim.lr_scheduler.ReduceLROnPlateau`
 allows dynamic learning rate reduction based on some validation measurements.
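
 Unlike the epoch-based schedulers, :class:`~torch.optim.lr_scheduler.ReduceLROnPlateau` needs the monitored value passed to ``step()``. A minimal sketch, using a single dummy parameter and made-up validation losses that stop improving:

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter for illustration
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=2)

# Made-up validation losses that plateau after the first epoch.
val_losses = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
for val_loss in val_losses:
    optimizer.step()          # stands in for a full training epoch
    scheduler.step(val_loss)  # pass the monitored metric to the scheduler

final_lr = optimizer.param_groups[0]['lr']
print(f"learning rate after the plateau: {final_lr}")
```

 With ``patience=2``, the learning rate is multiplied by ``factor`` once the metric has failed to improve for more than two consecutive epochs.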

 Learning rate scheduling should be applied after the optimizer's update; e.g., you
 should write your code this way:

 Example::

     optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
     scheduler = ExponentialLR(optimizer, gamma=0.9)

     for epoch in range(20):
         for input, target in dataset:
             optimizer.zero_grad()
             output = model(input)
             loss = loss_fn(output, target)
             loss.backward()
             optimizer.step()
         scheduler.step()

 Most learning rate schedulers can be called back-to-back (also referred to as
 chaining schedulers). The result is that each scheduler is applied one after the
 other on the learning rate obtained by the one preceding it.

 Example::

     optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
     scheduler1 = ExponentialLR(optimizer, gamma=0.9)
     scheduler2 = MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)

     for epoch in range(20):
         for input, target in dataset:
             optimizer.zero_grad()
             output = model(input)
             loss = loss_fn(output, target)
             loss.backward()
             optimizer.step()
         scheduler1.step()
         scheduler2.step()
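
 To make the effect of chaining concrete, the sketch below records the learning rate at the start of each epoch. The dummy parameter, the initial learning rate, and the scheduler settings are assumptions chosen so that the numbers are easy to follow:

```python
import torch
from torch.optim.lr_scheduler import ExponentialLR, MultiStepLR

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter for illustration
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler1 = ExponentialLR(optimizer, gamma=0.5)                # halve every epoch
scheduler2 = MultiStepLR(optimizer, milestones=[2], gamma=0.1)  # extra x0.1 at epoch 2

lrs = []
for epoch in range(4):
    lrs.append(optimizer.param_groups[0]['lr'])
    optimizer.step()   # stands in for a full training epoch
    scheduler1.step()  # applied first
    scheduler2.step()  # applied to the value left by scheduler1

print(lrs)  # approximately [0.1, 0.05, 0.0025, 0.00125]
```

 At epoch 2 both schedulers act on the same value: the exponential decay halves ``0.05`` to ``0.025``, and the milestone then scales it by ``0.1``.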

 In many places in the documentation, we will use the following template to refer to scheduler
 algorithms:

     >>> scheduler = ...
     >>> for epoch in range(100):
     >>>     train(...)
     >>>     validate(...)
     >>>     scheduler.step()

 .. warning::
   Prior to PyTorch 1.1.0, the learning rate scheduler was expected to be called before
   the optimizer's update; 1.1.0 changed this behavior in a BC-breaking way.  If you use
   the learning rate scheduler (calling ``scheduler.step()``) before the optimizer's update
   (calling ``optimizer.step()``), this will skip the first value of the learning rate schedule.
   If you are unable to reproduce results after upgrading to PyTorch 1.1.0, please check
   if you are calling ``scheduler.step()`` at the wrong time.


 .. autosummary::
     :toctree: generated
     :nosignatures:

     lr_scheduler.LRScheduler
     lr_scheduler.LambdaLR
     lr_scheduler.MultiplicativeLR
     lr_scheduler.StepLR
     lr_scheduler.MultiStepLR
     lr_scheduler.ConstantLR
     lr_scheduler.LinearLR
     lr_scheduler.ExponentialLR
     lr_scheduler.PolynomialLR
     lr_scheduler.CosineAnnealingLR
     lr_scheduler.ChainedScheduler
     lr_scheduler.SequentialLR
     lr_scheduler.ReduceLROnPlateau
     lr_scheduler.CyclicLR
     lr_scheduler.OneCycleLR
     lr_scheduler.CosineAnnealingWarmRestarts

 Weight Averaging (SWA and EMA)
 ------------------------------

 :class:`torch.optim.swa_utils.AveragedModel` implements Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA),
 :class:`torch.optim.swa_utils.SWALR` implements the SWA learning rate scheduler and
 :func:`torch.optim.swa_utils.update_bn` is a utility function used to update SWA/EMA batch
 normalization statistics at the end of training.

 SWA was proposed in `Averaging Weights Leads to Wider Optima and Better Generalization`_.

 EMA is a widely known technique to reduce the training time by reducing the number of weight updates needed. It is a variation of `Polyak averaging`_, but using exponential weights instead of equal weights across iterations.

 .. _`Averaging Weights Leads to Wider Optima and Better Generalization`: https://arxiv.org/abs/1803.05407

 .. _`Polyak averaging`: https://paperswithcode.com/method/polyak-averaging

 Constructing averaged models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 The :class:`~torch.optim.swa_utils.AveragedModel` class serves to compute the weights of the SWA or EMA model.

 You can create an SWA averaged model by running:

 >>> averaged_model = AveragedModel(model)

 EMA models are constructed by specifying the ``multi_avg_fn`` argument as follows:

 >>> decay = 0.999
 >>> averaged_model = AveragedModel(model, multi_avg_fn=get_ema_multi_avg_fn(decay))

 Decay is a parameter between 0 and 1 that controls how fast the averaged parameters are decayed. If not provided to :func:`torch.optim.swa_utils.get_ema_multi_avg_fn`, the default is 0.999.

 :func:`torch.optim.swa_utils.get_ema_multi_avg_fn` returns a function that applies the following EMA equation to the weights:

 .. math:: W^\textrm{EMA}_{t+1} = \alpha W^\textrm{EMA}_{t} + (1 - \alpha) W^\textrm{model}_t

 where :math:`\alpha` is the EMA decay.

 Here the model ``model`` can be an arbitrary :class:`torch.nn.Module` object. ``averaged_model``
 will keep track of the running averages of the parameters of the ``model``. To update these
 averages, you should call the :func:`update_parameters` method after ``optimizer.step()``:

 >>> averaged_model.update_parameters(model)

 For SWA and EMA, this call is usually done right after the optimizer ``step()``. In the case of SWA, this is usually skipped for some number of steps at the beginning of training.

 Custom averaging strategies
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^

 By default, :class:`torch.optim.swa_utils.AveragedModel` computes a running equal average of
 the parameters that you provide, but you can also use custom averaging functions with the
 ``avg_fn`` or ``multi_avg_fn`` parameters:

 - ``avg_fn`` allows defining a function that operates on one parameter at a time. It receives the averaged parameter, the model parameter, and the number of models averaged so far, and should return the new averaged parameter.
 - ``multi_avg_fn`` allows defining more efficient operations that act on whole parameter lists at once (the averaged parameter list and the model parameter list, followed by the number of models averaged so far), for example using the ``torch._foreach*`` functions. This function must update the averaged parameters in-place.

 In the following example ``ema_model`` computes an exponential moving average using the ``avg_fn`` parameter:

 >>> ema_avg = lambda averaged_model_parameter, model_parameter, num_averaged:\
 >>>         0.9 * averaged_model_parameter + 0.1 * model_parameter
 >>> ema_model = torch.optim.swa_utils.AveragedModel(model, avg_fn=ema_avg)


 In the following example ``ema_model`` computes an exponential moving average using the more efficient ``multi_avg_fn`` parameter:

 >>> ema_model = AveragedModel(model, multi_avg_fn=get_ema_multi_avg_fn(0.9))


 SWA learning rate schedules
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Typically, in SWA the learning rate is set to a high constant value. :class:`SWALR` is a
 learning rate scheduler that anneals the learning rate to a fixed value, and then keeps it
 constant. For example, the following code creates a scheduler that linearly anneals the
 learning rate from its initial value to 0.05 in 5 epochs within each parameter group:

 >>> swa_scheduler = torch.optim.swa_utils.SWALR(optimizer, \
 >>>         anneal_strategy="linear", anneal_epochs=5, swa_lr=0.05)

 You can also use cosine annealing to a fixed value instead of linear annealing by setting
 ``anneal_strategy="cos"``.
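
 The annealing behaviour can be observed by recording the learning rate after each call to ``swa_scheduler.step()``. This sketch assumes a dummy parameter and an initial learning rate of ``0.5``; the scheduler settings are the ones from the example above:

```python
import torch
from torch.optim.swa_utils import SWALR

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter for illustration
optimizer = torch.optim.SGD([param], lr=0.5)
swa_scheduler = SWALR(optimizer, anneal_strategy="linear", anneal_epochs=5, swa_lr=0.05)

lrs = [optimizer.param_groups[0]['lr']]  # learning rate before any annealing
for epoch in range(7):
    optimizer.step()      # stands in for a full training epoch
    swa_scheduler.step()
    lrs.append(optimizer.param_groups[0]['lr'])

print([round(lr, 4) for lr in lrs])
```

 The recorded values fall linearly from ``0.5`` to ``0.05`` over the first five steps and then stay at ``0.05``.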


 Taking care of batch normalization
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 :func:`update_bn` is a utility function that computes the batchnorm statistics for the SWA model
 on a given dataloader ``loader`` at the end of training:

 >>> torch.optim.swa_utils.update_bn(loader, swa_model)

 :func:`update_bn` applies the ``swa_model`` to every element in the dataloader and computes the activation
 statistics for each batch normalization layer in the model.

 .. warning::
     :func:`update_bn` assumes that each batch in the dataloader ``loader`` is either a tensor or a list of
     tensors where the first element is the tensor that the network ``swa_model`` should be applied to.
     If your dataloader has a different structure, you can update the batch normalization statistics of the
     ``swa_model`` by doing a forward pass with the ``swa_model`` on each element of the dataset.
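
 For a dataloader with a different structure, the manual forward pass could look like the sketch below. It assumes a hypothetical loader whose batches are dicts with a ``"features"`` key, and it resets the batch normalization statistics and switches to a cumulative average before the pass, which is an assumption about what a faithful manual update needs rather than a documented recipe:

```python
import torch
from torch.optim.swa_utils import AveragedModel

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(3, 4), torch.nn.BatchNorm1d(4))
swa_model = AveragedModel(model)

# Hypothetical loader: each batch is a dict rather than a tensor or a list.
loader = [{"features": torch.randn(8, 3)} for _ in range(4)]

bn_layers = [m for m in swa_model.modules() if isinstance(m, torch.nn.BatchNorm1d)]
saved_momenta = {bn: bn.momentum for bn in bn_layers}
for bn in bn_layers:
    bn.reset_running_stats()
    bn.momentum = None  # None makes BatchNorm keep a cumulative average

was_training = swa_model.training
swa_model.train()  # running statistics are only updated in training mode
with torch.no_grad():
    for batch in loader:
        swa_model(batch["features"])

# Restore the original momenta and training mode.
for bn in bn_layers:
    bn.momentum = saved_momenta[bn]
swa_model.train(was_training)
```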


 Putting it all together: SWA
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 In the example below, ``swa_model`` is the SWA model that accumulates the averages of the weights.
 We train the model for a total of 300 epochs and we switch to the SWA learning rate schedule
 and start to collect SWA averages of the parameters at epoch 160:

 >>> loader, optimizer, model, loss_fn = ...
 >>> swa_model = torch.optim.swa_utils.AveragedModel(model)
 >>> scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
 >>> swa_start = 160
 >>> swa_scheduler = torch.optim.swa_utils.SWALR(optimizer, swa_lr=0.05)
 >>>
 >>> for epoch in range(300):
 >>>       for input, target in loader:
 >>>           optimizer.zero_grad()
 >>>           loss_fn(model(input), target).backward()
 >>>           optimizer.step()
 >>>       if epoch >= swa_start:
 >>>           swa_model.update_parameters(model)
 >>>           swa_scheduler.step()
 >>>       else:
 >>>           scheduler.step()
 >>>
 >>> # Update bn statistics for the swa_model at the end
 >>> torch.optim.swa_utils.update_bn(loader, swa_model)
 >>> # Use swa_model to make predictions on test data
 >>> preds = swa_model(test_input)


 Putting it all together: EMA
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 In the example below, ``ema_model`` is the EMA model that accumulates the exponentially-decayed averages of the weights with a decay rate of 0.999.
 We train the model for a total of 300 epochs and start to collect EMA averages immediately.

 >>> loader, optimizer, model, loss_fn = ...
 >>> ema_model = torch.optim.swa_utils.AveragedModel(model, \
 >>>             multi_avg_fn=torch.optim.swa_utils.get_ema_multi_avg_fn(0.999))
 >>>
 >>> for epoch in range(300):
 >>>       for input, target in loader:
 >>>           optimizer.zero_grad()
 >>>           loss_fn(model(input), target).backward()
 >>>           optimizer.step()
 >>>           ema_model.update_parameters(model)
 >>>
 >>> # Update bn statistics for the ema_model at the end
 >>> torch.optim.swa_utils.update_bn(loader, ema_model)
 >>> # Use ema_model to make predictions on test data
 >>> preds = ema_model(test_input)

 .. autosummary::
     :toctree: generated
     :nosignatures:

     swa_utils.AveragedModel
     swa_utils.SWALR


 .. autofunction:: torch.optim.swa_utils.get_ema_multi_avg_fn
 .. autofunction:: torch.optim.swa_utils.update_bn


 .. This module needs to be documented. Adding here in the meantime
 .. for tracking purposes
 .. py:module:: torch.optim.adadelta
 .. py:module:: torch.optim.adagrad
 .. py:module:: torch.optim.adam
 .. py:module:: torch.optim.adamax
 .. py:module:: torch.optim.adamw
 .. py:module:: torch.optim.asgd
 .. py:module:: torch.optim.lbfgs
 .. py:module:: torch.optim.lr_scheduler
 .. py:module:: torch.optim.nadam
 .. py:module:: torch.optim.optimizer
 .. py:module:: torch.optim.radam
 .. py:module:: torch.optim.rmsprop
 .. py:module:: torch.optim.rprop
 .. py:module:: torch.optim.sgd
 .. py:module:: torch.optim.sparse_adam
 .. py:module:: torch.optim.swa_utils
	torch.optim
	===================================

	.. automodule:: torch.optim

	How to use an optimizer
	-----------------------

	To use :mod:`torch.optim` you have to construct an optimizer object that will hold
	the current state and will update the parameters based on the computed gradients.

	Constructing it
	^^^^^^^^^^^^^^^

	To construct an :class:`Optimizer` you have to give it an iterable containing the
	parameters (all should be :class:`~torch.autograd.Variable` s) to optimize. Then,
	you can specify optimizer-specific options such as the learning rate, weight decay, etc.

	Example::

	optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
	optimizer = optim.Adam([var1, var2], lr=0.0001)

	Per-parameter options
	^^^^^^^^^^^^^^^^^^^^^

	:class:`Optimizer` s also support specifying per-parameter options. To do this, instead
	of passing an iterable of :class:`~torch.autograd.Variable` s, pass in an iterable of
	:class:`dict` s. Each of them will define a separate parameter group, and should contain
	a ``params`` key, containing a list of parameters belonging to it. Other keys
	should match the keyword arguments accepted by the optimizers, and will be used
	as optimization options for this group.

	For example, this is very useful when one wants to specify per-layer learning rates::

	optim.SGD([
	{'params': model.base.parameters(), 'lr': 1e-2},
	{'params': model.classifier.parameters()}
	], lr=1e-3, momentum=0.9)

	This means that ``model.base``'s parameters will use a learning rate of ``1e-2``, whereas
	``model.classifier``'s parameters will stick to the default learning rate of ``1e-3``.
	Finally a momentum of ``0.9`` will be used for all parameters.

	.. note::

	You can still pass options as keyword arguments. They will be used as
	defaults, in the groups that didn't override them. This is useful when you
	only want to vary a single option, while keeping all others consistent
	between parameter groups.

	Also consider the following example related to the distinct penalization of parameters.
	Remember that :func:`~torch.nn.Module.parameters` returns an iterable that
	contains all learnable parameters, including biases and other
	parameters that may prefer distinct penalization. To address this, one can specify
	individual penalization weights for each parameter group::

	bias_params = [p for name, p in self.named_parameters() if 'bias' in name]
	others = [p for name, p in self.named_parameters() if 'bias' not in name]

	optim.SGD([
	{'params': others},
	{'params': bias_params, 'weight_decay': 0}
	], weight_decay=1e-2, lr=1e-2)

	In this manner, bias terms are isolated from non-bias terms, and a ``weight_decay``
	of ``0`` is set specifically for the bias terms, as to avoid any penalization for
	this group.


	Taking an optimization step
	^^^^^^^^^^^^^^^^^^^^^^^^^^^

	All optimizers implement a :func:`~Optimizer.step` method, that updates the
	parameters. It can be used in two ways:

	``optimizer.step()``
	~~~~~~~~~~~~~~~~~~~~

	This is a simplified version supported by most optimizers. The function can be
	called once the gradients are computed using e.g.
	:func:`~torch.autograd.Variable.backward`.

	Example::

	for input, target in dataset:
	optimizer.zero_grad()
	output = model(input)
	loss = loss_fn(output, target)
	loss.backward()
	optimizer.step()

	``optimizer.step(closure)``
	~~~~~~~~~~~~~~~~~~~~~~~~~~~

	Some optimization algorithms such as Conjugate Gradient and LBFGS need to
	reevaluate the function multiple times, so you have to pass in a closure that
	allows them to recompute your model. The closure should clear the gradients,
	compute the loss, and return it.

	Example::

	for input, target in dataset:
	def closure():
	optimizer.zero_grad()
	output = model(input)
	loss = loss_fn(output, target)
	loss.backward()
	return loss
	optimizer.step(closure)

	.. _optimizer-algorithms:

	Base class
	----------

	.. autoclass:: Optimizer

	.. autosummary::
	:toctree: generated
	:nosignatures:

	Optimizer.add_param_group
	Optimizer.load_state_dict
	Optimizer.register_load_state_dict_pre_hook
	Optimizer.register_load_state_dict_post_hook
	Optimizer.state_dict
	Optimizer.register_state_dict_pre_hook
	Optimizer.register_state_dict_post_hook
	Optimizer.step
	Optimizer.register_step_pre_hook
	Optimizer.register_step_post_hook
	Optimizer.zero_grad

	Algorithms
	----------

	.. autosummary::
	:toctree: generated
	:nosignatures:

	Adadelta
	Adafactor
	Adagrad
	Adam
	AdamW
	SparseAdam
	Adamax
	ASGD
	LBFGS
	NAdam
	RAdam
	RMSprop
	Rprop
	SGD

	Many of our algorithms have various implementations optimized for performance,
	readability and/or generality, so we attempt to default to the generally fastest
	implementation for the current device if no particular implementation has been
	specified by the user.

	We have 3 major categories of implementations: for-loop, foreach (multi-tensor), and
	fused. The most straightforward implementations are for-loops over the parameters with
	big chunks of computation. For-looping is usually slower than our foreach
	implementations, which combine parameters into a multi-tensor and run the big chunks
	of computation all at once, thereby saving many sequential kernel calls. A few of our
	optimizers have even faster fused implementations, which fuse the big chunks of
	computation into one kernel. We can think of foreach implementations as fusing
	horizontally and fused implementations as fusing vertically on top of that.

	In general, the performance ordering of the 3 implementations is fused > foreach > for-loop.
	So when applicable, we default to foreach over for-loop. Applicable means the foreach
	implementation is available, the user has not specified any implementation-specific kwargs
	(e.g., fused, foreach, differentiable), and all tensors are native. Note that while fused
	should be even faster than foreach, the implementations are newer and we would like to give
	them more bake-in time before flipping the switch everywhere. We summarize the stability status
	for each implementation on the second table below, you are welcome to try them out though!

	Below is a table showing the available and default implementations of each algorithm:

	.. csv-table::
	:header: "Algorithm", "Default", "Has foreach?", "Has fused?"
	:widths: 25, 25, 25, 25
	:delim: ;

	:class:`Adadelta`;foreach;yes;no
	:class:`Adafactor`;for-loop;no;no
	:class:`Adagrad`;foreach;yes;yes (cpu only)
	:class:`Adam`;foreach;yes;yes
	:class:`AdamW`;foreach;yes;yes
	:class:`SparseAdam`;for-loop;no;no
	:class:`Adamax`;foreach;yes;no
	:class:`ASGD`;foreach;yes;no
	:class:`LBFGS`;for-loop;no;no
	:class:`NAdam`;foreach;yes;no
	:class:`RAdam`;foreach;yes;no
	:class:`RMSprop`;foreach;yes;no
	:class:`Rprop`;foreach;yes;no
	:class:`SGD`;foreach;yes;yes

	Below table is showing the stability status for fused implementations:

	.. csv-table::
	:header: "Algorithm", "CPU", "CUDA", "MPS"
	:widths: 25, 25, 25, 25
	:delim: ;

	:class:`Adadelta`;unsupported;unsupported;unsupported
	:class:`Adafactor`;unsupported;unsupported;unsupported
	:class:`Adagrad`;beta;unsupported;unsupported
	:class:`Adam`;beta;stable;beta
	:class:`AdamW`;beta;stable;beta
	:class:`SparseAdam`;unsupported;unsupported;unsupported
	:class:`Adamax`;unsupported;unsupported;unsupported
	:class:`ASGD`;unsupported;unsupported;unsupported
	:class:`LBFGS`;unsupported;unsupported;unsupported
	:class:`NAdam`;unsupported;unsupported;unsupported
	:class:`RAdam`;unsupported;unsupported;unsupported
	:class:`RMSprop`;unsupported;unsupported;unsupported
	:class:`Rprop`;unsupported;unsupported;unsupported
	:class:`SGD`;beta;beta;beta

	How to adjust learning rate
	---------------------------

	:class:`torch.optim.lr_scheduler.LRScheduler` provides several methods to adjust the learning
	rate based on the number of epochs. :class:`torch.optim.lr_scheduler.ReduceLROnPlateau`
	allows dynamic learning rate reducing based on some validation measurements.

	Learning rate scheduling should be applied after optimizer's update; e.g., you
	should write your code this way:

	Example::

	optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
	scheduler = ExponentialLR(optimizer, gamma=0.9)

	for epoch in range(20):
	for input, target in dataset:
	optimizer.zero_grad()
	output = model(input)
	loss = loss_fn(output, target)
	loss.backward()
	optimizer.step()
	scheduler.step()

	Most learning rate schedulers can be called back-to-back (also referred to as
	chaining schedulers). The result is that each scheduler is applied one after the
	other on the learning rate obtained by the one preceding it.

	Example::

	optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
	scheduler1 = ExponentialLR(optimizer, gamma=0.9)
	scheduler2 = MultiStepLR(optimizer, milestones=[30,80], gamma=0.1)

	for epoch in range(20):
	for input, target in dataset:
	optimizer.zero_grad()
	output = model(input)
	loss = loss_fn(output, target)
	loss.backward()
	optimizer.step()
	scheduler1.step()
	scheduler2.step()

	In many places in the documentation, we will use the following template to refer to schedulers
	algorithms.

	>>> scheduler = ...
	>>> for epoch in range(100):
	>>> train(...)
	>>> validate(...)
	>>> scheduler.step()

	.. warning::
	Prior to PyTorch 1.1.0, the learning rate scheduler was expected to be called before
	the optimizer's update; 1.1.0 changed this behavior in a BC-breaking way. If you use
	the learning rate scheduler (calling ``scheduler.step()``) before the optimizer's update
	(calling ``optimizer.step()``), this will skip the first value of the learning rate schedule.
	If you are unable to reproduce results after upgrading to PyTorch 1.1.0, please check
	if you are calling ``scheduler.step()`` at the wrong time.


	.. autosummary::
	:toctree: generated
	:nosignatures:

	lr_scheduler.LRScheduler
	lr_scheduler.LambdaLR
	lr_scheduler.MultiplicativeLR
	lr_scheduler.StepLR
	lr_scheduler.MultiStepLR
	lr_scheduler.ConstantLR
	lr_scheduler.LinearLR
	lr_scheduler.ExponentialLR
	lr_scheduler.PolynomialLR
	lr_scheduler.CosineAnnealingLR
	lr_scheduler.ChainedScheduler
	lr_scheduler.SequentialLR
	lr_scheduler.ReduceLROnPlateau
	lr_scheduler.CyclicLR
	lr_scheduler.OneCycleLR
	lr_scheduler.CosineAnnealingWarmRestarts

	Weight Averaging (SWA and EMA)
	------------------------------

	:class:`torch.optim.swa_utils.AveragedModel` implements Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA),
	:class:`torch.optim.swa_utils.SWALR` implements the SWA learning rate scheduler and
	:func:`torch.optim.swa_utils.update_bn` is a utility function used to update SWA/EMA batch
	normalization statistics at the end of training.

	SWA has been proposed in `Averaging Weights Leads to Wider Optima and Better Generalization`_.

	EMA is a widely known technique to reduce the training time by reducing the number of weight updates needed. It is a variation of `Polyak averaging`_, but using exponential weights instead of equal weights across iterations.

	.. _`Averaging Weights Leads to Wider Optima and Better Generalization`: https://arxiv.org/abs/1803.05407

	.. _`Polyak averaging`: https://paperswithcode.com/method/polyak-averaging

	Constructing averaged models
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	The `AveragedModel` class serves to compute the weights of the SWA or EMA model.

	You can create an SWA averaged model by running:

	>>> averaged_model = AveragedModel(model)

	EMA models are constructed by specifying the ``multi_avg_fn`` argument as follows:

	>>> decay = 0.999
	>>> averaged_model = AveragedModel(model, multi_avg_fn=get_ema_multi_avg_fn(decay))

	Decay is a parameter between 0 and 1 that controls how fast the averaged parameters are decayed. If not provided to :func:`torch.optim.swa_utils.get_ema_multi_avg_fn`, the default is 0.999.

	:func:`torch.optim.swa_utils.get_ema_multi_avg_fn` returns a function that applies the following EMA equation to the weights:

	.. math:: W^\textrm{EMA}_{t+1} = \alpha W^\textrm{EMA}_{t} + (1 - \alpha) W^\textrm{model}_t

	where alpha is the EMA decay.

	Here the model ``model`` can be an arbitrary :class:`torch.nn.Module` object. ``averaged_model``
	will keep track of the running averages of the parameters of the ``model``. To update these
	averages, you should use the :func:`update_parameters` function after the `optimizer.step()`:

	>>> averaged_model.update_parameters(model)

	For SWA and EMA, this call is usually done right after the optimizer ``step()``. In the case of SWA, this is usually skipped for some numbers of steps at the beginning of the training.

	Custom averaging strategies
	^^^^^^^^^^^^^^^^^^^^^^^^^^^

	By default, :class:`torch.optim.swa_utils.AveragedModel` computes a running equal average of
	the parameters that you provide, but you can also use custom averaging functions with the
	``avg_fn`` or ``multi_avg_fn`` parameters:

	- ``avg_fn`` allows defining a function operating on each parameter tuple (averaged parameter, model parameter) and should return the new averaged parameter.
	- ``multi_avg_fn`` allows defining more efficient operations acting on a tuple of parameter lists, (averaged parameter list, model parameter list), at the same time, for example using the ``torch._foreach*`` functions. This function must update the averaged parameters in-place.
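
	To make the ``avg_fn`` signature concrete, the sketch below writes a running equal average as a function of ``(averaged parameter, model parameter, number of values averaged so far)`` and checks it against the arithmetic mean. It uses plain floats rather than tensors, and ``equal_avg`` and ``samples`` are hypothetical names, not part of :mod:`torch.optim.swa_utils`.

```python
def equal_avg(averaged_param, model_param, num_averaged):
    """Fold one new value into a mean that already covers num_averaged values."""
    return averaged_param + (model_param - averaged_param) / (num_averaged + 1)


# Hypothetical values of a single scalar parameter at four averaging steps.
samples = [2.0, 4.0, 9.0, 1.0]

running = samples[0]  # the first value is taken as-is
for num_averaged, sample in enumerate(samples[1:], start=1):
    running = equal_avg(running, sample, num_averaged)

# The incremental update reproduces the ordinary mean of all samples.
print(running, sum(samples) / len(samples))
```

	Because the update only needs the previous average and a counter, no history of past parameters has to be stored.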

	In the following example ``ema_model`` computes an exponential moving average using the ``avg_fn`` parameter:

	>>> ema_avg = lambda averaged_model_parameter, model_parameter, num_averaged:\
	>>>         0.9 * averaged_model_parameter + 0.1 * model_parameter
	>>> ema_model = torch.optim.swa_utils.AveragedModel(model, avg_fn=ema_avg)


	In the following example ``ema_model`` computes an exponential moving average using the more efficient ``multi_avg_fn`` parameter:

	>>> ema_model = AveragedModel(model, multi_avg_fn=get_ema_multi_avg_fn(0.9))


	SWA learning rate schedules
	^^^^^^^^^^^^^^^^^^^^^^^^^^^

	Typically, in SWA the learning rate is set to a high constant value. :class:`SWALR` is a
	learning rate scheduler that anneals the learning rate to a fixed value, and then keeps it
	constant. For example, the following code creates a scheduler that linearly anneals the
	learning rate from its initial value to 0.05 in 5 epochs within each parameter group:

	>>> swa_scheduler = torch.optim.swa_utils.SWALR(optimizer, \
	>>>         anneal_strategy="linear", anneal_epochs=5, swa_lr=0.05)

	You can also use cosine annealing to a fixed value instead of linear annealing by setting
	``anneal_strategy="cos"``.
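
	The difference between the two strategies is easiest to see numerically. The sketch below interpolates from an initial learning rate to ``swa_lr`` over ``anneal_epochs`` steps and then holds the value constant. It is an illustration of the annealing shapes under the assumption that the schedule is a direct interpolation between the two rates; ``annealed_lr`` is a hypothetical helper and not the :class:`SWALR` implementation.

```python
import math


def annealed_lr(initial_lr, swa_lr, step, anneal_epochs, anneal_strategy="linear"):
    """Learning rate after `step` scheduler steps (illustrative sketch only)."""
    # Fraction of the annealing phase completed, clamped to [0, 1].
    t = max(0.0, min(1.0, step / max(1, anneal_epochs)))
    if anneal_strategy == "cos":
        alpha = (1.0 - math.cos(math.pi * t)) / 2.0
    elif anneal_strategy == "linear":
        alpha = t
    else:
        raise ValueError("anneal_strategy must be 'linear' or 'cos'")
    # alpha = 0 gives the initial rate, alpha = 1 gives swa_lr.
    return swa_lr * alpha + initial_lr * (1.0 - alpha)


linear = [annealed_lr(0.1, 0.05, s, 5) for s in range(8)]
cosine = [annealed_lr(0.1, 0.05, s, 5, "cos") for s in range(8)]
print(linear)
print(cosine)
```

	Both schedules start at the initial rate and reach ``swa_lr`` after five steps; the cosine variant changes more slowly near the two ends and faster in the middle.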


	Taking care of batch normalization
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	:func:`update_bn` is a utility function that computes the batch normalization statistics for the SWA model
	on a given dataloader ``loader`` at the end of training:

	>>> torch.optim.swa_utils.update_bn(loader, swa_model)

	:func:`update_bn` applies the ``swa_model`` to every element in the dataloader and computes the activation
	statistics for each batch normalization layer in the model.

	.. warning::
	    :func:`update_bn` assumes that each batch in the dataloader ``loader`` is either a tensor or a list of
	    tensors where the first element is the tensor that the network ``swa_model`` should be applied to.
	    If your dataloader has a different structure, you can update the batch normalization statistics of the
	    ``swa_model`` by doing a forward pass with the ``swa_model`` on each element of the dataset.
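
	For a dataloader with a different structure, the manual forward pass could look like the sketch below. It assumes each batch is a :class:`dict` with the input tensor under a key named ``"input"``; the helper ``update_bn_from_dict_batches`` and that key are hypothetical. The approach (reset the running statistics, switch to a cumulative average, run forward passes in training mode) is one reasonable way to do this, not necessarily identical to what :func:`update_bn` does internally.

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel


@torch.no_grad()
def update_bn_from_dict_batches(batches, model, key="input"):
    """Recompute BatchNorm running statistics from batches shaped like {key: tensor}."""
    bn_layers = [m for m in model.modules()
                 if isinstance(m, nn.modules.batchnorm._BatchNorm)]
    if not bn_layers:
        return  # nothing to update

    was_training = model.training
    saved_momenta = {}
    for bn in bn_layers:
        bn.reset_running_stats()         # zero mean, unit variance, zero batch counter
        saved_momenta[bn] = bn.momentum
        bn.momentum = None               # None selects a cumulative (equal-weight) average

    model.train()                        # BatchNorm only updates statistics in train mode
    for batch in batches:
        model(batch[key])                # forward pass only; outputs are discarded

    for bn in bn_layers:
        bn.momentum = saved_momenta[bn]  # restore the original momentum
    model.train(was_training)            # restore the original train/eval mode


# Toy usage with a hypothetical model and random data.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))
swa_model = AveragedModel(model).eval()
batches = [{"input": torch.randn(8, 4)} for _ in range(5)]
update_bn_from_dict_batches(batches, swa_model)
```

	Gradients are disabled because only the running statistics are needed, and the original momentum and train/eval mode are restored afterwards so the model can be used for inference unchanged.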




	Putting it all together: SWA
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	In the example below, ``swa_model`` is the SWA model that accumulates the averages of the weights.
	We train the model for a total of 300 epochs and we switch to the SWA learning rate schedule
	and start to collect SWA averages of the parameters at epoch 160:

	>>> loader, optimizer, model, loss_fn = ...
	>>> swa_model = torch.optim.swa_utils.AveragedModel(model)
	>>> scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
	>>> swa_start = 160
	>>> swa_scheduler = SWALR(optimizer, swa_lr=0.05)
	>>>
	>>> for epoch in range(300):
	>>>     for input, target in loader:
	>>>         optimizer.zero_grad()
	>>>         loss_fn(model(input), target).backward()
	>>>         optimizer.step()
	>>>     if epoch > swa_start:
	>>>         swa_model.update_parameters(model)
	>>>         swa_scheduler.step()
	>>>     else:
	>>>         scheduler.step()
	>>>
	>>> # Update bn statistics for the swa_model at the end
	>>> torch.optim.swa_utils.update_bn(loader, swa_model)
	>>> # Use swa_model to make predictions on test data
	>>> preds = swa_model(test_input)


	Putting it all together: EMA
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	In the example below, ``ema_model`` is the EMA model that accumulates the exponentially-decayed averages of the weights with a decay rate of 0.999.
	We train the model for a total of 300 epochs and start to collect EMA averages immediately.

	>>> loader, optimizer, model, loss_fn = ...
	>>> ema_model = torch.optim.swa_utils.AveragedModel(model, \
	>>>         multi_avg_fn=torch.optim.swa_utils.get_ema_multi_avg_fn(0.999))
	>>>
	>>> for epoch in range(300):
	>>>     for input, target in loader:
	>>>         optimizer.zero_grad()
	>>>         loss_fn(model(input), target).backward()
	>>>         optimizer.step()
	>>>         ema_model.update_parameters(model)
	>>>
	>>> # Update bn statistics for the ema_model at the end
	>>> torch.optim.swa_utils.update_bn(loader, ema_model)
	>>> # Use ema_model to make predictions on test data
	>>> preds = ema_model(test_input)

	.. autosummary::
	    :toctree: generated
	    :nosignatures:

	    swa_utils.AveragedModel
	    swa_utils.SWALR


	.. autofunction:: torch.optim.swa_utils.get_ema_multi_avg_fn
	.. autofunction:: torch.optim.swa_utils.update_bn


	.. This module needs to be documented. Adding here in the meantime
	.. for tracking purposes
	.. py:module:: torch.optim.adadelta
	.. py:module:: torch.optim.adagrad
	.. py:module:: torch.optim.adam
	.. py:module:: torch.optim.adamax
	.. py:module:: torch.optim.adamw
	.. py:module:: torch.optim.asgd
	.. py:module:: torch.optim.lbfgs
	.. py:module:: torch.optim.lr_scheduler
	.. py:module:: torch.optim.nadam
	.. py:module:: torch.optim.optimizer
	.. py:module:: torch.optim.radam
	.. py:module:: torch.optim.rmsprop
	.. py:module:: torch.optim.rprop
	.. py:module:: torch.optim.sgd
	.. py:module:: torch.optim.sparse_adam
	.. py:module:: torch.optim.swa_utils