mrpro.algorithms.optimizers.adam

mrpro.algorithms.optimizers.adam(f: Operator[Unpack[tuple[Tensor, ...]], tuple[Tensor, ...]], initial_parameters: Sequence[Tensor], max_iter: int, lr: float = 1e-3, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-8, weight_decay: float = 0, amsgrad: bool = False, decoupled_weight_decay: bool = False, callback: Callable[[OptimizerStatus], None] | None = None) → tuple[Tensor, ...]

Adam for non-linear minimization problems.

Adam [KING2015] (Adaptive Moment Estimation) is a first-order optimization algorithm that adapts learning rates for each parameter using estimates of the first and second moments of the gradients.

The parameter update rule is:

\[\begin{split}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{split}\]

where \(g_t\) is the gradient at step \(t\); \(m_t\) and \(v_t\) are biased estimates of the first and second moments; \(\hat{m}_t\) and \(\hat{v}_t\) are their bias-corrected counterparts; \(\eta\) is the learning rate; \(\epsilon\) is a small constant for numerical stability; and \(\beta_1\) and \(\beta_2\) are the decay rates of the moment estimates.

Steps of the Adam algorithm:

  1. Initialize the parameters and the moment estimates (\(m_0 = 0\), \(v_0 = 0\)).

  2. Compute the gradient of the objective function at the current parameters.

  3. Update the biased moment estimates \(m_t\) and \(v_t\) and compute their bias-corrected versions \(\hat{m}_t\) and \(\hat{v}_t\).

  4. Update the parameters using the adaptive step size (a PyTorch sketch of this update follows the list).
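
To make the update rule concrete, here is a minimal PyTorch sketch of a single Adam step, written out directly from the equations above. It is illustrative only and is not the implementation used by this function (which delegates to torch.optim, see below); the name adam_step and its signature are chosen for this example.

```python
import torch


def adam_step(
    theta: torch.Tensor,
    grad: torch.Tensor,
    m: torch.Tensor,
    v: torch.Tensor,
    t: int,  # step counter, starting at 1
    lr: float = 1e-3,
    betas: tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-8,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Perform a single Adam update (illustrative sketch, not mrpro's implementation)."""
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad        # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # biased second moment estimate
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)  # adaptive parameter update
    return theta, m, v
```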

This function wraps PyTorch’s torch.optim.Adam and torch.optim.AdamW implementations, supporting both standard Adam and decoupled weight decay regularization (AdamW) [LOS2019].
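
As a rough sketch (not the actual mrpro source), the wrapping pattern amounts to the following: copies of the initial parameters are optimized by torch.optim.Adam or torch.optim.AdamW through a gradient closure. The helper name adam_minimize and the assumption that f returns a one-element tuple containing a scalar tensor are illustrative.

```python
import torch


def adam_minimize(f, initial_parameters, max_iter, lr=1e-3, decoupled_weight_decay=False):
    """Sketch of wrapping torch.optim.Adam/AdamW for function minimization."""
    # work on copies so the initial values stay untouched
    parameters = [p.detach().clone().requires_grad_(True) for p in initial_parameters]
    optimizer_cls = torch.optim.AdamW if decoupled_weight_decay else torch.optim.Adam
    optimizer = optimizer_cls(parameters, lr=lr)

    def closure():
        optimizer.zero_grad()
        (objective,) = f(*parameters)  # f is assumed to return a one-element tuple
        objective.backward()
        return objective

    for _ in range(max_iter):
        optimizer.step(closure)

    return tuple(p.detach() for p in parameters)


# example: minimize ||x - 3||^2 starting from zeros
(x,) = adam_minimize(lambda x: ((x - 3).square().sum(),), [torch.zeros(2)], max_iter=500, lr=0.1)
```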

References

[KING2015]

Kingma DP, Ba J (2015) Adam: A Method for Stochastic Optimization. ICLR. https://doi.org/10.48550/arXiv.1412.6980

[LOS2019]

Loshchilov I, Hutter F (2019) Decoupled Weight Decay Regularization. ICLR. https://doi.org/10.48550/arXiv.1711.05101

[REDDI2019]

Reddi SJ, Kale S, Kumar S (2019) On the Convergence of Adam and Beyond. ICLR. https://doi.org/10.48550/arXiv.1904.09237

Parameters:
  • f (Operator[Unpack[tuple[Tensor, ...]], tuple[Tensor, ...]]) – scalar-valued function to be optimized

  • initial_parameters (Sequence[Tensor]) – Sequence (for example a list) of parameters to be optimized. These tensors are not modified; the optimizer works on copies, so the initial values remain untouched.

  • max_iter (int) – maximum number of iterations

  • lr (float, default: 1e-3) – learning rate

  • betas (tuple[float, float], default: (0.9, 0.999)) – coefficients used for computing running averages of the gradient and its square

  • eps (float, default: 1e-8) – term added to the denominator to improve numerical stability

  • weight_decay (float, default: 0) – weight decay (L2 penalty if decoupled_weight_decay is False)

  • amsgrad (bool, default: False) – whether to use the AMSGrad variant [REDDI2019]

  • decoupled_weight_decay (bool, default: False) – whether to use Adam (default) or AdamW (if set to True) [LOS2019]

  • callback (Callable[[OptimizerStatus], None] | None, default: None) – function to be called after each iteration

Returns:

tuple of optimized parameters