mrpro.algorithms.optimizers.adam

mrpro.algorithms.optimizers.adam(f: Operator[Unpack[tuple[Tensor, ...]], tuple[Tensor, ...]], initial_parameters: Sequence[Tensor], max_iter: int, lr: float = 1e-3, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-8, weight_decay: float = 0, amsgrad: bool = False, decoupled_weight_decay: bool = False, callback: Callable[[OptimizerStatus], None] | None = None) → tuple[Tensor, ...]

Adam for non-linear minimization problems.

Adam [KING2015] (Adaptive Moment Estimation) is a first-order optimization algorithm that adapts learning rates for each parameter using estimates of the first and second moments of the gradients.

The parameter update rule is:

\[\begin{split}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{split}\]

where \(g_t\) is the gradient at step \(t\); \(m_t\) and \(v_t\) are biased estimates of the first and second moments; \(\hat{m}_t\) and \(\hat{v}_t\) are their bias-corrected counterparts; \(\eta\) is the learning rate; \(\epsilon\) is a small constant for numerical stability; and \(\beta_1\) and \(\beta_2\) are the decay rates of the moment estimates.

Steps of the Adam algorithm:

  1. Initialize the parameters and the moment estimates (\(m_0 = 0\), \(v_0 = 0\)).

  2. Compute the gradient of the objective function at the current parameters.

  3. Update the biased moment estimates \(m_t\) and \(v_t\) and compute their bias-corrected versions \(\hat{m}_t\) and \(\hat{v}_t\).

  4. Update the parameters using the adaptive step size (a PyTorch sketch of this update follows the list).
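
To make the update rule concrete, here is a minimal PyTorch sketch of a single Adam step, written out directly from the equations above. It is illustrative only and is not the implementation used by this function (which delegates to torch.optim, see below); the name adam_step and its signature are chosen for this example.

```python
import torch


def adam_step(
    theta: torch.Tensor,
    grad: torch.Tensor,
    m: torch.Tensor,
    v: torch.Tensor,
    t: int,  # step counter, starting at 1
    lr: float = 1e-3,
    betas: tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-8,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Perform a single Adam update (illustrative sketch, not mrpro's implementation)."""
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad        # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # biased second moment estimate
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)  # adaptive parameter update
    return theta, m, v
```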

This function wraps PyTorch’s torch.optim.Adam and torch.optim.AdamW implementations, supporting both standard Adam and decoupled weight decay regularization (AdamW) [LOS2019].
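
As a rough sketch (not the actual mrpro source), the wrapping pattern amounts to the following: copies of the initial parameters are optimized by torch.optim.Adam or torch.optim.AdamW through a gradient closure. The helper name adam_minimize and the assumption that f returns a one-element tuple containing a scalar tensor are illustrative.

```python
import torch


def adam_minimize(f, initial_parameters, max_iter, lr=1e-3, decoupled_weight_decay=False):
    """Sketch of wrapping torch.optim.Adam/AdamW for function minimization."""
    # work on copies so the initial values stay untouched
    parameters = [p.detach().clone().requires_grad_(True) for p in initial_parameters]
    optimizer_cls = torch.optim.AdamW if decoupled_weight_decay else torch.optim.Adam
    optimizer = optimizer_cls(parameters, lr=lr)

    def closure():
        optimizer.zero_grad()
        (objective,) = f(*parameters)  # f is assumed to return a one-element tuple
        objective.backward()
        return objective

    for _ in range(max_iter):
        optimizer.step(closure)

    return tuple(p.detach() for p in parameters)


# example: minimize ||x - 3||^2 starting from zeros
(x,) = adam_minimize(lambda x: ((x - 3).square().sum(),), [torch.zeros(2)], max_iter=500, lr=0.1)
```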

References

[KING2015]

Kingma DP, Ba J (2015) Adam: A Method for Stochastic Optimization. ICLR. https://doi.org/10.48550/arXiv.1412.6980

[LOS2019]

Loshchilov I, Hutter F (2019) Decoupled Weight Decay Regularization. ICLR. https://doi.org/10.48550/arXiv.1711.05101

[REDDI2019]

Reddi SJ, Kale S, Kumar S (2019) On the Convergence of Adam and Beyond. ICLR. https://doi.org/10.48550/arXiv.1904.09237

Parameters:
  • f (Operator[Unpack[tuple[Tensor, ...]], tuple[Tensor, ...]]) – scalar-valued function to be optimized

  • initial_parameters (Sequence[Tensor]) – Sequence (for example a list) of parameters to be optimized. These tensors are not modified; the optimizer works on copies, so the initial values remain untouched.

  • max_iter (int) – maximum number of iterations

  • lr (float, default: 1e-3) – learning rate

  • betas (tuple[float, float], default: (0.9, 0.999)) – coefficients used for computing running averages of the gradient and its square

  • eps (float, default: 1e-8) – term added to the denominator to improve numerical stability

  • weight_decay (float, default: 0) – weight decay (L2 penalty if decoupled_weight_decay is False)

  • amsgrad (bool, default: False) – whether to use the AMSGrad variant [REDDI2019]

  • decoupled_weight_decay (bool, default: False) – whether to use Adam (default) or AdamW (if set to True) [LOS2019]

  • callback (Callable[[OptimizerStatus], None] | None, default: None) – function to be called after each iteration

Returns:

tuple of optimized parameters