mrpro.algorithms.optimizers.adam
- mrpro.algorithms.optimizers.adam(f: Operator[Unpack[tuple[Tensor, ...]], tuple[Tensor, ...]], initial_parameters: Sequence[Tensor], max_iter: int, lr: float = 1e-3, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-8, weight_decay: float = 0, amsgrad: bool = False, decoupled_weight_decay: bool = False, callback: Callable[[OptimizerStatus], None] | None = None) → tuple[Tensor, ...]
Adam for non-linear minimization problems.
Adam (Adaptive Moment Estimation) [KING2015] is a first-order optimization algorithm that adapts the learning rate for each parameter using estimates of the first and second moments of the gradients.
The parameter update rule is:
\[\begin{split}m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\end{split}\]
where \(g_t\) is the gradient at step \(t\), \(m_t\) and \(v_t\) are biased estimates of the first and second moments, \(\hat{m}_t\) and \(\hat{v}_t\) are their bias-corrected counterparts, \(\eta\) is the learning rate, \(\epsilon\) is a small constant for numerical stability, and \(\beta_1\) and \(\beta_2\) are decay rates for the moment estimates.
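The update rule can be written directly in a few lines of tensor code. The following is a minimal sketch of a single Adam step on one parameter tensor in plain PyTorch, independent of mrpro; the variable names mirror the symbols in the equations above.

```python
import torch


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, mirroring the equations above (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad                  # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * grad**2               # second-moment estimate v_t
    m_hat = m / (1 - beta1**t)                          # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                          # bias-corrected second moment
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)   # parameter update
    return theta, m, v
```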
Steps of the Adam algorithm:
1. Initialize the parameters and moment estimates (\(m_0\), \(v_0\)).
2. Compute the gradient of the objective function.
3. Update the biased first and second moment estimates \(m_t\) and \(v_t\).
4. Compute the bias-corrected estimates \(\hat{m}_t\) and \(\hat{v}_t\).
5. Update the parameters using the adaptive step size.
This function wraps PyTorch’s torch.optim.Adam and torch.optim.AdamW implementations, supporting both standard Adam and decoupled weight decay regularization (AdamW) [LOS2019].
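Conceptually, such a wrapper clones the initial parameters, enables gradients on the copies, and runs a fixed number of optimizer steps. The following is a rough sketch of that pattern in plain PyTorch, not mrpro's actual implementation; it assumes f returns a one-element tuple containing the scalar objective.

```python
import torch


def adam_minimize(f, initial_parameters, max_iter, lr=1e-3, decoupled_weight_decay=False):
    # work on detached copies so the initial values stay untouched
    params = [p.detach().clone().requires_grad_(True) for p in initial_parameters]
    optimizer_cls = torch.optim.AdamW if decoupled_weight_decay else torch.optim.Adam
    optimizer = optimizer_cls(params, lr=lr)
    for _ in range(max_iter):
        optimizer.zero_grad()
        (objective,) = f(*params)   # scalar-valued objective
        objective.backward()
        optimizer.step()
    return tuple(p.detach() for p in params)
```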
References
[KING2015] Kingma DP, Ba J (2015) Adam: A Method for Stochastic Optimization. ICLR. https://doi.org/10.48550/arXiv.1412.6980
[LOS2019] Loshchilov I, Hutter F (2019) Decoupled Weight Decay Regularization. ICLR. https://doi.org/10.48550/arXiv.1711.05101
[REDDI2019] Reddi SJ, Kale S, Kumar S (2019) On the Convergence of Adam and Beyond. ICLR. https://doi.org/10.48550/arXiv.1904.09237
- Parameters:
  - f (Operator[Unpack[tuple[Tensor, ...]], tuple[Tensor, ...]]) – scalar-valued function to be optimized
  - initial_parameters (Sequence[Tensor]) – sequence (for example a list) of parameters to be optimized. These parameters are not changed; a copy is created and the initial values are left untouched.
  - max_iter (int) – maximum number of iterations
  - lr (float, default: 1e-3) – learning rate
  - betas (tuple[float, float], default: (0.9, 0.999)) – coefficients used for computing running averages of the gradient and its square
  - eps (float, default: 1e-8) – term added to the denominator to improve numerical stability
  - weight_decay (float, default: 0) – weight decay (L2 penalty if decoupled_weight_decay is False)
  - amsgrad (bool, default: False) – whether to use the AMSGrad variant [REDDI2019]
  - decoupled_weight_decay (bool, default: False) – whether to use Adam (default) or AdamW (if set to True) [LOS2019]
  - callback (Callable[[OptimizerStatus], None] | None, default: None) – function to be called after each iteration
- Returns:
tuple of optimized parameters
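As a usage illustration, the sketch below minimizes a simple least-squares objective with adam. The LeastSquares class, the data a and b, and the starting point x0 are hypothetical, and the example assumes that mrpro.operators.Operator can be subclassed with a forward that returns a one-element tuple containing the scalar objective.

```python
import torch

from mrpro.algorithms.optimizers import adam
from mrpro.operators import Operator


class LeastSquares(Operator):
    """Hypothetical scalar-valued objective ||A x - b||^2 (for illustration only)."""

    def __init__(self, a: torch.Tensor, b: torch.Tensor) -> None:
        super().__init__()
        self.a = a
        self.b = b

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor]:
        # return a one-element tuple, as expected by the optimizer
        return (((self.a @ x - self.b) ** 2).sum(),)


a = torch.randn(10, 3)
b = torch.randn(10)
x0 = torch.zeros(3)

# run 500 Adam iterations starting from x0; x0 itself is left unchanged
(x_opt,) = adam(LeastSquares(a, b), [x0], max_iter=500, lr=1e-1)
```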