On the Convergence Properties of Stochastic Gradient Descent with Adaptive Momentum
Abstract
Stochastic Gradient Descent (SGD) and its momentum-based variants remain the workhorses of modern deep learning. Despite their empirical success, a rigorous understanding of how adaptive momentum schedules impact convergence under realistic assumptions is still incomplete.
In this work, we study a family of SGD algorithms in which the momentum coefficient is adapted at each iteration based on the observed gradient signal. Specifically, we consider the update rule:

$$m_{t+1} = \beta_t m_t + \nabla f(x_t; \xi_t), \qquad x_{t+1} = x_t - \eta_t m_{t+1},$$

where $\eta_t$ is the learning rate and $\xi_t$ represents the stochastic sample at iteration $t$.
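A minimal sketch of this update in NumPy. The heavy-ball-style recursion matches the rule above; the particular adaptation of $\beta_t$ (scaling momentum by the alignment of consecutive stochastic gradients, capped at a maximum) is an illustrative assumption, not the schedule analyzed in the paper.

```python
import numpy as np

def adaptive_momentum_sgd(grad_fn, x0, eta=0.05, beta_max=0.9, steps=200, seed=0):
    """SGD with an adaptive momentum coefficient:
        m_{t+1} = beta_t * m_t + g_t,   x_{t+1} = x_t - eta * m_{t+1},
    where g_t is a stochastic gradient. The choice of beta_t below
    (momentum grows when consecutive gradients align) is illustrative.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    prev_g = np.zeros_like(x)
    for _ in range(steps):
        # Stochastic gradient: true gradient plus Gaussian noise.
        g = grad_fn(x) + 0.1 * rng.standard_normal(x.shape)
        # Adapt beta_t from the observed gradient signal: cosine alignment
        # of consecutive gradients, clipped to [0, beta_max].
        align = g @ prev_g / (np.linalg.norm(g) * np.linalg.norm(prev_g) + 1e-12)
        beta_t = beta_max * max(float(align), 0.0)
        m = beta_t * m + g
        x = x - eta * m
        prev_g = g
    return x

# Minimize f(x) = 0.5 * ||x||^2, whose gradient is simply x.
x_final = adaptive_momentum_sgd(lambda x: x, x0=np.ones(5))
```

On this toy quadratic the iterates contract toward the minimizer at the origin, up to a noise floor set by the gradient variance.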
Main Results
Theorem 1. Let $f$ be $L$-smooth and bounded below by $f^* > -\infty$. Suppose the stochastic gradients satisfy $\mathbb{E}_{\xi}\big[\|\nabla f(x;\xi) - \nabla f(x)\|^2\big] \le \sigma^2$. If the adaptive momentum satisfies $\beta_t \le \bar{\beta}$ with $\bar{\beta} < 1$, then for a suitably chosen step size $\eta_t$:

$$\min_{1 \le t \le T} \mathbb{E}\big[\|\nabla f(x_t)\|^2\big] \le \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right).$$

This establishes an $\mathcal{O}(1/\sqrt{T})$ rate, matching the known lower bounds for first-order stochastic methods in the non-convex setting.
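For intuition, a standard step-size tuning that yields this rate in smooth non-convex SGD analyses (shown here as an illustrative sketch; the proof of Theorem 1 may use a different choice, and $c$ is a hypothetical constant) balances the initial suboptimality against the gradient noise:

$$\eta = \min\left\{\frac{1}{L},\ \frac{c}{\sigma\sqrt{T}}\right\} \quad\Longrightarrow\quad \frac{f(x_1) - f^*}{\eta T} + L\eta\sigma^2 = \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right),$$

where the first term accounts for the initial function gap and the second for the variance of the stochastic gradients.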
Key Contributions
- We define a general framework for adaptive momentum schedules that includes Adam, AMSGrad, and classical heavy-ball momentum as special cases.
- We provide a tight convergence bound that does not require the bounded gradient assumption commonly used in prior work.
- We validate our theoretical predictions on CIFAR-10 and ImageNet, demonstrating that the bound accurately predicts the relative performance of different momentum schedules.
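To make the framework claim concrete, here is a hedged sketch of how two of the named methods can be cast as momentum schedules $\beta_t$. The interface (a schedule as a function of the gradient history) is a hypothetical illustration, and the Adam-style case is shown only up to a rescaling of the gradient term in the update.

```python
def heavy_ball_beta(grad_history, beta=0.9):
    """Classical heavy-ball: a constant momentum coefficient."""
    return beta

def adam_like_beta(grad_history, beta1=0.9):
    """Adam-style schedule: folding Adam's bias correction into the
    recursion for the corrected first moment gives an effective,
    time-varying coefficient
        beta_t = beta1 * (1 - beta1**(t-1)) / (1 - beta1**t),
    which starts at 0 and approaches beta1 as t grows.
    (The gradient term is also rescaled in Adam; omitted here.)
    """
    t = len(grad_history)
    return beta1 * (1 - beta1 ** (t - 1)) / (1 - beta1 ** t)
```

Under this view, a constant schedule recovers heavy-ball momentum, while the Adam-like schedule warms momentum up from zero toward its asymptotic value.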
Discussion
Our analysis reveals that the interplay between $\eta_t$ and $\beta_t$ is more nuanced than previously understood. In particular, aggressive momentum (high $\beta_t$) coupled with large step sizes can lead to a variance amplification phenomenon where, for a constant coefficient $\beta$ and independent gradient noise of variance $\sigma^2$:

$$\lim_{t \to \infty} \mathrm{Var}(m_t) = \frac{\sigma^2}{1-\beta^2},$$

so the per-iterate displacement variance scales as $\eta^2 \sigma^2 / (1-\beta^2)$ and diverges as $\beta \to 1$.
This suggests that momentum should be decayed in tandem with the learning rate to maintain optimal convergence, a practice commonly observed in well-tuned training pipelines but rarely justified theoretically.
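One simple way to realize such tandem decay is to tie $\beta_t$ proportionally to $\eta_t$. The $1/\sqrt{t}$ decay and the proportional coupling below are illustrative choices, not the paper's prescription; the point is that the amplification factor stays controlled as training proceeds.

```python
import math

def coupled_schedule(t, eta0=0.1, beta0=0.9):
    """Decay the momentum coefficient in tandem with the learning rate.
    Illustrative coupling: eta_t = eta0 / sqrt(t), beta_t = beta0 * eta_t / eta0,
    so beta_t shrinks at the same rate as eta_t and the effective step
    eta_t / (1 - beta_t) decreases over time.
    """
    eta_t = eta0 / math.sqrt(t)
    beta_t = beta0 * eta_t / eta0
    return eta_t, beta_t

# The effective step size eta_t / (1 - beta_t) shrinks as t grows.
steps = [coupled_schedule(t) for t in (1, 4, 100)]
```

With this coupling, early iterations enjoy aggressive momentum while late iterations behave increasingly like plain small-step SGD.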