
On the Convergence Properties of Stochastic Gradient Descent with Adaptive Momentum

Authors: Your Name, Collaborator A, Collaborator B
Venue: NeurIPS 2025
Date: September 15, 2025


Abstract

Stochastic Gradient Descent (SGD) and its momentum-based variants remain the workhorses of modern deep learning. Despite their empirical success, a rigorous understanding of how adaptive momentum schedules impact convergence under realistic assumptions is still incomplete.

In this work, we study a family of SGD algorithms in which the momentum coefficient $\beta_t$ is adapted at each iteration based on the observed gradient signal. Specifically, we consider the update rule:

$$v_{t+1} = \beta_t \, v_t + (1 - \beta_t) \, \nabla f(x_t; \xi_t), \qquad x_{t+1} = x_t - \eta_t \, v_{t+1}$$

where $\eta_t$ is the learning rate and $\xi_t$ denotes the stochastic sample at iteration $t$.
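The update rule above can be sketched directly in NumPy. The toy objective, noise model, and constant schedules below are illustrative choices, not the experimental setup from the paper:

```python
import numpy as np

def adaptive_momentum_sgd(grad_fn, x0, eta_fn, beta_fn, T, rng):
    """Run v_{t+1} = beta_t v_t + (1 - beta_t) g_t,
    x_{t+1} = x_t - eta_t v_{t+1} for T iterations."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for t in range(T):
        g = beta = None
        g = grad_fn(x, rng)              # stochastic gradient grad f(x_t; xi_t)
        beta = beta_fn(t)                # adaptive momentum coefficient beta_t
        v = beta * v + (1.0 - beta) * g  # momentum buffer update
        x = x - eta_fn(t) * v            # parameter update
    return x

# Toy objective: f(x) = 0.5 * ||x||^2, so grad f(x) = x, with additive noise.
def noisy_grad(x, rng, sigma=0.1):
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x_final = adaptive_momentum_sgd(
    noisy_grad,
    x0=np.ones(5),
    eta_fn=lambda t: 0.1,    # constant step size, for illustration only
    beta_fn=lambda t: 0.9,   # constant beta_t; any schedule in [beta_min, beta_max] works
    T=500,
    rng=rng,
)
```

On this quadratic the iterates contract toward a small noise floor, so `x_final` ends up near the minimizer at the origin.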

Main Results

Theorem 1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be $L$-smooth and bounded below by $f^*$. Suppose the stochastic gradients satisfy $\mathbb{E}[\|\nabla f(x_t; \xi_t) - \nabla f(x_t)\|^2] \leq \sigma^2$. If the adaptive momentum satisfies $\beta_t \in [\beta_{\min}, \beta_{\max}]$ with $0 < \beta_{\min} \leq \beta_{\max} < 1$, then for a suitably chosen step size $\eta_t = \mathcal{O}(1/\sqrt{T})$:

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \leq \mathcal{O}\!\left(\frac{L(f(x_0) - f^*)}{\sqrt{T}} + \frac{L \sigma^2}{\sqrt{T}}\right)$$

This establishes an $\mathcal{O}(1/\sqrt{T})$ rate, matching the known lower bounds for first-order stochastic methods in the non-convex setting.
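The rate can be probed numerically: with $\eta_t = 1/\sqrt{T}$ on a simple quadratic, the averaged squared gradient norm should shrink as the horizon $T$ grows. The objective, noise level, and fixed $\beta_t = 0.9$ below are illustrative assumptions, not the paper's experiments:

```python
import numpy as np

def avg_sq_grad_norm(T, sigma=0.1, seed=0):
    """Average ||grad f(x_t)||^2 over T steps of momentum SGD with
    eta = 1/sqrt(T) on f(x) = 0.5 ||x||^2 (so grad f(x) = x)."""
    rng = np.random.default_rng(seed)
    eta = 1.0 / np.sqrt(T)
    beta = 0.9
    x = np.ones(5)
    v = np.zeros_like(x)
    total = 0.0
    for _ in range(T):
        total += float(x @ x)                    # ||grad f(x_t)||^2 = ||x_t||^2
        g = x + sigma * rng.standard_normal(5)   # stochastic gradient
        v = beta * v + (1.0 - beta) * g
        x = x - eta * v
    return total / T

small_T = avg_sq_grad_norm(100)
large_T = avg_sq_grad_norm(10_000)
```

Increasing $T$ by $100\times$ drives the averaged squared gradient norm down, consistent with the $\mathcal{O}(1/\sqrt{T})$ bound, though a toy quadratic cannot of course verify the bound in general.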

Key Contributions

  1. We define a general framework for adaptive momentum schedules that includes Adam, AMSGrad, and classical heavy-ball momentum as special cases.
  2. We provide a tight convergence bound that does not require the bounded gradient assumption commonly used in prior work.
  3. We validate our theoretical predictions on CIFAR-10 and ImageNet, demonstrating that the bound accurately predicts the relative performance of different momentum schedules.

Discussion

Our analysis reveals that the interplay between $\beta_t$ and $\eta_t$ is more nuanced than previously understood. In particular, aggressive momentum (high $\beta_t$) coupled with large step sizes can lead to a variance amplification phenomenon where:

$$\text{Var}(v_{t+1}) \approx \frac{\beta_t^2}{(1 - \beta_t)^2} \cdot \sigma^2$$

This suggests that momentum should be decayed in tandem with the learning rate to maintain optimal convergence, a practice commonly observed in well-tuned training pipelines but rarely justified theoretically.
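One way to realize such a coupled decay is to shrink $\eta_t$ and $\beta_t$ over the same horizon. The specific functional forms below (inverse-square-root step size, linear momentum decay) are hypothetical choices for illustration, not the schedule analyzed in the paper:

```python
import math

def joint_schedule(t, T, eta0=0.1, beta0=0.95, beta_min=0.5):
    """Decay the learning rate as eta0 / sqrt(t + 1) and linearly shrink
    the momentum coefficient from beta0 toward beta_min over T steps,
    so high-momentum steps never coincide with large step sizes late
    in training. Illustrative forms only."""
    eta_t = eta0 / math.sqrt(t + 1)
    frac = t / max(T - 1, 1)                       # training progress in [0, 1]
    beta_t = beta_min + (beta0 - beta_min) * (1.0 - frac)
    return eta_t, beta_t

eta_start, beta_start = joint_schedule(0, 1000)
eta_end, beta_end = joint_schedule(999, 1000)
```

Keeping both schedules tied to the same clock caps the $\beta_t^2 / (1 - \beta_t)^2$ variance factor exactly when the step size is no longer small enough to absorb it.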