
On the Convergence Properties of Stochastic Gradient Descent with Adaptive Momentum

Authors: Your Name, Collaborator A, Collaborator B
Venue: NeurIPS 2025
Date: September 15, 2025


Abstract

Stochastic Gradient Descent (SGD) and its momentum-based variants remain the workhorses of modern deep learning. Despite their empirical success, a rigorous understanding of how adaptive momentum schedules impact convergence under realistic assumptions is still incomplete.

In this work, we study a family of SGD algorithms in which the momentum coefficient $\beta_t$ is adapted at each iteration based on the observed gradient signal. Specifically, we consider the update rule:

$$v_{t+1} = \beta_t \, v_t + (1 - \beta_t) \, \nabla f(x_t; \xi_t), \qquad x_{t+1} = x_t - \eta_t \, v_{t+1}$$

where $\eta_t$ is the learning rate and $\xi_t$ denotes the stochastic sample at iteration $t$.
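The update rule above can be sketched directly in NumPy. The toy objective, noise model, and constant schedules below are illustrative choices, not the experimental setup from the paper:

```python
import numpy as np

def adaptive_momentum_sgd(grad_fn, x0, eta_fn, beta_fn, T, rng):
    """Run v_{t+1} = beta_t v_t + (1 - beta_t) g_t,
    x_{t+1} = x_t - eta_t v_{t+1} for T iterations."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for t in range(T):
        g = beta = None
        g = grad_fn(x, rng)              # stochastic gradient grad f(x_t; xi_t)
        beta = beta_fn(t)                # adaptive momentum coefficient beta_t
        v = beta * v + (1.0 - beta) * g  # momentum buffer update
        x = x - eta_fn(t) * v            # parameter update
    return x

# Toy objective: f(x) = 0.5 * ||x||^2, so grad f(x) = x, with additive noise.
def noisy_grad(x, rng, sigma=0.1):
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x_final = adaptive_momentum_sgd(
    noisy_grad,
    x0=np.ones(5),
    eta_fn=lambda t: 0.1,    # constant step size, for illustration only
    beta_fn=lambda t: 0.9,   # constant beta_t; any schedule in [beta_min, beta_max] works
    T=500,
    rng=rng,
)
```

On this quadratic the iterates contract toward a small noise floor, so `x_final` ends up near the minimizer at the origin.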

Main Results

Theorem 1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be $L$-smooth and bounded below by $f^*$. Suppose the stochastic gradients satisfy $\mathbb{E}[\|\nabla f(x_t; \xi_t) - \nabla f(x_t)\|^2] \leq \sigma^2$. If the adaptive momentum satisfies $\beta_t \in [\beta_{\min}, \beta_{\max}]$ with $0 < \beta_{\min} \leq \beta_{\max} < 1$, then for a suitably chosen step size $\eta_t = \mathcal{O}(1/\sqrt{T})$:

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right] \leq \mathcal{O}\!\left(\frac{L(f(x_0) - f^*)}{\sqrt{T}} + \frac{L \sigma^2}{\sqrt{T}}\right)$$

This establishes an $\mathcal{O}(1/\sqrt{T})$ rate, matching the known lower bounds for first-order stochastic methods in the non-convex setting.
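The rate can be probed numerically: with $\eta_t = 1/\sqrt{T}$ on a simple quadratic, the averaged squared gradient norm should shrink as the horizon $T$ grows. The objective, noise level, and fixed $\beta_t = 0.9$ below are illustrative assumptions, not the paper's experiments:

```python
import numpy as np

def avg_sq_grad_norm(T, sigma=0.1, seed=0):
    """Average ||grad f(x_t)||^2 over T steps of momentum SGD with
    eta = 1/sqrt(T) on f(x) = 0.5 ||x||^2 (so grad f(x) = x)."""
    rng = np.random.default_rng(seed)
    eta = 1.0 / np.sqrt(T)
    beta = 0.9
    x = np.ones(5)
    v = np.zeros_like(x)
    total = 0.0
    for _ in range(T):
        total += float(x @ x)                    # ||grad f(x_t)||^2 = ||x_t||^2
        g = x + sigma * rng.standard_normal(5)   # stochastic gradient
        v = beta * v + (1.0 - beta) * g
        x = x - eta * v
    return total / T

small_T = avg_sq_grad_norm(100)
large_T = avg_sq_grad_norm(10_000)
```

Increasing $T$ by $100\times$ drives the averaged squared gradient norm down, consistent with the $\mathcal{O}(1/\sqrt{T})$ bound, though a toy quadratic cannot of course verify the bound in general.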

Key Contributions

  1. We define a general framework for adaptive momentum schedules that includes Adam, AMSGrad, and classical heavy-ball momentum as special cases.
  2. We provide a tight convergence bound that does not require the bounded gradient assumption commonly used in prior work.
  3. We validate our theoretical predictions on CIFAR-10 and ImageNet, demonstrating that the bound accurately predicts the relative performance of different momentum schedules.

Discussion

Our analysis reveals that the interplay between $\beta_t$ and $\eta_t$ is more nuanced than previously understood. In particular, aggressive momentum (high $\beta_t$) coupled with large step sizes can lead to a variance amplification phenomenon where:

$$\text{Var}(v_{t+1}) \approx \frac{\beta_t^2}{(1 - \beta_t)^2} \cdot \sigma^2$$

This suggests that momentum should be decayed in tandem with the learning rate to maintain optimal convergence, a practice commonly observed in well-tuned training pipelines but rarely justified theoretically.
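One way to realize such a coupled decay is to shrink $\eta_t$ and $\beta_t$ over the same horizon. The specific functional forms below (inverse-square-root step size, linear momentum decay) are hypothetical choices for illustration, not the schedule analyzed in the paper:

```python
import math

def joint_schedule(t, T, eta0=0.1, beta0=0.95, beta_min=0.5):
    """Decay the learning rate as eta0 / sqrt(t + 1) and linearly shrink
    the momentum coefficient from beta0 toward beta_min over T steps,
    so high-momentum steps never coincide with large step sizes late
    in training. Illustrative forms only."""
    eta_t = eta0 / math.sqrt(t + 1)
    frac = t / max(T - 1, 1)                       # training progress in [0, 1]
    beta_t = beta_min + (beta0 - beta_min) * (1.0 - frac)
    return eta_t, beta_t

eta_start, beta_start = joint_schedule(0, 1000)
eta_end, beta_end = joint_schedule(999, 1000)
```

Keeping both schedules tied to the same clock caps the $\beta_t^2 / (1 - \beta_t)^2$ variance factor exactly when the step size is no longer small enough to absorb it.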