Optimizers (SGD · Momentum · Adam)

Foundations · Intermediate · ~8 min

Optimizers are the update rules that drive gradient descent. Stochastic gradient descent steps on noisy mini-batch gradients; momentum accumulates velocity to power through ravines; Adam adapts a per-parameter step size. They differ most on hard surfaces like ravines and saddles.

Plain gradient descent zig-zags through narrow valleys and stalls on plateaus. Momentum gives it inertia like a rolling ball; Adam adapts each parameter’s step size automatically. Race all three down the same surface and the differences are obvious.

[Interactive demo: SGD, Momentum, and Adam race down a selectable surface for 120 iterations. Controls: surface picker and learning-rate slider (shown at 0.080).]

The idea in plain words

Plain gradient descent always steps straight downhill, which zig-zags painfully across a narrow valley. Optimizers change the update rule. Momentum accumulates velocity like a rolling ball, powering through ravines. Adam adapts a separate step size for each direction, so steep and shallow axes both move sensibly.

Race all three down the same surface and the differences are obvious: on a ravine, SGD stutters, Momentum overshoots and recovers, Adam glides. On a saddle, plain SGD can stall where the gradient nearly vanishes.
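The zig-zag described above can be reproduced in a few lines. This is a minimal sketch, not the code behind the demo: the ravine f(x, y) = 0.5 (x² + 25 y²), the starting point, the learning rate 0.06, and the function names are all assumptions chosen for illustration.

```python
def ravine_grad(x, y):
    """Gradient of the hypothetical ravine f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    return x, 25.0 * y

def gradient_descent_path(start, lr, n_steps):
    """Return every point visited by plain gradient descent, start included."""
    x, y = start
    path = [(x, y)]
    for _ in range(n_steps):
        gx, gy = ravine_grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path

# Assumed settings: start on the valley wall, 50 steps, learning rate 0.06.
path = gradient_descent_path(start=(-4.0, 1.0), lr=0.06, n_steps=50)

# Count how often the steep coordinate y jumps to the other side of the valley.
ys = [p[1] for p in path]
sign_flips = sum(1 for a, b in zip(ys, ys[1:]) if a * b < 0)
final_x = path[-1][0]

print("sign flips in y:", sign_flips, "of", len(ys) - 1, "steps")
print("x after 50 steps:", final_x)
```

With these settings the steep coordinate y overshoots the valley floor on every step, while the shallow coordinate x only creeps toward the minimum. That mismatch between the two axes is what momentum and Adam are meant to address.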

Now, the math

Each optimizer transforms the raw gradient before stepping:

\text{SGD:}\quad \theta \leftarrow \theta - \eta\, g

\text{Momentum:}\quad v \leftarrow \beta v + g,\quad \theta \leftarrow \theta - \eta\, v

\text{Adam:}\quad \theta \leftarrow \theta - \eta\, \frac{\hat{m}}{\sqrt{\hat{s}} + \epsilon}

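$g$: the gradient of the loss at the current point.
$\eta$: the learning rate.
$v$: momentum’s velocity, an exponentially decaying sum of past gradients weighted by $\beta$.
$\hat{m},\ \hat{s}$: Adam’s bias-corrected first and second moment estimates (running averages of $g$ and $g^2$); $\epsilon$ is a small constant that prevents division by zero.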
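The three update rules can be written out as plain functions. This is a sketch under stated assumptions: the page does not give Adam's moment updates or its constants, so the code uses the standard formulation with the commonly used defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8), and the function names are illustrative.

```python
import math

def sgd_step(theta, g, lr):
    """theta <- theta - lr * g"""
    return [t - lr * gi for t, gi in zip(theta, g)]

def momentum_step(theta, v, g, lr, beta=0.9):
    """v <- beta * v + g, then theta <- theta - lr * v"""
    v = [beta * vi + gi for vi, gi in zip(v, g)]
    theta = [t - lr * vi for t, vi in zip(theta, v)]
    return theta, v

def adam_step(theta, m, s, g, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based step count used for bias correction.

    Assumed standard form: m and s are running averages of g and g**2.
    """
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    s_hat = [si / (1 - beta2 ** t) for si in s]
    theta = [th - lr * mh / (math.sqrt(sh) + eps)
             for th, mh, sh in zip(theta, m_hat, s_hat)]
    return theta, m, s

# A badly scaled gradient: one steep direction, one nearly flat direction.
g = [100.0, 0.01]
zeros = [0.0, 0.0]
lr = 0.08

sgd_theta = sgd_step(zeros, g, lr)
mom_theta, mom_v = momentum_step(zeros, zeros, g, lr)
adam_theta, adam_m, adam_s = adam_step(zeros, zeros, zeros, g, t=1, lr=lr)

print("SGD     :", sgd_theta)
print("Momentum:", mom_theta)
print("Adam    :", adam_theta)
```

On the first step SGD and momentum move in proportion to the raw gradient, so the two coordinates differ by a factor of 10,000. Adam's first step has magnitude close to the learning rate in both coordinates, because the bias-corrected ratio m̂ / √ŝ is roughly the sign of the gradient at t = 1.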

The derivation

Momentum’s β (here 0.9) gives the velocity an effective memory of about 1/(1 − β) = 10 gradients: side-to-side components that flip sign every step largely cancel, while the consistent downhill component accumulates. Adam divides by √ŝ, so a direction with large gradients gets a smaller effective step and a direction with small gradients gets a larger one, which is why it handles badly scaled surfaces that cripple plain SGD.
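The cancelling and reinforcing effect can be checked numerically by feeding the velocity recurrence two idealized gradient streams. This is an illustrative sketch: the constant and alternating streams and the helper name `velocity_after` are assumptions, not part of the demo.

```python
def velocity_after(gradients, beta=0.9):
    """Apply v <- beta * v + g over a sequence of scalar gradients."""
    v = 0.0
    for g in gradients:
        v = beta * v + g
    return v

n = 200
consistent = [1.0] * n                         # always points downhill
oscillating = [(-1.0) ** k for k in range(n)]  # flips sign every step

v_consistent = velocity_after(consistent)
v_oscillating = velocity_after(oscillating)

print("consistent direction :", v_consistent)        # approaches 1 / (1 - beta)
print("oscillating direction:", abs(v_oscillating))  # approaches 1 / (1 + beta)
```

With β = 0.9 the consistent stream settles near 1/(1 − β) = 10, while the alternating stream settles near 1/(1 + β) ≈ 0.53 in magnitude. The velocity therefore amplifies the along-valley direction about 19 times more than the across-valley one.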

Now Break It

Try this: Raise the learning rate until all three diverge; pick a saddle where plain SGD stalls.

Control: Learning-rate slider (set high) / surface picker (saddle)
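Both failure modes can also be reproduced outside the demo. The sketch below assumes two hypothetical surfaces, a steep bowl f(x) = 5x² and a saddle f(x, y) = x² − y², and uses plain full-gradient descent; the helper `run_gd` is illustrative, and the 120 steps mirror the demo's iteration count.

```python
def run_gd(grad, start, lr, n_steps):
    """Plain gradient descent: p <- p - lr * grad(p), repeated n_steps times."""
    p = list(start)
    for _ in range(n_steps):
        g = grad(p)
        p = [pi - lr * gi for pi, gi in zip(p, g)]
    return p

# Hypothetical steep bowl f(x) = 5 * x**2, gradient 10 * x.
# Each step multiplies x by (1 - 10 * lr), so it diverges once lr > 0.2.
def bowl_grad(p):
    return [10.0 * p[0]]

stable = run_gd(bowl_grad, [1.0], lr=0.08, n_steps=120)
unstable = run_gd(bowl_grad, [1.0], lr=0.25, n_steps=120)

# Hypothetical saddle f(x, y) = x**2 - y**2, gradient (2x, -2y).
# The escape direction is y, but its gradient vanishes on the ridge y = 0.
def saddle_grad(p):
    return [2.0 * p[0], -2.0 * p[1]]

on_ridge = run_gd(saddle_grad, [1.0, 0.0], lr=0.08, n_steps=120)
near_ridge = run_gd(saddle_grad, [1.0, 1e-12], lr=0.08, n_steps=120)

print("bowl, lr=0.08:", stable)
print("bowl, lr=0.25:", unstable)
print("saddle, start on ridge  :", on_ridge)
print("saddle, start near ridge:", near_ridge)
```

In the bowl, the higher learning rate overshoots by more than it corrects, so the iterate grows without bound. On the saddle, a start exactly on the ridge never leaves it, and a start a tiny distance off the ridge is still close to the saddle point after 120 steps because the escaping gradient is so small.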


Last updated July 3, 2026.