Skip to content
ML Visualization

Optimizers (SGD · Momentum · Adam)

FoundationsIntermediate~8 min

Optimizers (SGD · Momentum · Adam)Optimizers are the update rules that drive gradient descent. Stochastic gradient descent steps on noisy mini-batch gradients; momentum accumulates velocity to power through ravines; Adam adapts a per-parameter step size. They differ most on hard surfaces like ravines and saddles.

Plain gradient descent zig-zags through narrow valleys and stalls on plateaus. Momentum gives it inertia like a rolling ball; Adam adapts each parameter’s step size automatically. Race all three down the same surface and the differences are obvious.

  • SGD
  • Momentum
  • Adam
Iteration 0 / 120
Surface
0.080

The idea in plain words

Plain gradient descent always steps straight downhill, which zig-zags painfully across a narrow valley. Optimizers change the update rule. Momentum accumulates velocity like a rolling ball, powering through ravines. Adam adapts a separate step size for each direction, so steep and shallow axes both move sensibly.

Race all three down the same surface and the differences are obvious: on a ravine, SGD stutters, Momentum overshoots and recovers, Adam glides. On a saddle, plain SGD can stall where the gradient nearly vanishes.

Now, the math

Each optimizer transforms the raw gradient before stepping:

SGD:θθηg\text{SGD:}\quad \theta \leftarrow \theta - \eta\, g
Momentum:vβv+g,θθηv\text{Momentum:}\quad v \leftarrow \beta v + g,\quad \theta \leftarrow \theta - \eta\, v
Adam:θθηm^s^+ϵ\text{Adam:}\quad \theta \leftarrow \theta - \eta\, \frac{\hat{m}}{\sqrt{\hat{s}} + \epsilon}
gg
the gradient of the loss at the current point.
vv
momentum’s velocity — an exponential average of past gradients.
m^, s^\hat{m},\ \hat{s}
Adam’s bias-corrected first and second moment estimates.
Show the derivation

Momentum’s β (here 0.9) means each step remembers ~10 previous gradients, cancelling the side-to-side oscillation in a ravine while reinforcing the consistent downhill direction. Adam divides by √ŝ, so a direction with large gradients gets a smaller effective step — which is why it handles badly-scaled surfaces that cripple plain SGD.

Now Break It

Try this: Raise the learning rate until all three diverge; pick a saddle where plain SGD stalls.

Control: Learning-rate slider (set high) / surface picker (saddle)

Last updated .