Optimizers (SGD · Momentum · Adam)
Optimizers are the update rules that drive gradient descent. Stochastic gradient descent (SGD) steps along noisy mini-batch gradients, momentum accumulates velocity to power through ravines, and Adam adapts a per-parameter step size. They differ most on hard surfaces such as ravines and saddles.
Plain gradient descent zig-zags through narrow valleys and stalls on plateaus. Momentum gives it inertia, like a rolling ball; Adam adapts each parameter’s step size automatically.
- SGD
- Momentum
- Adam
The idea in plain words
Plain gradient descent always steps along the negative gradient. In a narrow valley that direction points mostly across the valley rather than along it, so the path zig-zags between the walls and makes slow progress. Optimizers change the update rule. Momentum accumulates velocity like a rolling ball, powering through ravines. Adam adapts a separate step size for each parameter, so steep and shallow axes both move sensibly.
Race all three down the same surface and the differences are obvious: on a ravine, SGD stutters, Momentum overshoots and recovers, Adam glides. On a saddle, plain SGD can stall where the gradient nearly vanishes.
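To make the race concrete, here is a minimal sketch in plain Python. The ravine `loss`, the starting point, the learning rate of 0.03, and Adam's constants (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are illustrative assumptions, not values taken from the interactive demo.

```python
import math

def loss(p):
    # Assumed ravine: shallow along x, 50x steeper along y.
    x, y = p
    return 0.5 * (x * x + 50.0 * y * y)

def grad(p):
    x, y = p
    return [x, 50.0 * y]

def sgd(p, lr, steps):
    p = list(p)
    for _ in range(steps):
        g = grad(p)
        p = [pi - lr * gi for pi, gi in zip(p, g)]
    return p

def momentum(p, lr, steps, beta=0.9):
    p, v = list(p), [0.0, 0.0]
    for _ in range(steps):
        g = grad(p)
        # Velocity is an exponential average of past gradients.
        v = [beta * vi + (1.0 - beta) * gi for vi, gi in zip(v, g)]
        p = [pi - lr * vi for pi, vi in zip(p, v)]
    return p

def adam(p, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    p, m, s = list(p), [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        g = grad(p)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        s = [beta2 * si + (1.0 - beta2) * gi * gi for si, gi in zip(s, g)]
        # Bias correction undoes the pull toward the zero initialisation.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        s_hat = [si / (1.0 - beta2 ** t) for si in s]
        p = [pi - lr * mh / (math.sqrt(sh) + eps)
             for pi, mh, sh in zip(p, m_hat, s_hat)]
    return p

start = (2.0, 1.0)
results = {
    name: loss(fn(start, lr=0.03, steps=200))
    for name, fn in [("SGD", sgd), ("Momentum", momentum), ("Adam", adam)]
}
for name, value in results.items():
    print(f"{name:8s} final loss = {value:.3e}")
```

Changing the 50.0 in `loss` and `grad` makes the ravine narrower or wider. For plain SGD on this quadratic, the learning rate has to stay below 2/50 = 0.04, or the steep axis diverges.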
Now, the math
Each optimizer transforms the raw gradient before stepping:
- g: the gradient of the loss at the current point.
- v: momentum’s velocity, an exponential average of past gradients.
- m̂, ŝ: Adam’s bias-corrected first and second moment estimates.
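Written out, the three update rules might look like the following. The notation is an assumption chosen for illustration: θ for the parameters, η for the learning rate, g_t for the gradient at step t, and β, β₁, β₂, ε for the usual constants. Momentum is shown in its exponential-average form to match the description above; other texts use v_t = βv_{t-1} + g_t instead.

```latex
\begin{align*}
\text{SGD:}\quad      & \theta_{t+1} = \theta_t - \eta\, g_t \\[4pt]
\text{Momentum:}\quad & v_t = \beta\, v_{t-1} + (1-\beta)\, g_t, \qquad
                        \theta_{t+1} = \theta_t - \eta\, v_t \\[4pt]
\text{Adam:}\quad     & m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
                        s_t = \beta_2 s_{t-1} + (1-\beta_2)\, g_t^2 \\
                      & \hat m_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad
                        \hat s_t = \frac{s_t}{1-\beta_2^{\,t}}, \qquad
                        \theta_{t+1} = \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat s_t} + \epsilon}
\end{align*}
```

In Adam, the square, square root, and division are applied element-wise, which is what gives each parameter its own effective step size.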
Momentum’s β (here 0.9) gives the velocity an effective memory of about 1/(1 − β) = 10 recent gradients. Over that window the side-to-side components in a ravine alternate in sign and largely cancel, while the consistent downhill component adds up. Adam divides by √ŝ, so a direction with large gradients gets a smaller effective step, which is why it handles badly scaled surfaces that cripple plain SGD.
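Both claims can be checked numerically. The sketch below assumes the exponential-average form of momentum and feeds it two made-up gradient streams, one steady and one alternating in sign. It then computes Adam's very first step for two gradients that differ by a factor of a million. The helper names `ema` and `adam_first_step` are hypothetical.

```python
import math

beta = 0.9

def ema(grads, beta):
    # Exponential average: v = beta * v + (1 - beta) * g
    v = 0.0
    for g in grads:
        v = beta * v + (1.0 - beta) * g
    return v

horizon = 1.0 / (1.0 - beta)                           # effective memory in steps
steady = ema([1.0] * 50, beta)                         # consistent downhill gradient
zigzag = ema([(-1.0) ** k for k in range(50)], beta)   # side-to-side oscillation

def adam_first_step(g, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # At t = 1 the bias-corrected moments reduce to g and g**2.
    m_hat = (1.0 - beta1) * g / (1.0 - beta1)
    s_hat = (1.0 - beta2) * g * g / (1.0 - beta2)
    return lr * m_hat / (math.sqrt(s_hat) + eps)

big = adam_first_step(1000.0)
small = adam_first_step(0.001)

print(f"horizon = {horizon:.1f} steps")
print(f"steady velocity = {steady:.4f}, zig-zag velocity = {zigzag:.4f}")
print(f"Adam first step: g=1000 -> {big:.6f}, g=0.001 -> {small:.6f}")
```

The steady stream leaves the velocity close to the full gradient, while the alternating stream is mostly cancelled. Adam's first step comes out close to the learning rate for both gradient scales, which is the normalisation by √ŝ at work.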
Now Break It
Try this: Raise the learning rate until all three diverge; pick a saddle where plain SGD stalls.
Control: Learning-rate slider (set high) / surface picker (saddle)
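Both failure modes can be reproduced outside the demo for plain SGD. This is a rough sketch under assumed surfaces: a one-dimensional quadratic with curvature 50, where SGD diverges once the learning rate exceeds 2/50 = 0.04, and the saddle f(x, y) = x² − y², started almost exactly on the ridge. Momentum and Adam have different stability limits and are not covered here.

```python
import math

def sgd_quadratic(x, curvature, lr, steps):
    # f(x) = 0.5 * curvature * x**2, so the gradient is curvature * x.
    for _ in range(steps):
        x = x - lr * curvature * x
    return x

curvature = 50.0
safe = sgd_quadratic(1.0, curvature, lr=0.03, steps=50)      # lr < 2 / curvature
blown_up = sgd_quadratic(1.0, curvature, lr=0.05, steps=50)  # lr > 2 / curvature

def sgd_saddle(p, lr, steps):
    # f(x, y) = x**2 - y**2 has a saddle at the origin; gradient is (2x, -2y).
    x, y = p
    for _ in range(steps):
        x, y = x - lr * 2.0 * x, y + lr * 2.0 * y
    return x, y

# Start almost on the ridge: y is tiny, so the escape direction barely registers.
x, y = sgd_saddle((1.0, 1e-8), lr=0.1, steps=40)
grad_norm = math.hypot(2.0 * x, -2.0 * y)

print(f"lr=0.03 -> |x| = {abs(safe):.2e}, lr=0.05 -> |x| = {abs(blown_up):.2e}")
print(f"saddle after 40 steps: x = {x:.2e}, y = {y:.2e}, |grad| = {grad_norm:.2e}")
```

After 40 steps the point sits next to the saddle with a near-zero gradient and has barely moved along the escape direction y. With a start exactly on the ridge (y = 0), plain SGD would never leave.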