Softmax & Multiclass

Classification · Intermediate · ~6 min

Softmax converts a vector of raw class scores (logits) into probabilities that sum to 1 by exponentiating and normalizing. A temperature parameter sharpens it toward a hard argmax or flattens it toward uniform.

Three or more classes carve the space into colored regions. A bar panel shows raw scores becoming probabilities — and a temperature dial slides softmax from a confident winner-take-all to a flat shrug.

[Interactive figure: colored regions for Class 0, Class 1, and Class 2, with a bar panel of softmax probabilities at the query (71%, 14%, 15% at temperature T = 1.00).]


The idea in plain words

Softmax turns a handful of raw class scores into probabilities that sum to one, by exponentiating and normalizing. It’s the multiclass generalization of the sigmoid, and it powers the output layer of nearly every multiclass neural network classifier.
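To make "exponentiate and normalize" concrete, here is a minimal sketch in plain Python. The scores are made-up values, and the function names are chosen for illustration only. It also checks the sigmoid connection: with two classes and one score pinned at 0, the first softmax probability matches the sigmoid of the other score.

```python
import math

def softmax(scores):
    """Exponentiate each score, then divide by the total so the results sum to 1.

    Naive version for readability: very large scores would overflow math.exp.
    """
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw scores for three classes.
probs = softmax([2.0, 0.5, -1.0])

# Two-class case: softmax([a, 0])[0] equals sigmoid(a).
a = 1.3
two_class = softmax([a, 0.0])
matches_sigmoid = abs(two_class[0] - sigmoid(a)) < 1e-12
```

Larger scores always receive larger probabilities, so the ranking of the classes is unchanged; only the scale becomes probabilistic.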

A temperature dial controls how peaked it is. Near zero it becomes a hard argmax — one class takes everything — so tiny changes flip the winner. Turn it up and the probabilities flatten toward a uniform shrug.

Now, the math

Softmax with temperature T:

$$\text{softmax}(z_i) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$

$z_i$: the raw score (logit) for class i.
$T$: temperature, with $T > 0$; low values sharpen toward argmax, high values flatten toward uniform.
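The formula translates almost line for line into code. The sketch below is one possible implementation, not the one behind this page. It assumes T > 0 and subtracts the largest scaled logit before exponentiating, a common trick that leaves the result unchanged but avoids overflow. The logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(z_i) = exp(z_i / T) / sum_j exp(z_j / T)."""
    if T <= 0:
        raise ValueError("temperature T must be positive")
    scaled = [z / T for z in logits]
    # Shifting by the max cancels in the ratio and keeps exp() from overflowing.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes, viewed at three temperatures.
logits = [2.0, 1.0, 0.1]
for T in (0.1, 1.0, 100.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

Running it shows the same ranking at every temperature, with the leading probability close to 1 at the lowest T and all three close to one third at the highest.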

The derivation

Dividing logits by T before exponentiating rescales the gaps between them. As T → 0 (from above) the largest logit, assuming it is unique, dominates completely and takes probability 1; as T → ∞ all scaled logits approach 0 and the probabilities become equal. This same knob is used to calibrate confidence and to soften targets in model distillation.
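The two limits can be written out step by step. The symbols $K$ (number of classes), $m$ (index of the largest logit), and $p_i(T)$ (the softmax output for class i) are introduced here for the sketch and are not part of the original notation; the first limit assumes the largest logit is unique.

```latex
% Factor out the largest logit z_m; the common factor cancels in the ratio.
p_i(T) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}
       = \frac{e^{(z_i - z_m)/T}}{\sum_j e^{(z_j - z_m)/T}}

% T -> 0+: for j != m the exponent (z_j - z_m)/T -> -infinity, so those
% terms vanish and the denominator tends to e^0 = 1.
\lim_{T \to 0^+} p_i(T) =
  \begin{cases} 1 & i = m \\ 0 & i \neq m \end{cases}

% T -> infinity: every exponent z_j/T -> 0, so every exponential tends to 1.
\lim_{T \to \infty} p_i(T) = \frac{1}{\sum_{j=1}^{K} 1} = \frac{1}{K}
```

If several logits tie for the maximum, the same argument splits the probability evenly among the tied classes as T → 0.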

Now Break It

Try this: Temperature near zero makes the classifier brittle — the winner flips on tiny changes.

Control: Temperature slider (set very low)
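The same experiment can be run offline. The sketch below uses two hypothetical, nearly tied logits and nudges one of them by 0.02, then compares how far class 0's probability moves at an ordinary temperature and at a very low one. The logits, the nudge size, and the helper names are all assumptions made for illustration.

```python
import math

def softmax_with_temperature(logits, T):
    """Temperature softmax, shifted by the max scaled logit to avoid overflow."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: class 1 barely leads class 0.
base = [1.00, 1.01, 0.20]
# The same logits after a +0.02 nudge to class 0, which now barely leads.
nudged = [1.02, 1.01, 0.20]

def class0_swing(T):
    """Change in class 0's probability caused by the nudge at temperature T."""
    before = softmax_with_temperature(base, T)[0]
    after = softmax_with_temperature(nudged, T)[0]
    return after - before

swing_warm = class0_swing(1.0)    # ordinary temperature
swing_cold = class0_swing(0.001)  # near-argmax regime

print(f"T=1.0:   class 0 probability moves by {swing_warm:.4f}")
print(f"T=0.001: class 0 probability moves by {swing_cold:.4f}")
```

At T = 1 the nudge barely registers, while at T = 0.001 class 0 jumps from almost no probability to almost all of it: the brittleness described above.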


Last updated July 3, 2026.