Feature Scaling

Data Prep & Model Evaluation · Beginner · ~5 min

Feature Scaling: put features on the same scale so that no single feature dominates.

If one feature ranges 0–1 and another 0–10,000, distance- and gradient-based models get dominated by the big one. Scaling puts every feature on equal footing.
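To make the dominance concrete, here is a minimal sketch with three made-up points. The data, the `euclidean` helper, and the choice to divide each column by its assumed known range are all illustrative assumptions, not something taken from the visualization.

```python
import numpy as np

# Hypothetical data: column 0 lies in 0-1, column 1 in 0-10,000.
X = np.array([
    [0.1, 5000.0],  # query point
    [0.9, 5010.0],  # far from the query in the small feature, close in the big one
    [0.1, 5200.0],  # identical in the small feature, farther in the big one
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances: the 0-10,000 column decides almost everything.
raw_d1 = euclidean(X[0], X[1])
raw_d2 = euclidean(X[0], X[2])

# Divide each column by its (assumed known) range so both span about 0-1.
ranges = np.array([1.0, 10_000.0])
Xs = X / ranges
scaled_d1 = euclidean(Xs[0], Xs[1])
scaled_d2 = euclidean(Xs[0], Xs[2])

print("raw:   ", raw_d1, raw_d2)
print("scaled:", scaled_d1, scaled_d2)
```

On the raw data the second point looks nearest to the query, because a gap of 10 units in the large feature outweighs a gap spanning most of the small feature's range. After rescaling, the third point is nearest.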

[Interactive figure: gradient descent on unscaled contours (26 steps) versus scaled contours (16 steps), with a feature-scale mismatch of 6×.]

The identical algorithm with the same learning rate zig-zags hopelessly on the skewed (unscaled) contours but walks straight to the minimum on the circular (scaled) ones. Crank the mismatch and the unscaled side effectively can’t converge. Drag the highlighted start dot to relaunch descent from anywhere.


The idea in plain words

If one feature ranges 0–1 and another 0–10,000, the big one dominates any distance- or gradient-based model. Scaling puts every feature on equal footing. On the loss surface, unscaled features make skewed, stretched contours; scaled features make near-circular ones.

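The identical gradient descent zig-zags hopelessly across the skewed valley but walks straight down the circular one: same math, opposite outcome. The same imbalance is why scaling matters for kNN, whose distances are dominated by the large-range feature, and for PCA, whose leading components follow the highest-variance feature.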
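The zig-zag can be reproduced on a toy quadratic bowl. This is a sketch under stated assumptions: the loss `f(w) = 0.5 * sum(c_i * w_i**2)`, the curvature values, the learning rate, and the `gd_steps` helper are all hypothetical choices for illustration, and the step counts will not match the ones shown in the visualization.

```python
import numpy as np

def gd_steps(curvatures, lr, start=(1.0, 1.0), tol=1e-3, max_steps=10_000):
    """Count gradient-descent steps on f(w) = 0.5 * sum(c_i * w_i**2).

    Returns the number of steps until ||w|| < tol, or None if the
    iterates blow up (divergence) or max_steps is reached.
    """
    c = np.asarray(curvatures, dtype=float)
    w = np.asarray(start, dtype=float)
    for step in range(max_steps):
        norm = np.linalg.norm(w)
        if norm < tol:
            return step
        if norm > 1e6:
            return None  # diverged
        w = w - lr * c * w  # the gradient of f at w is c * w
    return None

lr = 0.5  # same learning rate for every run

# Circular bowl: equal curvature in both directions.
circular_steps = gd_steps((1.0, 1.0), lr)

# Skewed bowl: curvature ratio 9. The steep coordinate is multiplied by
# (1 - 0.5 * 3) = -0.5 each step, so it flips sign (the zig-zag), while
# the flat coordinate shrinks slowly by a factor of about 0.83.
skewed_steps = gd_steps((1.0 / 3.0, 3.0), lr)

# Larger mismatch: curvature ratio 36. The steep coordinate is multiplied
# by (1 - 0.5 * 6) = -2 each step, so the run diverges.
cranked_steps = gd_steps((1.0 / 6.0, 6.0), lr)

print(circular_steps, skewed_steps, cranked_steps)
```

The skewed run needs several times as many steps as the circular one, and the run with the larger mismatch never converges at this learning rate.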

Now, the math

Standardization rescales each feature to zero mean and unit variance:

$$z = \frac{x - \mu}{\sigma}$$

$\mu$: the feature’s mean.
$\sigma$: its standard deviation.
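A minimal NumPy sketch of the formula, applied column by column. The `standardize` helper and the sample matrix are hypothetical; the population standard deviation (`ddof=0`) is assumed here, and the guard for constant columns is an added convenience rather than part of the formula.

```python
import numpy as np

def standardize(X):
    """Apply z = (x - mu) / sigma to each column of X.

    Returns the standardized array together with the fitted mu and sigma.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population standard deviation (ddof=0)
    # A constant column has sigma == 0; use 1.0 there to avoid dividing by zero.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma, mu, sigma

# Hypothetical data: one feature in 0-1, one in 0-10,000.
X = np.array([
    [0.2, 1500.0],
    [0.5, 9000.0],
    [0.9, 4200.0],
    [0.4, 300.0],
])

Z, mu, sigma = standardize(X)
print(Z.mean(axis=0))  # each column mean is approximately 0
print(Z.std(axis=0))   # each column standard deviation is approximately 1
```

In practice, `mu` and `sigma` are usually computed on the training data only and then reused to transform validation and test data, which is why the sketch returns them.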

The derivation

The convergence speed of gradient descent depends on the condition number of the loss: the ratio of its largest to its smallest curvature. Unequal feature scales inflate that ratio. The learning rate must stay small enough for the steep axis, which leaves progress along the flat axis slow. Standardizing brings the curvatures much closer together (correlated features can still leave some mismatch), so a single learning rate works well in every direction.
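For a least-squares loss, the curvature is given by the matrix `X^T X / n`, so the condition number can be checked directly. The sketch below assumes that loss, uses synthetic independent features, and introduces a hypothetical `hessian_condition_number` helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical independent features: one in about 0-1, one in about 0-10,000.
X = np.column_stack([
    rng.uniform(0, 1, n),
    rng.uniform(0, 10_000, n),
])

def hessian_condition_number(X):
    """Condition number of the least-squares Hessian X^T X / n (features centered)."""
    Xc = X - X.mean(axis=0)
    H = Xc.T @ Xc / len(Xc)
    return float(np.linalg.cond(H))

# Standardize each column: zero mean, unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = hessian_condition_number(X)
cond_scaled = hessian_condition_number(Z)
print(f"raw: {cond_raw:.3g}, standardized: {cond_scaled:.3g}")
```

The raw condition number is enormous because the curvatures scale with the squared feature ranges, while the standardized one sits close to 1. With strongly correlated features the standardized value would stay above 1, since standardizing fixes only the per-feature scale.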

Now Break It

Try this: turn scaling off and watch distance-based methods fixate on the large-magnitude feature.

Control: Scaling toggle (turn off)


Last updated July 3, 2026.