Skip to content
ML Visualization

Feature Scaling

Data Prep & Model EvaluationBeginner~5 min

Feature ScalingPut features on the same scale so no one dominates.

If one feature ranges 0–1 and another 0–10,000, distance- and gradient-based models get dominated by the big one. Scaling puts every feature on equal footing.

Unscaled — 26 steps
Scaled — 16 steps

The identical algorithm with the same learning rate zig-zags hopelessly on the skewed (unscaled) contours but walks straight to the minimum on the circular (scaled) ones. Crank the mismatch and the unscaled side effectively can’t converge. Drag the highlighted start dot to relaunch descent from anywhere.

The idea in plain words

If one feature ranges 0–1 and another 0–10,000, the big one dominates any distance- or gradient-based model. Scaling puts every feature on equal footing. On the loss surface, unscaled features make skewed, stretched contours; scaled features make near-circular ones.

The identical gradient descent zig-zags hopelessly across the skewed valley but walks straight down the circular one — same math, opposite outcome. It’s why scaling matters for kNN and PCA.

Now, the math

Standardization rescales each feature to zero mean and unit variance:

z=xμσz = \frac{x - \mu}{\sigma}
μ\mu
the feature’s mean.
σ\sigma
its standard deviation.
Show the derivation

The convergence speed of gradient descent depends on the condition number of the loss (the ratio of largest to smallest curvature). Unequal feature scales inflate that ratio, forcing tiny steps along the steep axis; standardizing equalizes the curvatures, so a single learning rate works in every direction.

Now Break It

Try this: Unscaled features make distance-based methods obsess over the large-magnitude feature.

Control: Scaling toggle (turn off)

Last updated .