Feature Scaling
Put features on the same scale so that no single feature dominates.
If one feature ranges 0–1 and another 0–10,000, distance- and gradient-based models get dominated by the big one. Scaling puts every feature on equal footing.
The identical algorithm with the same learning rate zig-zags hopelessly on the skewed (unscaled) contours but walks straight to the minimum on the circular (scaled) ones. Crank the mismatch and the unscaled side effectively can’t converge. Drag the highlighted start dot to relaunch descent from anywhere.
The idea in plain words
If one feature ranges 0–1 and another 0–10,000, the big one dominates any distance- or gradient-based model. Scaling puts every feature on equal footing. On the loss surface, unscaled features make skewed, stretched contours; scaled features make near-circular ones.
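The same gradient descent, run with the same learning rate, zig-zags across the skewed valley but walks almost straight down the circular one: same math, opposite outcome. The same imbalance is why scaling matters for kNN and PCA, since kNN's distances and PCA's directions of largest variance are both dominated by whichever feature has the largest numeric range.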
Now, the math
Standardization rescales each feature to zero mean and unit variance:
$$z = \frac{x - \mu}{\sigma}$$
- $\mu$: the feature’s mean.
- $\sigma$: its standard deviation.
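As a minimal sketch of standardization (subtract the feature's mean, divide by its standard deviation), the snippet below uses only Python's standard library. The function name `standardize` and the two sample features are hypothetical, chosen for illustration. Using the population standard deviation and mapping a constant feature to zeros are assumptions of this sketch, not something the text prescribes.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores (x - mean) / std for one feature column."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption of this sketch)
    if sigma == 0:
        # A constant feature has no spread; map it to zeros rather than divide by zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical features on very different scales.
small = [0.1, 0.4, 0.5, 0.9]              # roughly 0 to 1
large = [1200.0, 4500.0, 5200.0, 9800.0]  # roughly 0 to 10,000

z_small = standardize(small)
z_large = standardize(large)
```

After the transform, both columns have mean 0 and standard deviation 1 up to floating-point error, so neither outweighs the other purely because of its units.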
The convergence speed of gradient descent depends on the condition number of the loss (the ratio of largest to smallest curvature). Unequal feature scales inflate that ratio, forcing tiny steps along the steep axis; standardizing equalizes the curvatures, so a single learning rate works in every direction.
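The effect of the condition number can be illustrated on a toy quadratic loss, f(w) = 0.5 * (c1 * w1^2 + c2 * w2^2), where c1 and c2 are the curvatures along each axis. This is a sketch under stated assumptions: the function name `steps_to_converge`, the curvature values, the starting point, and the tolerance are all hypothetical, and the learning rate is set to 1 / (largest curvature) as one common safe choice.

```python
import math


def steps_to_converge(curvatures, start=(1.0, 1.0), tol=1e-6, max_steps=100_000):
    """Run gradient descent on f(w) = 0.5 * sum(c_i * w_i**2); return the step count."""
    lr = 1.0 / max(curvatures)  # step size limited by the steepest axis
    w = list(start)
    for step in range(max_steps):
        if math.hypot(w[0], w[1]) < tol:
            return step
        # The gradient along axis i is c_i * w_i.
        w = [wi - lr * c * wi for wi, c in zip(w, curvatures)]
    return max_steps


scaled_steps = steps_to_converge((1.0, 1.0))      # condition number 1
unscaled_steps = steps_to_converge((1.0, 100.0))  # condition number 100
```

With equal curvatures, the single step size suits both axes and the minimum is reached almost immediately. With a 100:1 ratio, the step that is safe for the steep axis is tiny for the flat one, so progress along the flat axis takes over a thousand steps in this setup.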
Now Break It
Try this: Unscaled features make distance-based methods obsess over the large-magnitude feature.
Control: Scaling toggle (turn off)
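The same experiment can be sketched offline with a nearest-neighbor lookup. The four labeled points and the query below are made-up values, and the helper names `nearest` and `scale` are hypothetical. The first feature lies in roughly 0 to 1 and the second in roughly 0 to 10,000.

```python
import math
from statistics import fmean, pstdev

# Hypothetical data: (small-range feature, large-range feature).
points = {
    "A": (0.12, 5600.0),  # close to the query on the small feature
    "B": (0.95, 5100.0),  # far on the small feature, close on the large one
    "C": (0.50, 1000.0),
    "D": (0.80, 9000.0),
}
query = (0.10, 5000.0)


def nearest(candidates, target):
    """Return the name of the candidate with the smallest Euclidean distance."""
    return min(candidates, key=lambda name: math.dist(candidates[name], target))


# Per-feature mean and standard deviation, computed from the stored points only.
columns = list(zip(*points.values()))
means = [fmean(col) for col in columns]
stds = [pstdev(col) for col in columns]


def scale(p):
    """Standardize a point with the statistics computed above."""
    return tuple((x - m) / s for x, m, s in zip(p, means, stds))


scaled_points = {name: scale(p) for name, p in points.items()}

nearest_raw = nearest(points, query)                      # scaling "off"
nearest_scaled = nearest(scaled_points, scale(query))     # scaling "on"
```

Without scaling, the lookup picks B, because a gap of 100 on the large feature outweighs any difference on the small one. After standardization it picks A, which is close on both features once they are measured in comparable units.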