ML Visualization

Ridge Regression (L2)

Regression · Intermediate · ~7 min

Shrink coefficients toward zero to reduce variance.

Ridge regression adds a penalty for large coefficients. It gently shrinks every weight toward zero, trading a little bias for a big drop in variance — taming wild overfit models.

[Interactive figure: coefficient path for x1 through x6, drag to set λ; the L2 constraint is drawn as a circle. Displayed value: 0.100. Largest |w|: 1.786. Non-zero coefficients: 6 / 6.]

The idea in plain words

Ridge regression adds a price for large coefficients. Instead of only minimizing error, it minimizes error plus the summed squares of the weights, so the fit trades a little bias for a big drop in variance — taming the wild swings a high-degree fit is prone to.

Turn the penalty λ up and the coefficients shrink smoothly toward zero, but for any finite λ none reaches it exactly. Geometrically, the round L2 constraint has no corners for a coefficient to snap to, which is the crucial difference from lasso.
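The shrinking behaviour can be reproduced numerically. The sketch below is illustrative only: the synthetic data, the `true_w` values, and the helper name `ridge_weights` are assumptions, not taken from the visualization. It solves the closed-form ridge system for a range of λ values and reports the weight-vector length and the number of non-zero coefficients.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (an assumption for illustration): six features, like x1..x6.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
true_w = np.array([2.0, -1.5, 1.0, 0.5, -0.5, 0.25])
y = X @ true_w + rng.normal(scale=0.5, size=50)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
path = np.array([ridge_weights(X, y, lam) for lam in lambdas])

for lam, w in zip(lambdas, path):
    print(f"lambda={lam:8.1f}  ||w||={np.linalg.norm(w):.4f}  "
          f"non-zero={np.count_nonzero(w)} / {w.size}")
```

On data like this, the length of the weight vector should fall as λ grows while the non-zero count stays at 6 / 6, matching the readout in the figure.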

Now, the math

Ridge minimizes squared error plus an L2 penalty on the weights:

$$J = \tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j w_j^2$$

This still has a closed form, a nudged normal equation (with the factor n from the 1/n average absorbed into λ):

$$\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$$

  • $\lambda$: the regularization strength, i.e. how hard large weights are penalized.
  • $\sum_j w_j^2$: the L2 penalty, the squared length of the weight vector.
  • $\lambda I$: the ridge added to the diagonal, which also fixes ill-conditioning.
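As a hedged sketch of the closed form, the snippet below solves the ridge system with NumPy and checks that the gradient of the penalized loss vanishes at the solution. The function names and random data are hypothetical. The loss here uses the summed (not averaged) squared error so that `lam` matches the closed form directly; with the 1/n-averaged error in J, the same weights correspond to a penalty of `lam / n`.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return w solving (X^T X + lam * I) w = X^T y.

    np.linalg.solve is used rather than forming the inverse explicitly.
    """
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def ridge_gradient(X, y, w, lam):
    """Gradient of ||y - X w||^2 + lam * ||w||^2 with respect to w."""
    return -2.0 * X.T @ (y - X @ w) + 2.0 * lam * w

# Random data, purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
lam = 0.5

w = ridge_closed_form(X, y, lam)
grad = ridge_gradient(X, y, w, lam)
print("w =", w)
print("max |gradient| at w =", np.max(np.abs(grad)))
```

Solving the linear system is generally preferred over computing the inverse, since it is cheaper and numerically better behaved.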

Adding λI to XᵀX shifts every eigenvalue up by λ, so directions of low data variance (small eigenvalues, the source of instability) are damped most. As λ → ∞ the weights collapse toward the all-zero vector; with an unpenalized intercept the model then predicts the mean of y for every input, which is the failure you can drive with the slider.
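The eigenvalue shift can be checked directly. This sketch assumes a deliberately ill-conditioned toy design with two nearly collinear columns (the data and variable names are illustrative). It compares the eigenvalues and condition number before and after adding λI, and computes the per-direction shrinkage factor s / (s + λ), where s is an eigenvalue of XᵀX.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two nearly collinear columns make X^T X ill-conditioned (illustrative data).
x1 = rng.normal(size=100)
x2 = x1 + 1e-4 * rng.normal(size=100)
X = np.column_stack([x1, x2])

lam = 0.1
gram = X.T @ X
eig_plain = np.linalg.eigvalsh(gram)                    # ascending order
eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(2))  # ascending order

# Fraction of the least-squares solution kept along each eigen-direction.
shrink = eig_plain / (eig_plain + lam)

cond_before = np.linalg.cond(gram)
cond_after = np.linalg.cond(gram + lam * np.eye(2))

print("eigenvalues of X^T X:        ", eig_plain)
print("eigenvalues of X^T X + lam*I:", eig_ridge)
print("condition number before:", cond_before)
print("condition number after: ", cond_after)
print("shrinkage factors:", shrink)
```

The direction with the tiny eigenvalue keeps almost none of its least-squares component, while the well-determined direction is barely touched.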

Now Break It

Try this: Enormous λ crushes every coefficient to near zero — the model becomes a flat line ignoring the data.

Control: Lambda slider (set to maximum)
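The same failure can be reproduced offline. The sketch below assumes an unpenalized intercept, handled by centering the data before solving; that choice, the helper name `fit_ridge_with_intercept`, and the synthetic data are all assumptions for illustration. As λ grows, the weights approach zero and every prediction approaches the mean of y.

```python
import numpy as np

def fit_ridge_with_intercept(X, y, lam):
    """Ridge on centered data; the intercept itself is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data, purely for illustration.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
y = 5.0 + X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.3, size=60)

for lam in [0.0, 1.0, 1e3, 1e9]:
    w, b = fit_ridge_with_intercept(X, y, lam)
    pred = X @ w + b
    print(f"lambda={lam:.0e}  max|w|={np.max(np.abs(w)):.2e}  "
          f"prediction spread={pred.max() - pred.min():.2e}")
```

Without the intercept (or without centering), the predictions would collapse toward zero rather than toward the mean of y.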
