Loss Functions
A loss function quantifies the disagreement between a model’s predictions and the true values as a single number. Training minimizes this number; the choice of loss (such as MSE or MAE) determines how errors are penalized.
How do you measure how wrong your model is? That’s what a loss function does — it takes the gap between prediction and reality and turns it into a single number. Smaller is better.
[Interactive plot: data points, the prediction line, and the residuals between them.]
Drag a data point far away to create an outlier — watch MSE explode while MAE barely moves.
The idea in plain words
A loss function scores how wrong the model is, as a single number to minimize. Drag the prediction line and watch both scores move. Then drag one point far away: mean squared error explodes while mean absolute error barely flinches — because squaring turns a big miss into a huge one.
Once you can measure “wrong,” gradient descent minimizes it, and linear regression is the classic model built on squared-error loss.
Now, the math
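Two common regression losses over n points:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
- $y_i$: the actual value of point i.
- $\hat{y}_i$: the model’s prediction for point i.
- $n$: the number of points.
- $\sum$: add up over all points.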
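The two definitions translate almost directly into code. Below is a minimal sketch in plain Python; the function names `mse` and `mae` and the sample values are illustrative assumptions, not something taken from this page.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


def mae(y_true, y_pred):
    """Mean absolute error: the average of the absolute residuals."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n


# Hypothetical data: four actual values and four predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
# Residuals (actual minus predicted): 1, 0, -1, -2

print(mse(y_true, y_pred))  # (1 + 0 + 1 + 4) / 4 = 1.5
print(mae(y_true, y_pred))  # (1 + 0 + 1 + 2) / 4 = 1.0
```

Both functions return one number for the whole dataset, which is what makes them usable as a quantity to minimize.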
Because MSE squares each error, an error of 6 contributes 36 while an error of 1 contributes 1 — so a single far-off outlier dominates the total. MAE grows only linearly, making it far more robust to outliers. The right choice depends on whether large errors should be punished disproportionately.
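To put numbers on this, here is a small sketch that reuses the errors of 1 and 6 from the paragraph above. The five residuals are made up for illustration, and the helper names are assumptions rather than part of any library.

```python
def mse(residuals):
    """Mean of the squared residuals."""
    return sum(r ** 2 for r in residuals) / len(residuals)


def mae(residuals):
    """Mean of the absolute residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)


# Hypothetical residuals (actual minus predicted) for five points.
clean = [1.0, -1.0, 1.0, -1.0, 1.0]
with_outlier = [1.0, -1.0, 1.0, -1.0, 6.0]  # the last point is now a big miss

print(mse(clean), mae(clean))                # 1.0 1.0
print(mse(with_outlier), mae(with_outlier))  # 8.0 2.0

# Share of the total loss that comes from the outlier alone.
share_in_mse = 6.0 ** 2 / sum(r ** 2 for r in with_outlier)  # 36 / 40
share_in_mae = abs(6.0) / sum(abs(r) for r in with_outlier)  # 6 / 10
print(share_in_mse, share_in_mae)            # 0.9 0.6
```

With one outlier, MSE grows eightfold while MAE only doubles, and that single point accounts for 90% of the squared total but 60% of the absolute total.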
Now Break It
Try this: One extreme outlier explodes MSE but barely moves MAE — showing why loss function choice matters.
Control: Drag an outlier point far from the cluster