Loss Functions

Foundations · Beginner · ~5 min

Loss Functions — A loss function quantifies the disagreement between a model’s predictions and the true values as a single number. Training minimizes this number; the choice of loss (such as MSE or MAE) determines how errors are penalized.

How do you measure how wrong your model is? That’s what a loss function does — it takes the gap between prediction and reality and turns it into a single number. Smaller is better.

[Interactive plot: data points, a draggable prediction line, and the residuals between them. Initial readout: Prediction 6.00, MSE (mean squared error) 0.77, MAE (mean absolute error) 0.75.]

Drag a data point far away to create an outlier — watch MSE explode while MAE barely moves.

The idea in plain words

A loss function scores how wrong the model is, as a single number to minimize. Drag the prediction line and watch both scores move. Then drag one point far away: mean squared error explodes while mean absolute error barely flinches — because squaring turns a big miss into a huge one.

Once you can measure “wrong,” gradient descent minimizes it, and linear regression is the classic model built on squared-error loss.

Now, the math

Two common regression losses over n points:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\bigl|y_i - \hat{y}_i\bigr|$$

$y_i$: the actual value of point $i$.
$\hat{y}_i$: the model’s prediction for point $i$.
$n$: the number of points.
$\sum$: add up over all points.
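
To make the two formulas concrete, here is a minimal sketch in plain Python. The function names `mse` and `mae` and the four sample points are illustrative assumptions, not values taken from the interactive plot.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: the average of the absolute residuals."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: four points and a flat prediction of 6.0 for each.
y_true = [5.0, 6.0, 7.0, 6.0]
y_pred = [6.0, 6.0, 6.0, 6.0]

print(mse(y_true, y_pred))  # 0.5  (residuals -1, 0, 1, 0 -> squares sum to 2, divided by 4)
print(mae(y_true, y_pred))  # 0.5  (absolute residuals sum to 2, divided by 4)
```

When every residual is 0 or ±1, squaring changes nothing, so the two losses agree. They only diverge once some residuals are larger than 1 in magnitude.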

Because MSE squares each error, an error of 6 contributes 36 to the sum while an error of 1 contributes 1, so a single far-off outlier dominates the total. MAE grows only linearly (the same two errors contribute 6 and 1), which makes it far less sensitive to outliers. The right choice depends on whether large errors should be punished disproportionately.
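
The outlier effect can be checked with a few lines of Python. This is a rough sketch on made-up residuals (five errors of size 1, then the same set with one error replaced by 6); the helper names are hypothetical and the numbers are not the ones shown in the plot.

```python
def mse_from_residuals(residuals):
    """Average of squared residuals."""
    return sum(r ** 2 for r in residuals) / len(residuals)


def mae_from_residuals(residuals):
    """Average of absolute residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)


# Hypothetical residuals (actual minus predicted) for five points.
clean = [1.0, -1.0, 1.0, -1.0, 1.0]
with_outlier = [1.0, -1.0, 1.0, -1.0, 6.0]  # last point dragged far away

print(mse_from_residuals(clean), mae_from_residuals(clean))                # 1.0 1.0
print(mse_from_residuals(with_outlier), mae_from_residuals(with_outlier))  # 8.0 2.0

# Share of each loss's total that comes from the single outlier.
outlier_share_mse = 6.0 ** 2 / sum(r ** 2 for r in with_outlier)
outlier_share_mae = abs(6.0) / sum(abs(r) for r in with_outlier)
print(outlier_share_mse, outlier_share_mae)  # 0.9 0.6
```

In this toy case one bad point multiplies MSE by 8 but MAE only by 2, and it accounts for 90% of the squared-error total versus 60% of the absolute-error total.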

Now Break It

Try this: One extreme outlier explodes MSE but barely moves MAE — showing why loss function choice matters.

Control: Drag an outlier point far from the cluster

Last updated July 3, 2026.