Backpropagation

Neural Networks · Advanced · ~10 min

Backpropagation — Propagate error gradients backward to update every weight.

Backpropagation is how networks learn. It sends the output error backward through the layers using the chain rule, computing how much each weight contributed to the mistake. Gradient descent then nudges every weight a small step in the direction that reduces that error.
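To make the loop concrete, here is a minimal sketch on the smallest network that still needs the chain rule: one input, one sigmoid hidden unit, one linear output. The starting weights, learning rate, input, target, and squared-error loss are all assumptions picked for illustration, not values taken from the visualization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w1, w2, x, target, lr=0.5):
    """One forward pass, one backward pass, one gradient-descent update."""
    # Forward pass: input -> hidden (sigmoid) -> output (linear).
    h = sigmoid(w1 * x)
    y = w2 * h
    loss = 0.5 * (y - target) ** 2

    # Backward pass: chain rule, starting at the output and moving back.
    dloss_dy = y - target                      # error at the output
    dloss_dw2 = dloss_dy * h                   # how much w2 contributed
    dloss_dh = dloss_dy * w2                   # error sent back to the hidden unit
    dloss_dw1 = dloss_dh * h * (1.0 - h) * x   # through the sigmoid, then to w1

    # Update: nudge each weight against its gradient.
    return w1 - lr * dloss_dw1, w2 - lr * dloss_dw2, loss

def train(steps=200):
    w1, w2 = 0.5, -0.3   # arbitrary starting weights (assumption)
    losses = []
    for _ in range(steps):
        w1, w2, loss = train_step(w1, w2, x=1.0, target=0.8)
        losses.append(loss)
    return losses

losses = train()
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.2e}")
```

The loss recorded at each step is measured before that step's update, so the first entry reflects the untrained weights.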

[Interactive figure: the forward pass lights up activations, then a bar chart shows gradient magnitude by layer (output → input). Sample bar values: 0.005, 0.027, 0.128, 0.524. Controls: iteration (0 / 8), hidden layers (depth, shown at 3), activation.]

The forward pass lights activations; the backward pass flows the error gradient back edge by edge via the chain rule. Make the net deep with sigmoids and the early-layer gradients dim to nothing — vanishing gradients you can see. Switch to ReLU to revive them.

The idea in plain words

Watch the error flow back edge by edge. Make the network deep with sigmoids and the early-layer gradients dim to almost nothing — the vanishing-gradient problem, visible in the shrinking bars. Switch to ReLU to revive them.
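The shrinking bars can be reproduced in a few lines. The sketch below is not the visualization's own code. It assumes a hypothetical chain with one neuron per hidden layer, every weight fixed at 1.0, an input of 1.0, a target of 0.0, and a squared-error loss. The function name `delta_by_layer` is made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

ACTIVATIONS = {
    # name: (function, derivative with respect to the pre-activation z)
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "relu": (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0),
}

def delta_by_layer(depth, activation, weight=1.0, x=1.0, target=0.0):
    """Return |delta| for each hidden layer, ordered output -> input."""
    f, df = ACTIVATIONS[activation]

    # Forward pass through `depth` hidden units, remembering pre-activations.
    a, zs = x, []
    for _ in range(depth):
        z = weight * a
        zs.append(z)
        a = f(z)
    y = weight * a                      # linear output unit

    # Backward pass: each step multiplies by a weight and a local derivative.
    delta = y - target                  # output error for squared loss
    magnitudes = []
    for z in reversed(zs):
        delta = delta * weight * df(z)
        magnitudes.append(abs(delta))
    return magnitudes

for name in ("sigmoid", "relu"):
    bars = delta_by_layer(3, name)
    print(name, [f"{m:.4f}" for m in bars])
```

With these particular settings every ReLU unit is active, so its error signal arrives at the first layer unchanged, while the sigmoid signal loses at least three quarters of its size at each layer.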

Now, the math

The gradient for each weight is a local product, assembled by the chain rule:

$$\frac{\partial L}{\partial w^{(l)}_{ij}} = \delta^{(l)}_j \, a^{(l-1)}_i$$

$\delta^{(l)}_j$: the error signal at neuron j in layer l, propagated from the output.
$a^{(l-1)}_i$: the activation that fed into that weight on the forward pass.
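One way to convince yourself of the formula is a gradient check: compute each weight's gradient as δ times the incoming activation, then compare it with a finite-difference estimate. The sketch below assumes a single layer of sigmoid neurons with a squared-error loss. The weights, inputs, and targets are arbitrary illustrative numbers, and `weights[i][j]` follows the same index convention as the formula (input i, neuron j).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, a_prev):
    """weights[i][j] connects input i to neuron j; returns the layer outputs."""
    return [sigmoid(sum(weights[i][j] * a_prev[i] for i in range(len(a_prev))))
            for j in range(len(weights[0]))]

def loss(weights, a_prev, targets):
    outputs = forward(weights, a_prev)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

def analytic_gradient(weights, a_prev, targets):
    outputs = forward(weights, a_prev)
    # delta_j: output error at neuron j times the sigmoid derivative o * (1 - o).
    deltas = [(o - t) * o * (1.0 - o) for o, t in zip(outputs, targets)]
    # dL/dw_ij = delta_j * a_i
    return [[deltas[j] * a_prev[i] for j in range(len(deltas))]
            for i in range(len(a_prev))]

def numeric_gradient(weights, a_prev, targets, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = []
    for i in range(len(weights)):
        row = []
        for j in range(len(weights[0])):
            up = [r[:] for r in weights]
            down = [r[:] for r in weights]
            up[i][j] += eps
            down[i][j] -= eps
            row.append((loss(up, a_prev, targets) - loss(down, a_prev, targets))
                       / (2 * eps))
        grad.append(row)
    return grad

weights = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]]   # 3 inputs, 2 neurons
a_prev = [1.0, 0.5, -1.5]
targets = [1.0, 0.0]

analytic = analytic_gradient(weights, a_prev, targets)
numeric = numeric_gradient(weights, a_prev, targets)
max_gap = max(abs(a - n)
              for row_a, row_n in zip(analytic, numeric)
              for a, n in zip(row_a, row_n))
print(f"largest gap between analytic and numeric gradients: {max_gap:.2e}")
```

The two gradients should agree to within the finite-difference error, which is why the same check is a common sanity test for hand-written backward passes.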

The derivation

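Each δ is the activation derivative at that neuron times the weighted sum of the next layer's δ values, using the weights on the neuron's outgoing connections. The sigmoid's derivative is at most 0.25, and one such factor multiplies in at every layer, so the error signal shrinks roughly exponentially with depth unless the weights are large enough to compensate. As a result, the early layers of deep sigmoid networks barely update. ReLU's derivative is exactly 1 for every active unit (positive input) and 0 otherwise, so the signal passes through active units undiminished.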
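Written out, the recursion and the resulting bound look like this. The symbols $z^{(l)}_j$ (the pre-activation of neuron j in layer l) and $\sigma$ (the sigmoid) do not appear in the formula above and are introduced here as assumptions. The bound is stated for a simplified chain with one neuron per layer, where layer $L$ is the output.

$$\delta^{(l)}_j = \sigma'\!\left(z^{(l)}_j\right) \sum_k w^{(l+1)}_{jk} \, \delta^{(l+1)}_k$$

$$\delta^{(1)} = \delta^{(L)} \prod_{l=1}^{L-1} w^{(l+1)} \, \sigma'\!\left(z^{(l)}\right), \qquad \left|\delta^{(1)}\right| \le \left|\delta^{(L)}\right| \prod_{l=1}^{L-1} \frac{\left|w^{(l+1)}\right|}{4}$$

With unit weights and three hidden layers, the bound is $(1/4)^3 = 1/64 \approx 0.016$ of the output error, and each extra sigmoid layer divides it by at least 4 again.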

Now Break It

Try this: Increase the depth with sigmoid activations and watch the backward gradients shrink toward zero, until the early layers barely update.

Control: Depth slider with sigmoid activations


Last updated July 3, 2026.