ML Visualization

Backpropagation

Neural Networks · Advanced · ~10 min

Propagate error gradients backward to update every weight.


[Interactive figure. Panels: "Forward pass: activations light up" and "Gradient magnitude by layer (output → input)". Controls: iteration stepper, depth slider, activation selector.]

The forward pass lights activations; the backward pass flows the error gradient back edge by edge via the chain rule. Make the net deep with sigmoids and the early-layer gradients dim to nothing — vanishing gradients you can see. Switch to ReLU to revive them.

The idea in plain words

Backpropagation is how networks learn. It sends the output error backward through the layers using the chain rule, computing how much each weight contributed to the mistake, then nudges every weight in the direction that reduces the error. This is ordinary gradient descent, with backpropagation supplying the gradient for every weight in the net.
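To make the loop concrete, here is a minimal sketch of one training step on the smallest possible network: one input, one hidden sigmoid neuron, one output sigmoid neuron. The function name `train_step`, the squared-error loss, the starting weights, the learning rate, and the single training example are all assumptions chosen for illustration; none of them come from the page's visualization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w1, w2, x, y, lr=0.5):
    """One forward pass, one backward pass, one gradient-descent update."""
    # Forward pass: input -> hidden neuron -> output neuron.
    h = sigmoid(w1 * x)
    out = sigmoid(w2 * h)
    loss = 0.5 * (out - y) ** 2

    # Backward pass: apply the chain rule, starting from the output error.
    delta_out = (out - y) * out * (1.0 - out)   # dL/dz at the output neuron
    delta_hid = delta_out * w2 * h * (1.0 - h)  # dL/dz at the hidden neuron
    grad_w2 = delta_out * h                     # error signal * incoming activation
    grad_w1 = delta_hid * x

    # Update: nudge each weight against its gradient.
    return w1 - lr * grad_w1, w2 - lr * grad_w2, loss

# Hypothetical starting weights and a single (x, y) training example.
w1, w2 = 0.5, -0.3
losses = []
for _ in range(200):
    w1, w2, loss = train_step(w1, w2, x=1.0, y=0.9)
    losses.append(loss)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Each weight's gradient is built only from quantities that sit next to it: the error signal arriving from the layer after it and the activation that fed into it. That locality is what lets the same recipe scale to many layers.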

Watch the error flow back edge by edge. Make the network deep with sigmoids and the early-layer gradients dim to almost nothing — the vanishing-gradient problem, visible in the shrinking bars. Switch to ReLU to revive them.

Now, the math

The gradient for each weight is a local product, assembled by the chain rule:

$$\frac{\partial L}{\partial w^{(l)}_{ij}} = \delta^{(l)}_j \, a^{(l-1)}_i$$

$\delta^{(l)}_j$: the error signal at neuron j in layer l, propagated from the output.
$a^{(l-1)}_i$: the activation that fed into that weight on the forward pass.
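One way to build trust in this formula is to compare it against a finite-difference estimate. The sketch below does that for a single sigmoid layer with a squared-error loss. The layer sizes, weight values, targets, and helper names (`forward`, `loss`, `weight_gradients`) are hypothetical, and `W[i][j]` is assumed to connect input i to neuron j, matching the index order in the formula.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, a_prev):
    """Layer output a_j = sigmoid(sum_i W[i][j] * a_prev[i])."""
    n_out = len(W[0])
    z = [sum(W[i][j] * a_prev[i] for i in range(len(a_prev))) for j in range(n_out)]
    return [sigmoid(zj) for zj in z]

def loss(W, a_prev, target):
    a = forward(W, a_prev)
    return 0.5 * sum((aj - tj) ** 2 for aj, tj in zip(a, target))

def weight_gradients(W, a_prev, target):
    a = forward(W, a_prev)
    # delta_j = dL/dz_j = (a_j - t_j) * sigmoid'(z_j), where sigmoid'(z_j) = a_j * (1 - a_j)
    delta = [(aj - tj) * aj * (1.0 - aj) for aj, tj in zip(a, target)]
    # dL/dW[i][j] = delta_j * a_prev[i]
    return [[delta[j] * a_prev[i] for j in range(len(delta))]
            for i in range(len(a_prev))]

# Hypothetical layer: 3 inputs feeding 2 neurons.
W = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]]
a_prev = [0.9, 0.1, 0.5]
target = [1.0, 0.0]

analytic = weight_gradients(W, a_prev, target)

# Central finite differences: perturb one weight at a time, then restore it.
eps = 1e-6
max_err = 0.0
for i in range(len(W)):
    for j in range(len(W[0])):
        W[i][j] += eps
        up = loss(W, a_prev, target)
        W[i][j] -= 2 * eps
        down = loss(W, a_prev, target)
        W[i][j] += eps
        numeric = (up - down) / (2 * eps)
        max_err = max(max_err, abs(numeric - analytic[i][j]))

print(f"max |numeric - analytic| = {max_err:.2e}")
```

The finite-difference loop needs two loss evaluations per weight, while the analytic version gets every gradient from one forward pass and one backward pass. That gap is the practical reason backpropagation is used instead.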
The derivation

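Each δ is the activation derivative at that neuron times the weighted sum of the next layer's δ values, using the weights that connect the two layers. Because those derivatives (≤ 0.25 for sigmoid) multiply at every layer, the error signal tends to shrink exponentially as it travels back unless the weights are large enough to compensate, so early layers of deep sigmoid networks barely update. ReLU's derivative is exactly 1 for every active unit (positive input), so it passes the signal through undiminished; inactive units pass nothing.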
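Written out, the recursion and the resulting bound look roughly as follows. The symbols $z^{(l)}_j$ (pre-activation), $\sigma$ (activation function), and $N$ (index of the output layer) are notation assumed here for the derivation; they do not appear elsewhere on the page. The last line specializes to a chain with one sigmoid neuron per layer, which is a simplification.

```latex
% Assumed notation: z^{(l)}_j is the pre-activation, \sigma the activation
% function, N the index of the output layer.
\begin{align*}
z^{(l)}_j &= \sum_i w^{(l)}_{ij}\, a^{(l-1)}_i, \qquad
a^{(l)}_j = \sigma\!\left(z^{(l)}_j\right), \qquad
\delta^{(l)}_j = \frac{\partial L}{\partial z^{(l)}_j} \\
\delta^{(l)}_j &= \sum_k \frac{\partial L}{\partial z^{(l+1)}_k}\,
  \frac{\partial z^{(l+1)}_k}{\partial a^{(l)}_j}\,
  \frac{\partial a^{(l)}_j}{\partial z^{(l)}_j}
  = \sigma'\!\left(z^{(l)}_j\right) \sum_k w^{(l+1)}_{jk}\, \delta^{(l+1)}_k \\
\frac{\partial L}{\partial w^{(l)}_{ij}} &= \frac{\partial L}{\partial z^{(l)}_j}\,
  \frac{\partial z^{(l)}_j}{\partial w^{(l)}_{ij}}
  = \delta^{(l)}_j\, a^{(l-1)}_i \\
% One neuron per layer, sigmoid activation (\sigma' \le 1/4):
\left|\delta^{(l)}\right| &= \left|\delta^{(N)}\right|
  \prod_{m=l}^{N-1} \sigma'\!\left(z^{(m)}\right) \left|w^{(m+1)}\right|
  \;\le\; \left|\delta^{(N)}\right| \left(\tfrac{1}{4}\right)^{N-l}
  \prod_{m=l}^{N-1} \left|w^{(m+1)}\right|
\end{align*}
```

As a worked number: if every weight in the chain has magnitude at most 1, then four sigmoid layers back from the output the error signal is at most $(1/4)^4 = 1/256$, about 0.004 of its original size.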

Now Break It

Try this: In a deep sigmoid net the backward gradients shrink toward zero — early layers barely update.

Control: Depth slider with sigmoid activations
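The experiment can also be approximated offline. The sketch below is a deliberately simplified stand-in for the visualization, not a reproduction of it: it assumes one neuron per layer, the same weight `w` on every connection, a unit error signal at the output, and a hypothetical helper `gradient_by_layer`. Its numbers will not match the bars on the page.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each entry maps a name to (activation, derivative of the activation).
ACTIVATIONS = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "relu": (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0),
}

def gradient_by_layer(depth, activation, w=1.0, x=0.5):
    """Return |dL/dw| per layer, ordered output -> input, for a 1-neuron-wide chain."""
    act, act_prime = ACTIVATIONS[activation]

    # Forward pass: activations[k] is the input to layer k.
    activations, pre_acts = [x], []
    for _ in range(depth):
        z = w * activations[-1]
        pre_acts.append(z)
        activations.append(act(z))

    # Backward pass: start from an assumed unit error signal at the output.
    upstream = 1.0
    grads = []
    for layer in reversed(range(depth)):
        delta = upstream * act_prime(pre_acts[layer])   # error signal at this layer
        grads.append(abs(delta * activations[layer]))   # delta * incoming activation
        upstream = delta * w                            # pass the signal one layer back
    return grads

sig = gradient_by_layer(8, "sigmoid")
relu = gradient_by_layer(8, "relu")
print("sigmoid:", [f"{g:.2e}" for g in sig])
print("relu:   ", [f"{g:.2e}" for g in relu])
```

With these settings every ReLU unit stays active, so its gradients stay flat across layers, while the sigmoid gradients drop by at least a factor of four per layer. Changing `depth` plays the role of the depth slider.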
