Activation Functions

Neural Networks · Intermediate · ~6 min

Activation Functions — The nonlinearity that lets networks bend.

Without a nonlinear activation, stacking layers is pointless — it collapses back to one linear map. Activations like ReLU and sigmoid are what let networks learn curves and complex boundaries.

[Interactive plot: activation function f(x) and its derivative f′(x) versus input x. Controls: function selector, input x (shown at 1.5), stacked layers (depth, shown at 1).]

Sigmoid’s derivative flatlines at the extremes; ReLU’s stays at 1 for positive inputs. Stack many sigmoids and the gradient product shrinks toward zero — the vanishing-gradient problem.

The idea in plain words

Without a nonlinear activation, stacking layers is pointless — the whole network collapses back to a single linear map. Activations like ReLU and sigmoid are what let a network learn curves and complex boundaries.
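The collapse can be checked numerically. The sketch below is illustrative only: the weights `W1`, `b1`, `W2`, `b2` and the input `x` are arbitrary values chosen for this example, not taken from the visualization. It composes two linear layers, folds them into a single layer with W = W2·W1 and b = W2·b1 + b2, and then shows that inserting a ReLU between the layers breaks the equivalence.

```python
# Two linear layers with no activation in between, using plain Python lists.
# W1, b1, W2, b2 and x are arbitrary illustrative values (assumptions).

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.5, 0.0]
W2, b2 = [[2.0, 0.0], [1.0, 1.0]], [-1.0, 2.0]

def two_layers(x):
    """Layer 2 applied to layer 1, with no nonlinearity between them."""
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same computation folded into one layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    """Single linear layer equivalent to two_layers."""
    return add(matvec(W, x), b)

def relu_between(x):
    """Same weights, but with ReLU after layer 1."""
    hidden = [max(0.0, h) for h in add(matvec(W1, x), b1)]
    return add(matvec(W2, hidden), b2)

x = [1.5, -2.0]
print("two linear layers:", two_layers(x))
print("one folded layer: ", one_layer(x))
print("with ReLU between:", relu_between(x))
```

The first two printed vectors agree, while the third differs because ReLU zeroes the negative hidden unit, which no single linear map built from these weights can reproduce.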

Training works with each function’s derivative, not just its value. Sigmoid’s derivative flattens toward zero at both extremes, while ReLU’s stays at 1 for every positive input (and is 0 for negative inputs). That contrast is the visual root cause of vanishing gradients in deep networks.

Now, the math

Common activations and their behavior at the extremes:

$$\sigma(x) = \frac{1}{1+e^{-x}},\qquad \text{ReLU}(x) = \max(0, x)$$

$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$: the sigmoid gradient. It peaks at 0.25 when $x = 0$ and vanishes for large $|x|$.
$\text{ReLU}'(x)$: exactly 1 for positive inputs, so no shrinkage there, and 0 for negative inputs.
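As a quick numerical check of these two derivatives, here is a minimal sketch using only the standard library. The function names and sample inputs are choices made for this example, and treating the ReLU derivative as 0 at exactly x = 0 is an assumed convention (the derivative is undefined there).

```python
import math

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    """ReLU, max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU; 0 at x == 0 by assumed convention."""
    return 1.0 if x > 0 else 0.0

# Sample inputs chosen for illustration, from saturated to centered.
for x in (-10.0, -1.5, 0.0, 1.5, 10.0):
    print(f"x={x:6.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

The sigmoid derivative is largest at x = 0, where it equals 0.25, and is already tiny at |x| = 10, whereas the ReLU derivative is 1 on the positive side regardless of magnitude.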

The derivation

During backprop, the gradient is multiplied by the activation’s derivative at every layer. Sigmoid derivatives are at most 0.25, so across n layers these factors alone contribute at most 0.25^n, which shrinks exponentially toward zero. ReLU’s derivative of 1 on positive inputs passes the signal through unchanged, a main reason it replaced sigmoid in deep nets.
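The shrinking product can be sketched in a few lines. This is a deliberately simplified model, not how a real network behaves in full: it assumes every layer sees the same pre-activation value `x` and ignores the weight matrices, so it isolates only the activation-derivative factor. The helper name `gradient_factor` and the depths tried are hypothetical choices for this example; `x = 1.5` mirrors the input shown in the plot.

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid at x."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU; 0 at x == 0 by assumed convention."""
    return 1.0 if x > 0 else 0.0

def gradient_factor(act_grad, x, depth):
    """Product of `depth` identical activation derivatives at x.

    Simplifying assumption: each layer has the same pre-activation x
    and weights are ignored.
    """
    return act_grad(x) ** depth

x = 1.5  # same input value as shown in the interactive plot
for depth in (1, 5, 10, 20):
    s = gradient_factor(sigmoid_grad, x, depth)
    r = gradient_factor(relu_grad, x, depth)
    print(f"depth={depth:2d}  sigmoid factor={s:.3e}  relu factor={r:.1f}")
```

Even in the best case for sigmoid (x = 0, derivative 0.25), ten layers give a factor of 0.25^10, roughly one in a million, while the ReLU factor stays at 1 for any positive input.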

Now Break It

Try this: feed sigmoid large inputs. It saturates at the extremes, its gradient vanishes, and a deep stack of sigmoids barely learns.

Control: Activation selector (sigmoid) with large inputs

Last updated July 3, 2026.