Activation Functions
The nonlinearity that lets networks bend.
Without a nonlinear activation, stacking layers is pointless — it collapses back to one linear map. Activations like ReLU and sigmoid are what let networks learn curves and complex boundaries.
[Interactive plot: activation and its derivative (gradient) plotted against input x]
Sigmoid’s derivative flatlines at the extremes; ReLU’s stays at 1 for positive inputs. Stack many sigmoids and the gradient product shrinks toward zero — the vanishing-gradient problem.
The idea in plain words
Without a nonlinear activation, stacking layers is pointless: a composition of linear maps is itself a linear map, so the whole network collapses back to a single one. Activations like ReLU and sigmoid are what let a network learn curves and complex boundaries.
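The collapse can be checked by hand on a tiny example. The sketch below uses two hypothetical 2x2 weight matrices (`W1`, `W2`) with arbitrary values and omits biases for brevity; it is an illustration of the idea, not a real network.

```python
# Two linear layers with no activation between them behave like one layer.
# W1, W2 and x are arbitrary illustrative values (assumptions, not from the text).

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu_vec(v):
    """Apply ReLU elementwise."""
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

stacked = matvec(W2, matvec(W1, x))              # layer 2 applied to layer 1
collapsed = matvec(matmul(W2, W1), x)            # one equivalent linear layer
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # nonlinearity in between

print(stacked, collapsed, with_relu)
```

The stacked and collapsed results agree, while inserting ReLU between the layers gives a different output that no single matrix `W2 @ W1` reproduces for all inputs.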
Training uses each function’s derivative, not just its value. Sigmoid’s derivative flatlines at the extremes, while ReLU’s stays at 1 for positive inputs; that difference is the visual root cause of vanishing gradients in deep networks.
Now, the math
Common activations and their behavior at the extremes:
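- $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, where $\sigma(x) = 1 / (1 + e^{-x})$: the sigmoid gradient, which peaks at 0.25 (at $x = 0$) and vanishes for large $|x|$.
- $\mathrm{ReLU}'(x) = 1$ for $x > 0$: exactly 1 for positive inputs, so no shrinkage.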
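These two behaviors are easy to tabulate. The following is a minimal sketch using only Python's standard library; the function names are illustrative, and returning 0 for the ReLU derivative at exactly x = 0 is a common convention assumed here, since the derivative is undefined at that point.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """1 for positive inputs, 0 otherwise (0 at x == 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

The sigmoid column peaks at 0.25 for x = 0 and falls off quickly on both sides, while the ReLU column is exactly 1 wherever the input is positive.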
During backprop, the gradient is multiplied by the activation’s derivative at every layer. Sigmoid derivatives are at most 0.25, so across n layers these factors contribute at most 0.25^n, which shrinks exponentially toward zero. ReLU’s derivative is exactly 1 for positive inputs, so it passes the signal through unscaled, which is the main reason it replaced sigmoid in deep nets.
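To get a feel for how fast that product shrinks, the sketch below multiplies one activation derivative per layer. It deliberately ignores the weight matrices that real backprop also multiplies in, and the helper name `gradient_scale` is hypothetical; it isolates only the activation factor described above.

```python
def gradient_scale(per_layer_derivative, depth):
    """Product of `depth` identical activation derivatives along one path."""
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_derivative
    return scale

for depth in (1, 5, 10, 20):
    sig = gradient_scale(0.25, depth)  # best case for sigmoid (derivative at x = 0)
    rel = gradient_scale(1.0, depth)   # ReLU on a path where every input is positive
    print(f"depth={depth:2d}  sigmoid factor <= {sig:.3e}  relu factor = {rel:.1f}")
```

Even in sigmoid's best case the factor drops below one in a million by depth 10, while the ReLU factor stays at 1 as long as the path stays active.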
Now Break It
Try this: set the activation selector to sigmoid and feed it large inputs. Sigmoid saturates at the extremes and its gradient vanishes, which is why deep sigmoid nets barely learn.
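The same experiment can be approximated offline. The sketch below applies sigmoid repeatedly to a single scalar and tracks the chain-rule gradient; fixing every weight to 1 and the name `chain_gradient` are assumptions made to isolate the activation's effect, so the numbers are illustrative rather than representative of a trained network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def chain_gradient(x, depth):
    """Derivative of sigmoid applied `depth` times to x, via the chain rule.

    Weights are fixed to 1 (an assumption) to isolate the activation.
    """
    grad = 1.0
    for _ in range(depth):
        grad *= sigmoid_grad(x)  # local derivative at this layer's input
        x = sigmoid(x)           # this layer's output feeds the next layer
    return grad

for x0 in (0.0, 5.0, 20.0):
    print(
        f"input={x0:5.1f}  sigmoid={sigmoid(x0):.6f}  "
        f"1-layer grad={sigmoid_grad(x0):.2e}  "
        f"10-layer grad={chain_gradient(x0, 10):.2e}"
    )
```

Large inputs pin the output near 1 and drive the first-layer derivative toward zero, and stacking ten layers shrinks the gradient further even when the input starts at 0.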