Encoding Categorical Features
Turn categories into numbers models can use.
Models eat numbers, not words. Encoding turns categories like “red, green, blue” into numeric form — but the wrong encoding can invent a fake ordering that misleads the model.
Fake ordering! Label encoding tells the model red < green < blue — an order that doesn’t exist.
Label encoding tells the model that Red < Green < Blue and that Red↔Blue is twice as far apart as Red↔Green, an order and spacing that don’t exist. One-hot gives each category its own column, so all are equidistant.
The idea in plain words
Models eat numbers, not words. Encoding turns categories like “Red, Green, Blue” into numeric form — but the wrong choice invents structure that isn’t there. Label encoding assigns 0, 1, 2, which tells the model the categories are ordered and evenly spaced.
One-hot encoding instead gives each category its own binary column, so every pair is equally distant. For unordered categories that’s the honest representation.
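The two encodings described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the sample list `colors` and the fixed category order in `categories` are assumptions made for the example.

```python
# Hypothetical toy data; the category names follow the text's example.
colors = ["Red", "Green", "Blue", "Green"]
categories = ["Red", "Green", "Blue"]  # fixed order, assumed for illustration

# Label encoding: one integer per category (Red=0, Green=1, Blue=2).
label_map = {name: index for index, name in enumerate(categories)}
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one binary column per category, exactly one 1 per row.
one_hot_encoded = [
    [1 if c == name else 0 for name in categories]
    for c in colors
]

print(label_encoded)    # [0, 1, 2, 1]
print(one_hot_encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

The label codes fit in a single column but carry an order. The one-hot rows need three columns and carry none.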
Now, the math
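Under label encoding, with Red = 0, Green = 1, Blue = 2, the model reads a false ordering and a false distance:
$$0 < 1 < 2, \qquad |x_{\text{Blue}} - x_{\text{Red}}| = 2 = 2\,|x_{\text{Green}} - x_{\text{Red}}|$$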
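One-hot makes every distinct pair equidistant. With $e_{\text{Red}} = (1,0,0)$, $e_{\text{Green}} = (0,1,0)$, $e_{\text{Blue}} = (0,0,1)$:
$$\lVert e_i - e_j \rVert_2 = \sqrt{2} \quad \text{for all } i \neq j$$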
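The distance claims can be checked numerically. The sketch below assumes the mapping Red = 0, Green = 1, Blue = 2 for label codes and uses Euclidean distance for the one-hot vectors. The helper name `euclidean` is introduced here for illustration only.

```python
import math
from itertools import combinations

label_codes = {"Red": 0, "Green": 1, "Blue": 2}
one_hot_codes = {"Red": (1, 0, 0), "Green": (0, 1, 0), "Blue": (0, 0, 1)}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Pairwise distances under each encoding.
label_dist = {
    (a, b): abs(label_codes[a] - label_codes[b])
    for a, b in combinations(label_codes, 2)
}
one_hot_dist = {
    (a, b): euclidean(one_hot_codes[a], one_hot_codes[b])
    for a, b in combinations(one_hot_codes, 2)
}

for pair in label_dist:
    print(pair, "label:", label_dist[pair], "one-hot:", round(one_hot_dist[pair], 4))
```

Under label codes the Red and Blue pair is twice as far apart as the Red and Green pair. Under one-hot every pair sits at the same distance, the square root of 2.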
The derivation
A linear model multiplies the encoded value by a single weight w, so with codes 0, 1, 2 the contribution of “Blue” (2w) is forced to be exactly twice that of “Green” (w), while “Red” contributes 0. One-hot lets each category get its own independent weight, which removes the artificial order at the cost of one extra column per category.
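To make the weight argument concrete, here is a small sketch with invented numbers. The targets in `samples` are hypothetical and chosen so the true effects are not in Red, Green, Blue order. The label-encoded fit uses the closed-form simple linear regression (one slope plus an intercept). For the one-hot fit without an intercept, the least-squares weight for each column is the mean target of that category.

```python
# Hypothetical data: (category, target). Values are invented for illustration.
samples = [("Red", 5.0), ("Green", 1.0), ("Blue", 3.0)]
label_codes = {"Red": 0, "Green": 1, "Blue": 2}

# --- Label encoding: fit y = b + w * code by ordinary least squares. ---
xs = [label_codes[c] for c, _ in samples]
ys = [y for _, y in samples]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b = mean_y - w * mean_x
label_pred = {c: b + w * label_codes[c] for c in label_codes}

# --- One-hot encoding, no intercept: one independent weight per category. ---
# With one-hot columns the least-squares weight is the per-category mean.
totals, counts = {}, {}
for c, y in samples:
    totals[c] = totals.get(c, 0.0) + y
    counts[c] = counts.get(c, 0) + 1
one_hot_weights = {c: totals[c] / counts[c] for c in totals}

print("label-encoded predictions:", label_pred)
print("one-hot weights:          ", one_hot_weights)
```

With a single slope the label-encoded model can only place the three predictions on a straight line (4, 3, 2 here), so it cannot match a Green value that sits below both Red and Blue. The one-hot weights recover each category's value exactly.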
Now Break It
Try this: apply label encoding to unordered categories and watch it invent a fake numeric order that the model treats as meaningful.
Control: Encoding selector (set to label encoding)