ML Visualization

Encoding Categorical Features

Data Prep & Model Evaluation · Beginner · ~5 min

Turn categories into numbers models can use.

Models eat numbers, not words. Encoding turns categories like “red, green, blue” into numeric form — but the wrong encoding can invent a fake ordering that misleads the model.

Distance the model perceives (label encoding)

         Red    Green  Blue
Red      0.00   1.00   2.00
Green    1.00   0.00   1.00
Blue     2.00   1.00   0.00

Label encoding tells the model that Red < Green < Blue and that the Red↔Blue gap is twice the Red↔Green gap, an order that doesn’t exist. One-hot encoding gives each category its own column, so all categories are equidistant.

The idea in plain words

Encoding turns categories like “Red, Green, Blue” into numbers, but the wrong choice invents structure that isn’t there. Label encoding assigns 0, 1, 2, which tells the model the categories are ordered and evenly spaced.

One-hot encoding instead gives each category its own binary column, so every pair is equally distant. For unordered categories that’s the honest representation.
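To make the two encodings concrete, here is a minimal sketch in plain Python. The `categories` list and the helper names `label_encode` and `one_hot_encode` are made up for illustration; in practice a library encoder would usually do this job.

```python
# Hypothetical category list, in the order the label codes are assigned.
categories = ["Red", "Green", "Blue"]

def label_encode(value, categories):
    """Map a category to its integer position: Red -> 0, Green -> 1, Blue -> 2."""
    return categories.index(value)

def one_hot_encode(value, categories):
    """Map a category to a binary vector with a single 1 in its own column."""
    return [1 if c == value else 0 for c in categories]

for name in categories:
    print(name, label_encode(name, categories), one_hot_encode(name, categories))
# Red 0 [1, 0, 0]
# Green 1 [0, 1, 0]
# Blue 2 [0, 0, 1]
```

The label codes sit on a single number line, which is where the implied order comes from. The one-hot vectors each point along their own axis, so no category is "between" two others.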

Now, the math

Under label encoding the model reads a false distance and ordering:

$$d(\text{Red},\text{Blue}) = 2 \neq 1 = d(\text{Red},\text{Green})$$

One-hot makes every distinct pair equidistant:

$$d(i, j) = \sqrt{2} \quad \forall\, i \neq j$$
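Both distance claims can be checked numerically. The sketch below assumes Euclidean distance and the code order Red = 0, Green = 1, Blue = 2; the names `label_codes`, `one_hot`, and `distance_matrix` are illustrative only.

```python
import math

categories = ["Red", "Green", "Blue"]

# Label encoding: each category is a single number (a 1-dimensional point).
label_codes = {name: [float(i)] for i, name in enumerate(categories)}

# One-hot encoding: each category is a unit vector along its own axis.
one_hot = {
    name: [1.0 if i == j else 0.0 for j in range(len(categories))]
    for i, name in enumerate(categories)
}

def distance_matrix(vectors):
    """Euclidean distance between every pair of encoded categories."""
    return [[math.dist(vectors[a], vectors[b]) for b in categories]
            for a in categories]

label_d = distance_matrix(label_codes)
onehot_d = distance_matrix(one_hot)

for name, row in zip(categories, label_d):
    print("label  ", name, [round(d, 2) for d in row])
for name, row in zip(categories, onehot_d):
    print("one-hot", name, [round(d, 2) for d in row])
```

Under label encoding the Red to Blue entry is 2 while Red to Green is 1. Under one-hot, two distinct categories differ in exactly two coordinates by 1 each, so every off-diagonal entry is the square root of 1 + 1, about 1.41.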

A linear model multiplies the encoded value by a single weight w, so with codes 0, 1, 2 the contributions are 0, w, and 2w: the effect of “Blue” relative to “Red” is forced to be exactly twice that of “Green.” One-hot lets each category get its own independent weight, removing the artificial order, at the cost of one column per category instead of a single column.
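A small worked example may help show what the shared weight costs. The sketch below uses made-up targets (Red = 10, Green = 30, Blue = 20) where the middle code has the highest value, and fits ordinary least squares by hand. For the one-hot case it assumes a model with no intercept, where the least-squares weight for each column is the mean target of that category.

```python
# Hypothetical data: one observation per category, with a non-monotone pattern.
rows = [("Red", 10.0), ("Green", 30.0), ("Blue", 20.0)]
categories = ["Red", "Green", "Blue"]

# Label encoding: a single feature x in {0, 1, 2}; fit y = intercept + slope * x.
xs = [float(categories.index(c)) for c, _ in rows]
ys = [y for _, y in rows]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean
label_pred = {c: intercept + slope * categories.index(c) for c in categories}

# One-hot encoding (no intercept): the columns do not overlap, so each
# least-squares weight is the mean target of the rows in that category.
onehot_weights = {}
for c in categories:
    values = [y for name, y in rows if name == c]
    onehot_weights[c] = sum(values) / len(values)

print(label_pred)      # {'Red': 15.0, 'Green': 20.0, 'Blue': 25.0}
print(onehot_weights)  # {'Red': 10.0, 'Green': 30.0, 'Blue': 20.0}
```

With label codes the best straight line has slope 5, so Green sits 5 above Red and Blue exactly 10 above Red, and the fit misses Green by 10. With one weight per category, the one-hot model reproduces all three targets.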

Now Break It

Try this: Label encoding unordered categories invents a fake numeric order the model treats as meaningful.

Control: Encoding selector (set to label encoding)
