Beyond One-Hot: An Exploration of Categorical Variables
In machine learning, data are king. The algorithms and models used to make predictions with the data are important, and very interesting, but ML is still subject to the idea of garbage-in-garbage-out. With that in mind, let’s look at a little subset of those input data: categorical variables.
Categorical variables (wiki) are those that take on one of a fixed number of possible values, rather than a value on a continuous scale. Each value assigns the measurement to one of those finite groups, or categories. They differ from ordinal variables in that they have no intrinsic ordering: no category is “greater than” or “closer to” another, whereas ordinal values are ranked. As an example:
- Ordinal: low, medium, high
- Categorical: Georgia, Alabama, South Carolina, … , New York
The machine learning algorithms we will use later tend to want numbers, not strings, as their inputs, so we need some method of coding to convert the categories into numbers.
A quick interjection: there is one other concept that will come up frequently in this post, and that is dimensionality. In simple terms, it is just the number of columns in the dataset, but it has significant downstream effects on the eventual models. At the extreme, the “curse of dimensionality” refers to the fact that in high-dimensional spaces some things just stop working properly. Even in relatively low-dimensional problems, a dataset with more dimensions requires a model with more parameters to describe it, and more parameters require more rows to learn reliably. If the number of rows in the dataset is fixed, adding extra dimensions without adding more information for the model to learn from can hurt the eventual model’s accuracy.
To circle back to the problem at hand: we want to code categorical variables into numbers, but we are concerned about this dimensionality problem. The obvious answer is to just assign an integer to each category (we are assuming we know all of the possible categories up front). This is called ordinal coding. It does not add any dimensions to the problem, but implies an order to the variable that may not actually exist.
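As a rough illustration of ordinal coding (not the exact code used in the experiments below, and with made-up column values), pandas can do this in a couple of lines:

```python
import pandas as pd

# Hypothetical example: a single categorical column of US states
df = pd.DataFrame({"state": ["Georgia", "Alabama", "New York", "Georgia"]})

# Ordinal coding: assign an arbitrary integer to each category.
# Note the implied (and meaningless) order: Alabama < Georgia < New York.
df["state_ordinal"] = df["state"].astype("category").cat.codes

print(df)
```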
Methodology
To find out how well this works in practice, I put together a simple Python script to test different coding methods on common datasets. First, an overview of the process:
- We gather a dataset for a classification problem that has categorical variables
- We use some method of coding to convert the X dataset into numeric values
- We use scikit-learn’s `cross_val_score` and a `BernoulliNB()` classifier to generate scores for the dataset. This is repeated 10x for each dataset and the mean of all scores is used (a sketch of this step follows the dataset list below).
- We store the dimensionality of the dataset, the mean score, and the time taken to code the data and generate the scores.
- This is all repeated for a few different datasets from the UCI dataset repository:
- Car Evaluation
- Mushrooms
- Splice Junctions
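As a rough sketch of the scoring step, assuming `X` is the already-encoded feature matrix and `y` holds the class labels (an illustration under those assumptions, not the exact script behind the results below):

```python
import time
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

def score_dataset(X, y, n_repeats=10):
    """Return (dimensionality, mean CV score, elapsed seconds) for one encoding."""
    start = time.time()
    scores = []
    for _ in range(n_repeats):
        # Cross-validate a Bernoulli naive Bayes classifier on the encoded data
        scores.append(cross_val_score(BernoulliNB(), X, y).mean())
    return X.shape[1], float(np.mean(scores)), time.time() - start
```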
I tried 7 different encoding methods (descriptions of 4-7 are taken from the statsmodels documentation):
- Ordinal: as described above
- One-Hot: one column per category, with a 1 or 0 in each cell indicating whether the row contains that column’s category
- Binary: first the categories are encoded as ordinal, then those integers are converted into binary code, then the digits from that binary string are split into separate columns. This encodes the data in fewer dimensions than one-hot, but with some distortion of the distances (see the sketch of one-hot and binary coding after this list).
- Sum: compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others.
- Polynomial: The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of encoding is used only for ordered categorical variables with equal spacing.
- Backward Difference: the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
- Helmert: the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. Hence the name ‘reverse’ is sometimes applied to differentiate it from forward Helmert coding (the contrast-coding sketch after this list shows the resulting matrices).
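To make the first three concrete, here is a minimal sketch of one-hot and binary coding in pandas (invented column and category names; not the code used for the benchmarks):

```python
import pandas as pd

df = pd.DataFrame({"state": ["Georgia", "Alabama", "South Carolina", "New York"]})

# One-hot: one new 0/1 column per category.
one_hot = pd.get_dummies(df["state"], prefix="state")

# Binary: ordinal-code first, then split the binary representation of each
# integer into separate digit columns (4 categories -> 2 columns here).
codes = df["state"].astype("category").cat.codes
n_digits = int(codes.max()).bit_length()
binary = pd.DataFrame({f"state_bin_{i}": (codes // 2**i) % 2 for i in range(n_digits)})

print(one_hot.shape[1], binary.shape[1])  # 4 columns vs. 2 columns
```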
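For the contrast-based schemes (sum, polynomial, backward difference, and Helmert), patsy exposes the underlying contrast matrices directly, which is a handy way to see exactly what columns each method produces. A small sketch with four made-up levels (again, not necessarily the code path used in the benchmark script):

```python
from patsy.contrasts import Sum, Poly, Diff, Helmert

levels = ["a", "b", "c", "d"]

# Each contrast maps k levels onto k-1 numeric columns; printing the matrices
# shows how the levels are compared against each other.
for contrast in (Sum(), Poly(), Diff(), Helmert()):
    print(type(contrast).__name__)
    print(contrast.code_without_intercept(levels).matrix)
```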
Results
Mushroom
Coding | Dataset | Dimensionality | Avg. Score | Elapsed Time |
---|---|---|---|---|
Ordinal | Mushroom | 22 | 0.810919 | 3.653194 |
One-Hot Encoded | Mushroom | 117 | 0.813252 | 8.193983 |
Helmert Coding | Mushroom | 117 | 0.837997 | 5.431131 |
Backward Difference Coding | Mushroom | 117 | 0.846864 | 7.829706 |
Sum Coding | Mushroom | 117 | 0.850555 | 4.929640 |
Polynomial Coding | Mushroom | 117 | 0.855596 | 6.136916 |
Binary Encoded | Mushroom | 43 | 0.871493 | 3.948484 |
Cars
Coding | Dataset | Dimensionality | Avg. Score | Elapsed Time |
---|---|---|---|---|
Sum Coding | Cars | 21 | 0.549347 | 1.456738 |
Helmert Coding | Cars | 21 | 0.577471 | 1.458556 |
Ordinal | Cars | 6 | 0.638522 | 1.466667 |
One-Hot Encoded | Cars | 21 | 0.648694 | 1.393966 |
Polynomial Coding | Cars | 21 | 0.666130 | 1.495264 |
Backward Difference Coding | Cars | 21 | 0.697274 | 1.499557 |
Binary Encoded | Cars | 12 | 0.697911 | 1.441609 |
Splice
Coding | Dataset | Dimensionality | Avg. Score | Elapsed Time |
---|---|---|---|---|
Ordinal | Splice | 61 | 0.681816 | 5.107389 |
Sum Coding | Splice | 3465 | 0.922276 | 25.898854 |
Binary Encoded | Splice | 134 | 0.935744 | 3.352499 |
One-Hot Encoded | Splice | 3465 | 0.944839 | 2.563578 |
Conclusions
This is by no means an exhaustive study, but it seems that binary coding performs well fairly consistently, without a significant increase in dimensionality. Ordinal coding, as expected, performs consistently poorly.
If you’d like to look at the source, add or suggest new datasets, or new coding methods, I’ve put everything (including datasets) up on GitHub: https://github.com/wdm0006/categorical_encoding.
Feel free to either contribute directly there, or comment with suggestions here.
Note: This exploration eventually led to the development of the category_encoders package, which provides scikit-learn compatible transformers for various encoding methods. The package has since been published in the Journal of Open Source Software and grown into a widely-used tool in the data science community.