This weekend I decided to dive into the Categorical Feature Encoding Challenge II that Kaggle is hosting so I could learn some things about using categorical features in machine learning models.
For this challenge, the dataset contained only categorical features and also different kinds of categorical features: Binary, nominal, and ordinal. On top of that, some of the features had low cardinality, while other features had high cardinality.
In the context of machine learning, “cardinality” refers to the number of possible values that a feature can assume. For example, the variable “US State” is one that has 50 possible values. Dealing with high cardinality turned out to be one of the most interesting parts of the challenge. It’s a problem that I haven’t encountered much before, certainly not to this extent.
There was a wide range of cardinality. The binary features, of course, could only assume one of two values (0 or 1). One of the ordinal features was uppercase letters of the alphabet, so it had 26 possible values. A feature that really took the cake had more than 1000 possible alpha-numeric strings like
de4c57ee2. And not only that, but some of the values were unique – only appearing once in the training set. Improper encoding of high-cardinality features can lead to poor model performance, and memory issues when trying to fit the model because encoding can result in extremely large matrices of values.
Dummy / one-hot encoding
A common strategy for encoding categorical variables is “one-hot” or “dummy” encoding, where a new column is created for every N (or N-1) of the possible values, and those columns are given a value of 1 or 0 depending on whether the feature had that value. Below is an example of dummy encoding for 4 hair colors:
Dummy / one-hot encoded variable with N columns:
You can imagine how dummy-coding 1000s of possible values might get out of hand, especially when some of them are very rare. This leads to the problem of “sparsity” where you end up with a huge matrix and almost every value is zero. In this dataset, one of my models was running over essentially 600000 X 5000 (rows X columns).
Fortunately there is a “compressed sparse row” matrix representation that solves this problem, which is part of the
scikit library of methods. The graphic below (source) illustrates how the sparse data is encoded into the matrix.
Target (mean) encoding
The other approach side-steps this problem entirely using a process called “target encoding”. Target encoding is when you calculate the mean of the target variable (in the training dataset) and use that in place of the categorical variable. Take the table below for example. The mean of the target variable for
0.5. We could substitute this mean for every time we see
red, and use this target-encoded variable as a predictor of the target (in the test set).
However, when you use target encoding you run the risk of “target leakage” because your model might then be using information about the target to predict the target. It’s similar to how your model might be overfit if you trained and validated a model using the same data (instead of using the conventional “train set” / “test set” split). This can ultimately result in an overfit model that performs poorly out-of-sample. Leakage can be avoided by using a “leave one out” target encoder, which encodes the target value while excluding the current row that it’s looking at.
The other challenge is finding the right model given how the data is encoded. I built a Logistic Regression model using a sparse one-hot encoded matrix for the high-cardinality features, and a CatBoost model using target encoding for the high-cardinality features.
My best model scored
0.78577 which is not far off from the #1 score of