Encoding high cardinality features with “embeddings”

In this post I show how the performance of an ML model can be improved by encoding high cardinality features using “embeddings”, a method that uses deep learning to represent categorical features as vectors. I compare the performance of embedding encoding with other common categorical encoding methods: one-hot, label, frequency, and target encoding.

Embedding is categorical encoding method that that uses deep learning to represent categorical features as vectors. It’s particularly useful for categorical features with many levels, since it can be used to project high-dimensional features into low-dimensional space.

In this blog post, I’ll show how ML models with embedding encoding outperform models with other common categorical encoding methods (frequency, label, one-hot, and target). For this demonstration, I’ll be using the dataset from Kaggle’s Playground Series S3E22: Predict Health Outcomes of Horses.

Looking at the character columns, I see some case inconsistency (i.e., some columns have both “None” and “none”). Converting all strings to lowercase would help to at least combine the “none” types.

Before diving into categorical encoding methods, I’ll do a train/test split. I’ll also convert the character columns to lowercase to address the problem I mentioned above (“None” vs. “none”), and I’ll convert hospital_number to categorical.

recipe_1hot_with_novel<-recipe(outcome~., data =train%>%select(-id))%>%step_normalize(all_numeric_predictors())%>%step_novel(all_nominal_predictors(), new_level ="NA")%>%step_dummy(all_nominal_predictors(), one_hot=T)

The first – and probably most popular – type of categorical encoding is one-hot encoding. One-hot encoding transforms a single categorical variable with N levels into binary variables encoding each of the N levels.

For example, age is a categorical variable with 2 levels.

When age is one-hot encoded, a column is created for each level to encode the value (e.g., if the original value was adult, then the age_adult column gets a 1 and the other columns get a 0). And since I’ve also included a step to encode novel levels as NA, there is also a third column for that.

recipe_label<-recipe(outcome~., data =train%>%select(-id))%>%step_normalize(all_numeric_predictors())%>%step_integer(all_nominal_predictors())

With label encoding, each level of the categorical variable is given an (arbitrary) number. In the tidymodels framework, step_integer works like scikit’s LabelEncoder, and encodes new values as zero. Here we see that one level of age was encoded as “1” and the other was encoded as “2”.

With frequency encoding, levels of the categorical variable are replaced with their frequency. Here, we can see how the levels of age have been replaced with their frequency in the training set. (When this is applied to the test set, these same training frequencies will be used.)

recipe_freq<-recipe(outcome~., data =train_freq%>%select(-id))%>%step_normalize(all_numeric_predictors())

Target encoding

For target encoding (also called “effect encoding” or “likelihood encoding”), I’ll be using the h2o package because it supports multi-class targets. (The embed package can also do target encoding and integrates better with a tidymodels workflow, but at the moment it only supports binary targets.)

Using h2o requires some additional setup.

# Convert to h2o formatdf_h2o<-as.h2o(df)# Split the dataset into train and testsplits_h2o<-h2o.splitFrame(data =df_h2o, ratios =.8, seed =42)train_h2o<-splits_h2o[[1]]test_h2o<-splits_h2o[[2]]

With target encoding, the levels of the categorical variable are replaced with their mean value on the target. For example, if the level “young” was associated with a mean target value of 0.75, then this is the value with which that level would be replaced.

Because the outcome is being used for encoding, care needs to be taken when using this method to avoid leakage and overfitting. In this case, I’ll use the “Leave One Out” method: for each row, the mean is calculated over all rows excluding that row.

# Choose which columns to encodeencode_columns<-colnames(df%>%select(where(is.factor), -outcome))# All categorical variables# Train a TE modelte_model<-h2o.targetencoder(x =encode_columns, y ='outcome', keep_original_categorical_columns=T, training_frame =train_h2o, noise=0, seed=100, blending =T, # Blending helps with levels that are more rare data_leakage_handling ="LeaveOneOut")# New target encoded training and test datasetstrain_te<-h2o.transform(te_model, train_h2o)test_te<-h2o.transform(te_model, test_h2o)

Here we can see how the target encoding strategy encoded age: Two new variables are created, age_euthanized_te and age_lived_te. The encoded values represent the proportion of cases that were euthanized, or lived, for each level of age. (Note: The “died” level of the outcome variable is missing. This is because if we know the proportion that were euthanized and lived, we also know the proportion that died.)

# Create a recipe to use laterrecipe_target<-recipe(outcome~., data =train_te%>%as.data.frame()%>%select(-id))%>%step_normalize(all_numeric_predictors())

Embedding encoding

Embedding is categorical encoding method that that uses deep learning to represent categorical features as vectors. It’s particularly useful for categorical features with many levels, since it can be used to project high-dimensional features into low-dimensional space.

I want to use embedding strategically. Since embeddings are particularly useful to project into lower-dimensional space, this means it’s going to be most useful for categorical variables that have many levels. For variables with fewer than 3 levels, I’ll use one-hot encoding. For variables with more than 3 levels, I’ll use embeddings and project them onto 3 levels. I’ll project hospital_number onto 50 levels, and lesion_1 onto 25 levels. (This is somewhat arbitrary; I did a quick few tests – not shown here – to arrive at these numbers.)

I’ll define 3 models to evaluate: multinomial logistic regression, random forest, and xgboost.

multinom_mod<-multinom_reg()%>%# Need to bump the max weights, otherwise it won't runset_engine("nnet", MaxNWts =10000)%>%set_mode("classification")ranger_mod<-rand_forest(trees=1000)%>%set_engine("ranger")%>%set_mode("classification")xgboost_mod<-boost_tree(trees=50)%>%set_engine("xgboost")%>%set_mode("classification")

Model fit function

With 3 models and 5 categorical encodings, I’ll need to fit 15 models. To streamline this process, I’ll define two functions:

fit_model(): Given training and test datasets, a workflow containing a recipe for the categorical encoding, a model type, and an encoding type, this function will evaluate the model in-sample using cross-validation, then evaluate it out-of-sample, and then return a dataframe containing the results

fit_encodings(): Given a model and model type, this function will generate recipes for each of the 5 categorical encodings, fit the 5 encodings using the model, and then return a dataframe with the results

fit_model<-function(train, test, workflow, model_type, encoding_type){set.seed(42)folds<-vfold_cv(train, v =5)resampled_fit<-workflow%>%fit_resamples(folds, metrics =metric_set(f_meas))# Get in-sample F1(resampled_fit%>%collect_metrics())$mean->train_perf# Get out-of-sample F1fit<-workflow%>%fit(train)test$pred<-predict(fit, test)$.pred_class(f_meas(test, outcome, pred, estimator='micro'))$.estimate->test_perf# Combine in-sample and out-of-sample into a dataframedf_perf<-data.frame(model_type =model_type, encoding_type =encoding_type, train_perf =train_perf, test_perf =test_perf)return(df_perf)}# Given a model, run it across the 4 encodings and return a dataframe that summarizes the resultsfit_encodings<-function(model, model_type){set.seed(42)tensorflow::set_random_seed(42)# One-hot encoded modelwflow_1hot<-workflow()%>%add_model(model)%>%add_recipe(recipe_1hot_with_novel)fit_model(train%>%select(-id), test%>%select(-id), wflow_1hot, model_type,'onehot')->onehot_model_results# Label encoded modelwflow_label<-workflow()%>%add_model(model)%>%add_recipe(recipe_label)fit_model(train%>%select(-id), test%>%select(-id), wflow_label, model_type,'label')->label_model_results# Frequency encoded modelwflow_freq<-workflow()%>%add_model(model)%>%add_recipe(recipe_freq)fit_model(train_freq%>%select(-id), test_freq%>%select(-id), wflow_freq, model_type,'frequency')->freq_model_results# Target encoded modelwflow_target<-workflow()%>%add_model(model)%>%add_recipe(recipe_target)fit_model(train_te%>%as.data.frame()%>%select(-id), test_te%>%as.data.frame()%>%select(-id), wflow_target, model_type,'target')->target_model_results# Embedding encoded modelwflow_embedding<-workflow()%>%add_model(model)%>%add_recipe(recipe_embedding)fit_model(train%>%as.data.frame()%>%select(-id), test%>%as.data.frame()%>%select(-id), wflow_embedding, model_type,'embedding')->embedding_model_results# Compile results into a dataframeonehot_model_results%>%bind_rows(label_model_results)%>%bind_rows(freq_model_results)%>%bind_rows(target_model_results)%>%bind_rows(embedding_model_results)->resultsresults}

I’ll run each of the models using the fit_encodings() and fit_model() functions that I just defined.

model_results%>%mutate(colors =ifelse(encoding_type=='embedding', '1', '0'))%>%ggplot()+geom_col(aes(x =model_type, group =encoding_type, fill =encoding_type, y =test_perf, color =colors), position='dodge')+scale_y_continuous(limits=c(0.60, 0.75), oob =rescale_none, breaks =seq(0.60, 0.75, 0.01))+labs(title ='F1 score by model type and categorical encoding method', subtitle ='The best models used embedding encoding', fill ='Encoding', y ='F1 score', x ='Model')+scale_color_manual(values=c("white", "black"))+guides(colour ="none")+theme_minimal()

---title: 'Encoding high cardinality features with "embeddings"'date: 2023-09-19description: "In this post I show how the performance of an ML model can be improved by encoding high cardinality features using \"embeddings\", a method that uses deep learning to represent categorical features as vectors. I compare the performance of embedding encoding with other common categorical encoding methods: one-hot, label, frequency, and target encoding."image: social-image.pngtwitter-card: image: "social-image.png"open-graph: image: "social-image.png"categories: - kaggle - categorical-encoding - machine-learning - Rfreeze: true---Embedding is categorical encoding method that that uses deep learning to represent categorical features as vectors. It's particularly useful for categorical features with many levels, since it can be used to project high-dimensional features into low-dimensional space.In this blog post, I'll show how ML models with embedding encoding outperform models with other common categorical encoding methods (frequency, label, one-hot, and target). For this demonstration, I'll be using the dataset from [Kaggle's Playground Series S3E22: Predict Health Outcomes of Horses](https://www.kaggle.com/competitions/playground-series-s3e22/data).# Load libraries```{r, message=F}library(tidyverse)library(tidymodels)library(lares)library(ranger)library(xgboost)library(tensorflow)library(keras)library(h2o)library(encodeR)h2o.init()h2o.no_progress()```# Load the data```{r}df <-read_csv('data/playground-series-s3e22/train.csv', show_col_types = F)```# Categorical variablesThis dataset contains 17 character columns.```{r, message=F}colnames(df[, sapply(df, class) =='character'])```However, there are also several numeric categorical columns: `hospital_number`, `lesion_1`, `lesion_2`, and `lesion_3`. ```{r}df %>%select(hospital_number, lesion_1, lesion_2, lesion_3) %>%head()````hospital_number` and `lesion_1` are of particular interest because they have so many levels.```{r}length(unique(df$hospital_number))length(unique(df$lesion_1))length(unique(df$lesion_2))length(unique(df$lesion_3))```Looking at the character columns, I see some case inconsistency (i.e., some columns have both "None" and "none"). Converting all strings to lowercase would help to at least combine the "none" types.```{r}char_cols <-colnames(df %>%select(where(is.character)))for(col in char_cols){print(paste0(toupper(col), ': ', paste0(distinct(df[col])[[1]], collapse=', ')))}```# Train/test splitBefore diving into categorical encoding methods, I'll do a train/test split. I'll also convert the character columns to lowercase to address the problem I mentioned above ("None" vs. "none"), and I'll convert `hospital_number` to categorical.```{r}set.seed(42)df %>%mutate_if(where(is.character), .funs=tolower) %>%mutate(outcome =as.factor(outcome)) %>%mutate(across(where(is.character), factor),hospital_number =as.factor(hospital_number),lesion_1 =as.factor(lesion_1),lesion_2 =as.factor(lesion_2),lesion_3 =as.factor(lesion_3)) -> dfsplit <-initial_split(df)train <-training(split)test <-testing(split)```# Categorical encoding## One-hot encoding```{r}recipe_1hot_with_novel <-recipe(outcome ~ ., data = train %>%select(-id)) %>%step_normalize(all_numeric_predictors()) %>%step_novel(all_nominal_predictors(), new_level ="NA") %>%step_dummy(all_nominal_predictors(), one_hot=T)```The first -- and probably most popular -- type of categorical encoding is one-hot encoding. One-hot encoding transforms a single categorical variable with N levels into binary variables encoding each of the N levels.For example, `age` is a categorical variable with 2 levels.```{r}levels(train$age)length(levels(train$age))```When `age` is one-hot encoded, a column is created for each level to encode the value (e.g., if the original value was `adult`, then the `age_adult` column gets a 1 and the other columns get a 0). And since I've also included a step to encode novel levels as `NA`, there is also a third column for that.```{r}recipe_1hot_with_novel %>%prep() %>%bake(new_data =NULL) %>%select(starts_with('age')) %>%head(3)```## Label encoding```{r}recipe_label <-recipe(outcome ~ ., data = train %>%select(-id)) %>%step_normalize(all_numeric_predictors()) %>%step_integer(all_nominal_predictors())```With label encoding, each level of the categorical variable is given an (arbitrary) number. In the `tidymodels` framework, `step_integer` works like scikit's `LabelEncoder`, and encodes new values as zero. Here we see that one level of `age` was encoded as "1" and the other was encoded as "2".```{r}recipe_label %>%prep() %>%bake(new_data =NULL) %>%select(age) %>% distinct```## Frequency encoding```{r, message=F}freq_encoding <- encodeR::frequency_encoder(X_train = train,X_test = test, cat_columns =colnames(df %>%select(where(is.factor), -outcome)))train_freq <- freq_encoding$traintest_freq <- freq_encoding$test```With frequency encoding, levels of the categorical variable are replaced with their frequency. Here, we can see how the levels of `age` have been replaced with their frequency in the training set. (When this is applied to the test set, these same training frequencies will be used.)```{r}train_freq %>%select(age) %>%distinct()``````{r}recipe_freq <-recipe(outcome ~ ., data = train_freq %>%select(-id)) %>%step_normalize(all_numeric_predictors())```## Target encodingFor target encoding (also called "effect encoding" or "likelihood encoding"), I'll be using the `h2o` package because it supports multi-class targets. (The `embed` package can also do target encoding and integrates better with a tidymodels workflow, but at the moment it only supports binary targets.)Using `h2o` requires some additional setup.```{r}# Convert to h2o formatdf_h2o <-as.h2o(df)# Split the dataset into train and testsplits_h2o <-h2o.splitFrame(data = df_h2o, ratios = .8, seed =42)train_h2o <- splits_h2o[[1]]test_h2o <- splits_h2o[[2]]```With target encoding, the levels of the categorical variable are replaced with their mean value on the target. For example, if the level "young" was associated with a mean target value of 0.75, then this is the value with which that level would be replaced. Because the outcome is being used for encoding, care needs to be taken when using this method to avoid leakage and overfitting. In this case, I'll use the "Leave One Out" method: for each row, the mean is calculated over all rows excluding that row.```{r, message=F}# Choose which columns to encodeencode_columns <-colnames(df %>%select(where(is.factor), -outcome)) # All categorical variables# Train a TE modelte_model <-h2o.targetencoder(x = encode_columns,y ='outcome', keep_original_categorical_columns=T,training_frame = train_h2o,noise=0,seed=100,blending = T, # Blending helps with levels that are more raredata_leakage_handling ="LeaveOneOut")# New target encoded training and test datasetstrain_te <-h2o.transform(te_model, train_h2o)test_te <-h2o.transform(te_model, test_h2o)```Here we can see how the target encoding strategy encoded `age`: Two new variables are created, `age_euthanized_te` and `age_lived_te`. The encoded values represent the proportion of cases that were euthanized, or lived, for each level of `age`. (Note: The "died" level of the outcome variable is missing. This is because if we know the proportion that were euthanized and lived, we also know the proportion that died.)```{r}train_te %>%as.data.frame() %>%select(starts_with('age') &ends_with('te'), age) %>%distinct()``````{r}# Drop the unencoded columnstrain_te %>%as.data.frame() %>%select(-all_of(encode_columns)) %>%as.h2o() -> train_tetest_te %>%as.data.frame() %>%select(-all_of(encode_columns)) %>%as.h2o() -> test_te``````{r}# Create a recipe to use laterrecipe_target <-recipe(outcome ~ ., data = train_te %>%as.data.frame() %>%select(-id)) %>%step_normalize(all_numeric_predictors())```# Embedding encodingEmbedding is categorical encoding method that that uses deep learning to represent categorical features as vectors. It's particularly useful for categorical features with many levels, since it can be used to project high-dimensional features into low-dimensional space.For example, the variable `pain` has 7 levels.```{r}levels(train$pain)```But using embeddings, I can "project" these 7 levels onto a smaller set of dimensions -- say 3.```{r}pain_embedding <-recipe(outcome ~ ., data = train %>%select(-id)) %>%step_normalize(all_numeric_predictors()) %>% embed::step_embed(pain, outcome =vars(outcome),predictors =all_numeric_predictors(),hidden_units =2,num_terms =3,keep_original_cols = T)tensorflow::set_random_seed(42)pain_embedding %>%prep() %>%bake(new_data =NULL) %>%select(starts_with('pain')) %>%distinct()```I want to use embedding strategically. Since embeddings are particularly useful to project into lower-dimensional space, this means it's going to be most useful for categorical variables that have many levels. For variables with fewer than 3 levels, I'll use one-hot encoding. For variables with more than 3 levels, I'll use embeddings and project them onto 3 levels. I'll project `hospital_number` onto 50 levels, and `lesion_1` onto 25 levels. (This is somewhat arbitrary; I did a quick few tests -- not shown here -- to arrive at these numbers.)```{r}length(levels(df$hospital_number))``````{r}cat_cols <-colnames(train %>%select(where(is.factor), -outcome, -hospital_number))cols_for_onehot <-c()cols_for_embedding <-c()cols_embedding_special <-c('lesion_1', 'hospital_number')for(col in cat_cols){if(nrow(distinct(train[col])) <=3){ cols_for_onehot =append(cols_for_onehot, col) }else { cols_for_embedding =append(cols_for_embedding, col) cols_for_embedding = cols_for_embedding[!cols_for_embedding %in% cols_embedding_special] }}recipe_embedding <-recipe(outcome ~ ., data = train %>%select(-id)) %>%step_normalize(all_numeric_predictors()) %>%step_novel(all_of(cols_for_onehot), new_level ="NA") %>%step_dummy(all_of(cols_for_onehot), one_hot=T) %>% embed::step_embed(all_of(cols_for_embedding), outcome =vars(outcome),predictors =all_numeric_predictors(),hidden_units =2,num_terms =3,keep_original_cols = F) %>% embed::step_embed(hospital_number, outcome =vars(outcome),predictors =all_numeric_predictors(),hidden_units =2,num_terms =50,keep_original_cols = F) %>% embed::step_embed(lesion_1, outcome =vars(outcome),predictors =all_numeric_predictors(),hidden_units =2,num_terms =25,keep_original_cols = F)```# ModelingNow onto some modeling.I'll define 3 models to evaluate: multinomial logistic regression, random forest, and xgboost.```{r}multinom_mod <-multinom_reg() %>%# Need to bump the max weights, otherwise it won't runset_engine("nnet", MaxNWts =10000) %>%set_mode("classification")ranger_mod <-rand_forest(trees=1000) %>%set_engine("ranger") %>%set_mode("classification")xgboost_mod <-boost_tree(trees=50) %>%set_engine("xgboost") %>%set_mode("classification")```## Model fit functionWith 3 models and 5 categorical encodings, I'll need to fit 15 models. To streamline this process, I'll define two functions:- `fit_model()`: Given training and test datasets, a workflow containing a recipe for the categorical encoding, a model type, and an encoding type, this function will evaluate the model in-sample using cross-validation, then evaluate it out-of-sample, and then return a dataframe containing the results- `fit_encodings()`: Given a model and model type, this function will generate recipes for each of the 5 categorical encodings, fit the 5 encodings using the model, and then return a dataframe with the results```{r}fit_model <-function(train, test, workflow, model_type, encoding_type){set.seed(42) folds <-vfold_cv(train, v =5) resampled_fit <- workflow %>%fit_resamples(folds,metrics =metric_set(f_meas))# Get in-sample F1 (resampled_fit %>%collect_metrics())$mean -> train_perf# Get out-of-sample F1 fit <- workflow %>%fit(train) test$pred <-predict(fit, test)$.pred_class (f_meas(test, outcome, pred, estimator='micro'))$.estimate -> test_perf# Combine in-sample and out-of-sample into a dataframe df_perf <-data.frame(model_type = model_type,encoding_type = encoding_type,train_perf = train_perf,test_perf = test_perf)return(df_perf)}# Given a model, run it across the 4 encodings and return a dataframe that summarizes the resultsfit_encodings <-function(model, model_type){set.seed(42) tensorflow::set_random_seed(42)# One-hot encoded model wflow_1hot <-workflow() %>%add_model(model) %>%add_recipe(recipe_1hot_with_novel)fit_model(train %>%select(-id), test %>%select(-id), wflow_1hot, model_type,'onehot') -> onehot_model_results# Label encoded model wflow_label <-workflow() %>%add_model(model) %>%add_recipe(recipe_label)fit_model(train %>%select(-id), test %>%select(-id), wflow_label, model_type,'label') -> label_model_results# Frequency encoded model wflow_freq <-workflow() %>%add_model(model) %>%add_recipe(recipe_freq)fit_model(train_freq %>%select(-id), test_freq %>%select(-id), wflow_freq, model_type,'frequency') -> freq_model_results# Target encoded model wflow_target <-workflow() %>%add_model(model) %>%add_recipe(recipe_target)fit_model(train_te %>%as.data.frame() %>%select(-id), test_te %>%as.data.frame() %>%select(-id), wflow_target, model_type,'target') -> target_model_results# Embedding encoded model wflow_embedding <-workflow() %>%add_model(model) %>%add_recipe(recipe_embedding)fit_model(train %>%as.data.frame() %>%select(-id), test %>%as.data.frame() %>%select(-id), wflow_embedding, model_type,'embedding') -> embedding_model_results# Compile results into a dataframe onehot_model_results %>%bind_rows(label_model_results) %>%bind_rows(freq_model_results) %>%bind_rows(target_model_results) %>%bind_rows(embedding_model_results) -> results results}```I'll run each of the models using the `fit_encodings()` and `fit_model()` functions that I just defined.```{r}fit_encodings(multinom_mod, 'multinomial logistic') -> multinom_resultsfit_encodings(ranger_mod, 'random forest') -> rf_resultsfit_encodings(xgboost_mod, 'xgboost') -> xgb_results```# Model resultsLooking at the results, I can see that best models used embedding encoding.```{r}multinom_results %>%bind_rows(rf_results) %>%bind_rows(xgb_results) -> model_resultsmodel_results %>%arrange(desc(test_perf))model_results %>%mutate(colors =ifelse(encoding_type =='embedding', '1', '0')) %>%ggplot() +geom_col(aes(x = model_type, group = encoding_type, fill = encoding_type, y = test_perf,color = colors), position='dodge') +scale_y_continuous(limits=c(0.60, 0.75), oob = rescale_none, breaks =seq(0.60, 0.75, 0.01)) +labs(title ='F1 score by model type and categorical encoding method', subtitle ='The best models used embedding encoding',fill ='Encoding', y ='F1 score', x ='Model') +scale_color_manual(values=c("white", "black")) +guides(colour ="none") +theme_minimal()``````{r}#ggsave <- function(..., bg = 'white') ggplot2::ggsave(..., bg = bg)#ggsave('social-image.png', width=1600, height=900, units='px')```