Encoding high cardinality features with "embeddings"

Embedding is a categorical encoding method that uses deep learning to represent categorical features as vectors. It’s particularly useful for categorical features with many levels, since it can be used to project high-dimensional features into a low-dimensional space.

In this blog post, I’ll show how ML models with embedding encoding outperform models with other common categorical encoding methods (frequency, label, one-hot, and target). For this demonstration, I’ll be using the dataset from Kaggle’s Playground Series S3E22: Predict Health Outcomes of Horses.

Load libraries

library(tidyverse)
library(tidymodels)
library(lares)
library(ranger)
library(xgboost)
library(tensorflow)
library(keras)
library(h2o)
library(encodeR)

h2o.init()
h2o.no_progress()

Load the data

df <- read_csv('data/playground-series-s3e22/train.csv', show_col_types = F)

Categorical variables

This dataset contains 17 character columns.

colnames(df[, sapply(df, class) == 'character'])

However, there are also several numeric categorical columns: hospital_number, lesion_1, lesion_2, and lesion_3.

df %>%
  select(hospital_number, lesion_1, lesion_2, lesion_3) %>%
  head()

hospital_number and lesion_1 are of particular interest because they have so many levels.

length(unique(df$hospital_number))
length(unique(df$lesion_1))
length(unique(df$lesion_2))
length(unique(df$lesion_3))

Looking at the character columns, I see some case inconsistency (e.g., some columns have both “None” and “none”). Converting all strings to lowercase would help to at least combine the “none” variants.

char_cols <- colnames(df %>% select(where(is.character)))
for(col in char_cols){
  print(paste0(toupper(col), ': ', paste0(distinct(df[col])[[1]], collapse=', ')))
}
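
As a tiny illustration of what lowercasing buys us (toy values, not from the dataset), “None” and “none” collapse into a single level:

# Toy example: "None" and "none" are counted separately until lowercased
table(c("None", "none", "slight"))
table(tolower(c("None", "none", "slight")))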

Train/test split

Before diving into categorical encoding methods, I’ll do a train/test split. I’ll also convert the character columns to lowercase to address the problem I mentioned above (“None” vs. “none”), and I’ll convert hospital_number to categorical.

set.seed(42)

df %>% mutate(across(where(is.character), tolower)) %>%
  mutate(outcome = as.factor(outcome)) %>%
  mutate(across(where(is.character), factor),
         hospital_number = as.factor(hospital_number),
         lesion_1 = as.factor(lesion_1),
         lesion_2 = as.factor(lesion_2),
         lesion_3 = as.factor(lesion_3)) -> df

split <- initial_split(df)
train <- training(split)
test <- testing(split)

Categorical encoding

One-hot encoding

recipe_1hot_with_novel <-
  recipe(outcome ~ ., data = train %>% select(-id)) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors(), new_level = "NA") %>%
  step_dummy(all_nominal_predictors(), one_hot=T)

The first -- and probably most popular -- type of categorical encoding is one-hot encoding. One-hot encoding transforms a single categorical variable with N levels into N binary variables, one indicating each level.

For example, age is a categorical variable with 2 levels.

levels(train$age)
length(levels(train$age))

When age is one-hot encoded, a column is created for each level to encode the value (e.g., if the original value was adult, then the age_adult column gets a 1 and the other columns get a 0). And since I’ve also included a step to encode novel levels as NA, there is also a third column for that.

recipe_1hot_with_novel %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(starts_with('age')) %>%
  head(3)
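
One practical drawback of one-hot encoding high-cardinality variables like hospital_number is that the design matrix gets very wide. A quick (optional) check of the total column count after baking makes that concrete:

# How wide does the data get after one-hot encoding every categorical column?
recipe_1hot_with_novel %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ncol()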

Label encoding

recipe_label <-
  recipe(outcome ~ ., data = train %>% select(-id)) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_integer(all_nominal_predictors())

With label encoding, each level of the categorical variable is given an (arbitrary) number. In the tidymodels framework, step_integer works like scikit-learn’s LabelEncoder, and encodes new values as zero. Here we see that one level of age was encoded as “1” and the other was encoded as “2”.

recipe_label %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(age) %>%
  distinct
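
To see the behavior for unseen levels, here’s a minimal sketch on toy data (hypothetical values, not from the horse dataset): if step_integer handles new values as described above, the level “c”, which was never seen at prep() time, should be encoded as 0.

# Toy sketch: unseen levels at bake() time should get the value 0
toy_train <- tibble(x = factor(c("a", "b", "b")), y = c(1, 2, 3))
toy_test  <- tibble(x = factor("c"), y = 4)

recipe(y ~ x, data = toy_train) %>%
  step_integer(all_nominal_predictors()) %>%
  prep() %>%
  bake(new_data = toy_test)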

Frequency encoding

freq_encoding <- encodeR::frequency_encoder(
  X_train = train,
  X_test = test,
  cat_columns = colnames(df %>% select(where(is.factor), -outcome))
)

train_freq <- freq_encoding$train
test_freq <- freq_encoding$test

With frequency encoding, levels of the categorical variable are replaced with their frequency. Here, we can see how the levels of age have been replaced with their frequency in the training set. (When this is applied to the test set, these same training frequencies will be used.)

train_freq %>%
  select(age) %>%
  distinct()
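
As a quick manual check (not part of encodeR), the encoded values above should line up with how often each level of age appears in the training data; depending on the encoder’s convention that’s either the raw count or the proportion, so both are shown here:

# Manual frequency calculation for comparison with train_freq
train %>%
  count(age) %>%
  mutate(prop = n / sum(n))
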
recipe_freq <-
  recipe(outcome ~ ., data = train_freq %>% select(-id)) %>%
  step_normalize(all_numeric_predictors())

Target encoding

For target encoding (also called “effect encoding” or “likelihood encoding”), I’ll be using the h2o package because it supports multi-class targets. (The embed package can also do target encoding and integrates better with a tidymodels workflow, but at the moment it only supports binary targets.)

Using h2o requires some additional setup.

# Convert to h2o format
df_h2o <- as.h2o(df)

# Split the dataset into train and test
splits_h2o <- h2o.splitFrame(data = df_h2o, ratios = .8, seed = 42)
train_h2o <- splits_h2o[[1]]
test_h2o <- splits_h2o[[2]]

With target encoding, the levels of the categorical variable are replaced with their mean value on the target. For example, if the level “young” was associated with a mean target value of 0.75, then this is the value with which that level would be replaced.

Because the outcome is being used for encoding, care needs to be taken when using this method to avoid leakage and overfitting. In this case, I’ll use the “Leave One Out” method: for each row, the mean is calculated over all rows excluding that row.
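
To make the leave-one-out idea concrete, here’s a hand-rolled sketch with dplyr (illustrative only; it’s not the h2o implementation, and the blended values h2o produces will differ): for each row, average an indicator of one outcome level over all the other rows with the same value of age.

# Leave-one-out mean of the "lived" outcome within each level of age,
# excluding the current row from its own average
train %>%
  mutate(lived = as.integer(outcome == "lived")) %>%
  group_by(age) %>%
  mutate(age_lived_loo = (sum(lived) - lived) / (n() - 1)) %>%
  ungroup() %>%
  select(age, lived, age_lived_loo) %>%
  head()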

# Choose which columns to encode
encode_columns <- colnames(df %>% select(where(is.factor), -outcome)) # All categorical variables

# Train a TE model
te_model <- h2o.targetencoder(x = encode_columns,
                              y = 'outcome',
                              keep_original_categorical_columns=T,
                              training_frame = train_h2o,
                              noise=0,
                              seed=100,
                              blending = T, # Blending helps with levels that are more rare
                              data_leakage_handling = "LeaveOneOut")

# New target encoded training and test datasets
train_te <- h2o.transform(te_model, train_h2o)
test_te <- h2o.transform(te_model, test_h2o)

Here we can see how the target encoding strategy encoded age: Two new variables are created, age_euthanized_te and age_lived_te. The encoded values represent the proportion of cases that were euthanized, or lived, for each level of age. (Note: The “died” level of the outcome variable is missing. This is because if we know the proportion that were euthanized and lived, we also know the proportion that died.)

train_te %>%
  as.data.frame() %>%
  select(starts_with('age') & ends_with('te'), age) %>%
  distinct()

# Drop the unencoded columns
train_te %>%
  as.data.frame() %>%
  select(-all_of(encode_columns)) %>%
  as.h2o() -> train_te
test_te %>%
  as.data.frame() %>%
  select(-all_of(encode_columns)) %>%
  as.h2o() -> test_te

# Create a recipe to use later
recipe_target <-
  recipe(outcome ~ ., data = train_te %>% as.data.frame() %>% select(-id)) %>%
  step_normalize(all_numeric_predictors())

Embedding encoding

As noted above, embedding is a categorical encoding method that uses deep learning to represent categorical features as vectors. It’s particularly useful for categorical features with many levels, since it can be used to project high-dimensional features into a low-dimensional space.

For example, the variable pain has 7 levels.

levels(train$pain)

But using embeddings, I can “project” these 7 levels onto a smaller set of dimensions -- say 3.

pain_embedding <-
  recipe(outcome ~ ., data = train %>% select(-id)) %>%
  step_normalize(all_numeric_predictors()) %>%
  embed::step_embed(pain,
                    outcome = vars(outcome),
                    predictors = all_numeric_predictors(),
                    hidden_units = 2,
                    num_terms = 3,
                    keep_original_cols = T)

tensorflow::set_random_seed(42)
pain_embedding %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(starts_with('pain')) %>%
  distinct()
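
Under the hood, the fitted embedding is just a lookup table with one row per level of pain and num_terms columns of learned weights. If you want to inspect it, tidy() on the prepped recipe should expose those values (a sketch; the exact columns returned by embed may differ):

# Inspect the learned embedding values for each level of pain
pain_embedding %>%
  prep() %>%
  tidy(number = 2)  # step_embed is the second step in this recipe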

I want to use embedding strategically. Since embeddings are most valuable for projecting into a lower-dimensional space, they’re going to be most useful for categorical variables that have many levels. For variables with 3 or fewer levels, I’ll use one-hot encoding. For variables with more than 3 levels, I’ll use embeddings and project them onto 3 dimensions. I’ll project hospital_number onto 50 dimensions, and lesion_1 onto 25 dimensions. (This is somewhat arbitrary; I did a few quick tests -- not shown here -- to arrive at these numbers.)

length(levels(df$hospital_number))

# Split the categorical columns into two groups:
#   - 3 or fewer levels: one-hot encode
#   - more than 3 levels: embed (hospital_number and lesion_1 get their own, larger embeddings)
cat_cols <- colnames(train %>% select(where(is.factor), -outcome, -hospital_number))
cols_for_onehot <- c()
cols_for_embedding <- c()
cols_embedding_special <- c('lesion_1', 'hospital_number')
for(col in cat_cols){
  if(nrow(distinct(train[col])) <= 3){
    cols_for_onehot = append(cols_for_onehot, col)
  }
  else {
    cols_for_embedding = append(cols_for_embedding, col)
    # lesion_1 is embedded separately below, so keep it out of the generic embedding list
    cols_for_embedding = cols_for_embedding[!cols_for_embedding %in% cols_embedding_special]
  }
}
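
Before building the recipe, a quick look at how the columns ended up assigned:

cols_for_onehot
cols_for_embedding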

recipe_embedding <-
  recipe(outcome ~ ., data = train %>% select(-id)) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_novel(all_of(cols_for_onehot), new_level = "NA") %>%
  step_dummy(all_of(cols_for_onehot), one_hot=T) %>%
  embed::step_embed(all_of(cols_for_embedding),
                    outcome = vars(outcome),
                    predictors = all_numeric_predictors(),
                    hidden_units = 2,
                    num_terms = 3,
                    keep_original_cols = F) %>%
  embed::step_embed(hospital_number,
                    outcome = vars(outcome),
                    predictors = all_numeric_predictors(),
                    hidden_units = 2,
                    num_terms = 50,
                    keep_original_cols = F) %>%
  embed::step_embed(lesion_1,
                    outcome = vars(outcome),
                    predictors = all_numeric_predictors(),
                    hidden_units = 2,
                    num_terms = 25,
                    keep_original_cols = F)
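
As an optional sanity check, baking this recipe shows how wide the embedding-encoded feature set is compared with the fully one-hot-encoded version above (prepping it does train the small keras networks, so it takes a moment):

# Column count after embedding encoding, for comparison with the one-hot version
recipe_embedding %>%
  prep() %>%
  bake(new_data = NULL) %>%
  ncol()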

Modeling

Now onto some modeling.

I’ll define 3 models to evaluate: multinomial logistic regression, random forest, and xgboost.

multinom_mod <-
  multinom_reg() %>%
  # Need to bump the max weights, otherwise it won't run
  set_engine("nnet", MaxNWts = 10000) %>%
  set_mode("classification")

ranger_mod <-
  rand_forest(trees=1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

xgboost_mod <-
  boost_tree(trees=50) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

Model fit function

With 3 models and 5 categorical encodings, I’ll need to fit 15 models. To streamline this process, I’ll define two functions:

fit_model <- function(train, test, workflow, model_type, encoding_type){

  set.seed(42)
  folds <- vfold_cv(train, v = 5)

  resampled_fit <-
    workflow %>%
    fit_resamples(folds,
                  metrics = metric_set(f_meas))

  # Get cross-validated F1 on the training data
  (resampled_fit %>%
    collect_metrics())$mean -> train_perf

  # Get out-of-sample F1
  fit <-
    workflow %>%
    fit(train)

  test$pred <- predict(fit, test)$.pred_class
  (f_meas(test, outcome, pred, estimator='micro'))$.estimate -> test_perf

  # Combine in-sample and out-of-sample into a dataframe
  df_perf <- data.frame(model_type = model_type,
                        encoding_type = encoding_type,
                        train_perf = train_perf,
                        test_perf = test_perf)
  return(df_perf)
}


# Given a model, run it across the 5 encodings and return a dataframe that summarizes the results
fit_encodings <- function(model, model_type){

  set.seed(42)
  tensorflow::set_random_seed(42)

  # One-hot encoded model
  wflow_1hot <-
    workflow() %>%
    add_model(model) %>%
    add_recipe(recipe_1hot_with_novel)

  fit_model(train %>% select(-id),
            test %>% select(-id),
            wflow_1hot,
            model_type,
            'onehot') -> onehot_model_results

  # Label encoded model
  wflow_label <-
    workflow() %>%
    add_model(model) %>%
    add_recipe(recipe_label)

  fit_model(train %>% select(-id),
            test %>% select(-id),
            wflow_label,
            model_type,
            'label') -> label_model_results

  # Frequency encoded model
  wflow_freq <-
    workflow() %>%
    add_model(model) %>%
    add_recipe(recipe_freq)

  fit_model(train_freq %>% select(-id),
            test_freq %>% select(-id),
            wflow_freq,
            model_type,
            'frequency') -> freq_model_results

  # Target encoded model
  wflow_target <-
    workflow() %>%
    add_model(model) %>%
    add_recipe(recipe_target)

  fit_model(train_te %>% as.data.frame() %>% select(-id),
            test_te %>% as.data.frame() %>% select(-id),
            wflow_target,
            model_type,
            'target') -> target_model_results


  # Embedding encoded model
  wflow_embedding <-
    workflow() %>%
    add_model(model) %>%
    add_recipe(recipe_embedding)

  fit_model(train %>% as.data.frame() %>% select(-id),
            test %>% as.data.frame() %>% select(-id),
            wflow_embedding,
            model_type,
            'embedding') -> embedding_model_results

  # Compile results into a dataframe
  onehot_model_results %>%
    bind_rows(label_model_results) %>%
    bind_rows(freq_model_results) %>%
    bind_rows(target_model_results) %>%
    bind_rows(embedding_model_results) -> results

  results
}

I’ll run each of the models using the fit_encodings() and fit_model() functions that I just defined.

fit_encodings(multinom_mod, 'multinomial logistic') -> multinom_results
fit_encodings(ranger_mod, 'random forest') -> rf_results
fit_encodings(xgboost_mod, 'xgboost') -> xgb_results

Model results

Looking at the results, I can see that the best models used embedding encoding.

multinom_results %>%
  bind_rows(rf_results) %>%
  bind_rows(xgb_results) -> model_results

model_results %>%
  arrange(desc(test_perf))

model_results %>%
  mutate(colors = ifelse(encoding_type == 'embedding', '1', '0')) %>%
  ggplot() +
    geom_col(aes(x = model_type,
                 group = encoding_type,
                 fill = encoding_type,
                 y = test_perf,
                 color = colors), position='dodge') +
    scale_y_continuous(limits=c(0.60, 0.75), oob = scales::rescale_none, breaks = seq(0.60, 0.75, 0.01)) +
    labs(title = 'F1 score by model type and categorical encoding method',
         subtitle = 'The best models used embedding encoding',
         fill = 'Encoding', y = 'F1 score', x = 'Model') +
    scale_color_manual(values=c("white", "black")) +
    guides(colour = "none") +
    theme_minimal()