Embedding is a categorical encoding method that uses deep learning to represent categorical features as vectors. It’s particularly useful for categorical features with many levels, since it can be used to project high-dimensional features into low-dimensional space.

In this blog post, I’ll show how ML models with embedding encoding outperform models with other common categorical encoding methods (frequency, label, one-hot, and target). For this demonstration, I’ll be using the dataset from Kaggle’s Playground Series S3E22: Predict Health Outcomes of Horses.

Load libraries

library(tidyverse)
library(tidymodels)
library(lares)
library(ranger)
library(xgboost)
library(tensorflow)
library(keras)
library(h2o)
library(encodeR)

h2o.init()

 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         6 days 23 hours 
    H2O cluster timezone:       America/New_York 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.42.0.2 
    H2O cluster version age:    1 month and 30 days 
    H2O cluster name:           H2O_started_from_R_tyler_wkj161 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.11 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 4.3.1 (2023-06-16)

h2o.no_progress()

Load the data

df <- read_csv('data/playground-series-s3e22/train.csv', show_col_types = F)

Categorical variables

This dataset contains 17 character columns.

colnames(df[, sapply(df, class) == 'character'])

 [1] "surgery"               "age"                   "temp_of_extremities"  
 [4] "peripheral_pulse"      "mucous_membrane"       "capillary_refill_time"
 [7] "pain"                  "peristalsis"           "abdominal_distention" 
[10] "nasogastric_tube"      "nasogastric_reflux"    "rectal_exam_feces"    
[13] "abdomen"               "abdomo_appearance"     "surgical_lesion"      
[16] "cp_data"               "outcome"

However, there are also several numeric categorical columns: hospital_number, lesion_1, lesion_2, and lesion_3.

df %>%
  select(hospital_number, lesion_1, lesion_2, lesion_3) %>%
  head()

# A tibble: 6 × 4
  hospital_number lesion_1 lesion_2 lesion_3
            <dbl>    <dbl>    <dbl>    <dbl>
1          530001     2209        0        0
2          533836     2208        0        0
3          529812     5124        0        0
4         5262541     2208        0        0
5         5299629        0        0        0
6          529642        0        0        0

hospital_number and lesion_1 are of particular interest because they have so many levels.

length(unique(df$hospital_number))

[1] 255

length(unique(df$lesion_1))

[1] 57

length(unique(df$lesion_2))

[1] 4

length(unique(df$lesion_3))

[1] 2

Looking at the character columns, I see some case inconsistency (i.e., some columns have both “None” and “none”). Converting all strings to lowercase would help to at least combine the “none” types.

char_cols <- colnames(df %>% select(where(is.character)))
for(col in char_cols){
  print(paste0(toupper(col), ': ', paste0(distinct(df[col])[[1]], collapse=', ')))
}

[1] "SURGERY: yes, no"
[1] "AGE: adult, young"
[1] "TEMP_OF_EXTREMITIES: cool, cold, normal, warm, None"
[1] "PERIPHERAL_PULSE: reduced, normal, None, absent, increased"
[1] "MUCOUS_MEMBRANE: dark_cyanotic, pale_cyanotic, pale_pink, normal_pink, bright_pink, bright_red, None"
[1] "CAPILLARY_REFILL_TIME: more_3_sec, less_3_sec, None, 3"
[1] "PAIN: depressed, mild_pain, extreme_pain, alert, severe_pain, None, slight"
[1] "PERISTALSIS: absent, hypomotile, normal, hypermotile, None, distend_small"
[1] "ABDOMINAL_DISTENTION: slight, moderate, none, severe, None"
[1] "NASOGASTRIC_TUBE: slight, none, significant, None"
[1] "NASOGASTRIC_REFLUX: less_1_liter, more_1_liter, none, None, slight"
[1] "RECTAL_EXAM_FECES: decreased, absent, None, normal, increased, serosanguious"
[1] "ABDOMEN: distend_small, distend_large, normal, firm, None, other"
[1] "ABDOMO_APPEARANCE: serosanguious, cloudy, clear, None"
[1] "SURGICAL_LESION: yes, no"
[1] "CP_DATA: no, yes"
[1] "OUTCOME: died, euthanized, lived"

Train/test split

Before diving into categorical encoding methods, I’ll do a train/test split. I’ll also convert the character columns to lowercase to address the problem I mentioned above (“None” vs. “none”), and I’ll convert hospital_number and the three lesion columns to factors.

set.seed(42)

df %>% mutate(across(where(is.character), tolower)) %>%
  mutate(outcome = as.factor(outcome)) %>%
  mutate(across(where(is.character), factor),
         hospital_number = as.factor(hospital_number),
         lesion_1 = as.factor(lesion_1),
         lesion_2 = as.factor(lesion_2),
         lesion_3 = as.factor(lesion_3)) -> df

split <- initial_split(df)
train <- training(split)
test <- testing(split)

Categorical encoding

One-hot encoding

recipe_1hot_with_novel <- 
  recipe(outcome ~ ., data = train %>% select(-id)) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors(), new_level = "NA") %>%
  step_dummy(all_nominal_predictors(), one_hot=T)

The first – and probably most popular – type of categorical encoding is one-hot encoding. One-hot encoding transforms a single categorical variable with N levels into binary variables encoding each of the N levels.

For example, age is a categorical variable with 2 levels.

levels(train$age)

[1] "adult" "young"

length(levels(train$age))

[1] 2

When age is one-hot encoded, a column is created for each level to encode the value (e.g., if the original value was adult, then the age_adult column gets a 1 and the other columns get a 0). And since I’ve also included a step to encode novel levels as NA, there is also a third column for that.

recipe_1hot_with_novel %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(starts_with('age')) %>%
  head(3)

# A tibble: 3 × 3
  age_adult age_young age_NA.
      <dbl>     <dbl>   <dbl>
1         1         0       0
2         1         0       0
3         1         0       0
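
To make the mechanics concrete, here is a minimal base-R sketch of one-hot encoding with an extra bucket for unseen levels. The helper one_hot(), its arguments, and the toy vectors are made up for illustration; this is not how step_novel() or step_dummy() are implemented.

```r
# Hypothetical helper: one-hot encode `x` using levels learned from training data.
# Values not seen in training fall into an extra "new" column.
one_hot <- function(x, train_levels, prefix, new_level = "new") {
  all_levels <- c(train_levels, new_level)
  x <- as.character(x)
  x[!x %in% train_levels] <- new_level
  # One 0/1 indicator column per level
  out <- vapply(all_levels, function(l) as.integer(x == l), integer(length(x)))
  matrix(out, nrow = length(x),
         dimnames = list(NULL, paste(prefix, all_levels, sep = "_")))
}

train_age <- c("adult", "adult", "young")
test_age  <- c("young", "foal")   # "foal" never appears in training

encoded_age <- one_hot(test_age, train_levels = sort(unique(train_age)), prefix = "age")
encoded_age
```

Each row has exactly one 1: the first row lands in age_young, and the unseen "foal" lands in age_new.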

Label encoding

recipe_label <- 
  recipe(outcome ~ ., data = train %>% select(-id)) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_integer(all_nominal_predictors())

With label encoding, each level of the categorical variable is given an (arbitrary) number. In the tidymodels framework, step_integer plays a similar role to scikit-learn’s LabelEncoder, and encodes new values as zero. Here we see that one level of age was encoded as “1” and the other was encoded as “2”.

recipe_label %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(age) %>%
  distinct()

# A tibble: 2 × 1
    age
  <int>
1     1
2     2
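
The same idea can be sketched in a few lines of base R. The helper label_encode() is hypothetical and only mimics the behavior described above (training levels get 1, 2, ..., unseen values get 0); it is not the step_integer() source.

```r
# Hypothetical helper: map each value to its position among the training levels.
# Values not seen in training are mapped to 0. For simplicity, missing values
# are also mapped to 0 here.
label_encode <- function(x, train_levels) {
  codes <- match(as.character(x), train_levels)
  codes[is.na(codes)] <- 0L
  codes
}

train_levels <- sort(unique(c("adult", "young", "adult")))
age_codes <- label_encode(c("young", "adult", "foal"), train_levels)
age_codes
```

Note that the integers carry no meaning beyond identifying the level, which is why tree-based models tend to cope with this encoding better than linear ones.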

Frequency encoding

freq_encoding <- encodeR::frequency_encoder(
  X_train = train,
  X_test = test, 
  cat_columns = colnames(df %>% select(where(is.factor), -outcome))
)

train_freq <- freq_encoding$train
test_freq <- freq_encoding$test

With frequency encoding, levels of the categorical variable are replaced with their frequency. Here, we can see how the levels of age have been replaced with their frequency in the training set. (When this is applied to the test set, these same training frequencies will be used.)

train_freq %>%
  select(age) %>%
  distinct()

# A tibble: 2 × 1
     age
   <dbl>
1 0.937 
2 0.0626

recipe_freq <- 
  recipe(outcome ~ ., data = train_freq %>% select(-id)) %>%
  step_normalize(all_numeric_predictors())
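
For intuition, here is a rough base-R sketch of the frequency mapping. The helper frequency_encode() is made up for this post and is not the encodeR implementation; in particular, returning NA for unseen levels is my own assumption.

```r
# Hypothetical helper: learn level frequencies from the training column,
# then apply that same mapping to another column.
frequency_encode <- function(train_x, new_x) {
  counts <- table(train_x)
  # Share of training rows per level, as a plain named vector
  freqs <- setNames(as.numeric(counts) / sum(counts), names(counts))
  # Levels that never appeared in training have no frequency and become NA
  unname(freqs[as.character(new_x)])
}

train_age <- c("adult", "adult", "adult", "young")
test_age  <- c("young", "adult", "foal")

test_age_freq <- frequency_encode(train_age, test_age)
test_age_freq
```

Because the mapping is learned only from the training column, the test set never influences the encoded values.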

Target encoding

For target encoding (also called “effect encoding” or “likelihood encoding”), I’ll be using the h2o package because it supports multi-class targets. (The embed package can also do target encoding and integrates better with a tidymodels workflow, but at the moment it only supports binary targets.)

Using h2o requires some additional setup.

# Convert to h2o format
df_h2o <- as.h2o(df)

# Split the dataset into train and test
splits_h2o <- h2o.splitFrame(data = df_h2o, ratios = .8, seed = 42)
train_h2o <- splits_h2o[[1]]
test_h2o <- splits_h2o[[2]]

With target encoding, the levels of the categorical variable are replaced with their mean value on the target. For example, if the level “young” was associated with a mean target value of 0.75, then this is the value with which that level would be replaced.

Because the outcome is being used for encoding, care needs to be taken when using this method to avoid leakage and overfitting. In this case, I’ll use the “Leave One Out” method: for each row, the mean is calculated over all rows excluding that row.
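
Here is a small sketch of the leave-one-out calculation for a single outcome class, in base R. The helper loo_target_encode() and the toy data are hypothetical, and the sketch leaves out blending and noise, so it will not reproduce h2o's numbers.

```r
# Hypothetical helper: leave-one-out target encoding for one outcome class.
# `x` is the categorical column, `y` is a 0/1 indicator for the class of interest.
loo_target_encode <- function(x, y) {
  level_sum <- ave(y, x, FUN = sum)      # sum of y within each level
  level_n   <- ave(y, x, FUN = length)   # number of rows within each level
  # Exclude the current row from both the numerator and the denominator.
  # A level with a single row gives 0/0 (NaN), which is one reason blending helps.
  (level_sum - y) / (level_n - 1)
}

age   <- c("adult", "adult", "adult", "young", "young")
lived <- c(1, 0, 1, 1, 0)

age_lived_loo <- loo_target_encode(age, lived)
age_lived_loo
```

Rows within the same level can get different encoded values, because each row's own outcome is left out of its mean.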

# Choose which columns to encode
encode_columns <- colnames(df %>% select(where(is.factor), -outcome)) # All categorical variables

# Train a TE model
te_model <- h2o.targetencoder(x = encode_columns,
                              y = 'outcome', 
                              keep_original_categorical_columns=T,
                              training_frame = train_h2o,
                              noise=0,
                              seed=100,
                              blending = T, # Blending helps with levels that are more rare
                              data_leakage_handling = "LeaveOneOut")

# New target encoded training and test datasets
train_te <- h2o.transform(te_model, train_h2o)
test_te <- h2o.transform(te_model, test_h2o)

Here we can see how the target encoding strategy encoded age: Two new variables are created, age_euthanized_te and age_lived_te. The encoded values represent the proportion of cases that were euthanized, or lived, for each level of age. (Note: The “died” level of the outcome variable is missing. This is because if we know the proportion that were euthanized and lived, we also know the proportion that died.)

train_te %>%
  as.data.frame() %>%
  select(starts_with('age') & ends_with('te'), age) %>%
  distinct()

  age_euthanized_te age_lived_te   age
1        0.20937841    0.4776445 adult
2        0.06005923    0.2157985 young
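
As a quick check on the note above, the implied “died” proportion can be recovered by subtraction. This assumes the three encoded proportions sum to 1 for each level; the values are copied from the output above, and the column names in this toy data frame are my own.

```r
# Encoded proportions copied from the output above
age_te <- data.frame(
  age           = c("adult", "young"),
  euthanized_te = c(0.20937841, 0.06005923),
  lived_te      = c(0.4776445, 0.2157985)
)

# Assumption: the three class proportions sum to 1 for each level
age_te$died_te <- 1 - age_te$euthanized_te - age_te$lived_te
age_te
```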

# Drop the unencoded columns
train_te %>% 
  as.data.frame() %>%
  select(-all_of(encode_columns)) %>%
  as.h2o() -> train_te
test_te %>% 
  as.data.frame() %>%
  select(-all_of(encode_columns)) %>%
  as.h2o() -> test_te

# Create a recipe to use later
recipe_target <- 
  recipe(outcome ~ ., data = train_te %>% as.data.frame() %>% select(-id)) %>%
  step_normalize(all_numeric_predictors())

Embedding encoding

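Embedding encoding uses a small neural network to learn a vector for each level of a categorical variable, so a variable with many levels can be represented in fewer dimensions. For example, the variable pain has 7 levels.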

levels(train$pain)

[1] "alert"        "depressed"    "extreme_pain" "mild_pain"    "none"        
[6] "severe_pain"  "slight"

But using embeddings, I can “project” these 7 levels onto a smaller set of dimensions – say 3.

pain_embedding <- 
  recipe(outcome ~ ., data = train %>% select(-id)) %>%
  step_normalize(all_numeric_predictors()) %>%
  embed::step_embed(pain, 
                    outcome = vars(outcome),
                    predictors = all_numeric_predictors(),
                    hidden_units = 2,
                    num_terms = 3,
                    keep_original_cols = T)

tensorflow::set_random_seed(42)
pain_embedding %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(starts_with('pain')) %>%
  distinct()

# A tibble: 6 × 4
  pain         pain_embed_1 pain_embed_2 pain_embed_3
  <fct>               <dbl>        <dbl>        <dbl>
1 mild_pain         0.0466      -0.00297     0.000295
2 depressed         0.0470      -0.0332      0.00238 
3 severe_pain      -0.00442     -0.0437     -0.0509  
4 alert             0.0450      -0.0422      0.0152  
5 extreme_pain      0.0131      -0.0414     -0.00536 
6 none              0.0293      -0.0480     -0.0285
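
Conceptually, the result is a lookup table with one row per level and one column per embedding dimension. The base-R sketch below uses random numbers in place of learned weights (an assumption purely for illustration) to show that looking up a row is the same as multiplying a one-hot vector by the embedding matrix.

```r
set.seed(42)

pain_levels <- c("alert", "depressed", "extreme_pain", "mild_pain",
                 "none", "severe_pain", "slight")
n_dims <- 3

# Made-up embedding matrix: one row per level, one column per dimension.
# In a real embedding these weights are learned by the network; here they are random.
embedding <- matrix(rnorm(length(pain_levels) * n_dims, sd = 0.05),
                    nrow = length(pain_levels),
                    dimnames = list(pain_levels, paste0("pain_embed_", 1:n_dims)))

# Encoding a value is a row lookup...
x <- c("mild_pain", "none", "mild_pain")
encoded <- embedding[x, , drop = FALSE]

# ...which matches multiplying the one-hot matrix by the embedding matrix
one_hot_x <- outer(x, pain_levels, "==") * 1
lookup_matches <- all.equal(unname(encoded), unname(one_hot_x %*% embedding))
lookup_matches
```

So 7 one-hot columns are replaced by 3 dense columns, and levels that behave similarly can end up with similar vectors once the weights are learned.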

I want to use embedding strategically. Because embeddings are most useful for projecting into a lower-dimensional space, they help most with categorical variables that have many levels. For variables with 3 or fewer levels, I’ll use one-hot encoding. For variables with more than 3 levels, I’ll use embeddings and project them onto 3 dimensions. The two high-cardinality variables get more room: I’ll project hospital_number onto 50 dimensions and lesion_1 onto 25. (These numbers are somewhat arbitrary; I ran a few quick tests, not shown here, to arrive at them.)

length(levels(df$hospital_number))

[1] 255

# hospital_number and lesion_1 get their own embedding steps below
cols_embedding_special <- c('lesion_1', 'hospital_number')
cat_cols <- colnames(train %>% select(where(is.factor), -outcome, -all_of(cols_embedding_special)))

# Number of distinct values of each remaining categorical column in the training set
n_levels <- sapply(train[cat_cols], n_distinct)

# 3 or fewer levels: one-hot; more than 3 levels: embedding
cols_for_onehot <- cat_cols[n_levels <= 3]
cols_for_embedding <- cat_cols[n_levels > 3]

recipe_embedding <- 
  recipe(outcome ~ ., data = train %>% select(-id)) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_novel(all_of(cols_for_onehot), new_level = "NA") %>%
  step_dummy(all_of(cols_for_onehot), one_hot=T) %>%
  embed::step_embed(all_of(cols_for_embedding), 
                    outcome = vars(outcome),
                    predictors = all_numeric_predictors(),
                    hidden_units = 2,
                    num_terms = 3,
                    keep_original_cols = F) %>%
  embed::step_embed(hospital_number, 
                    outcome = vars(outcome),
                    predictors = all_numeric_predictors(),
                    hidden_units = 2,
                    num_terms = 50,
                    keep_original_cols = F) %>%
  embed::step_embed(lesion_1, 
                    outcome = vars(outcome),
                    predictors = all_numeric_predictors(),
                    hidden_units = 2,
                    num_terms = 25,
                    keep_original_cols = F)

Modeling

Now onto some modeling.

I’ll define 3 models to evaluate: multinomial logistic regression, random forest, and xgboost.

multinom_mod <-
  multinom_reg() %>%
  # Need to bump the max weights, otherwise it won't run
  set_engine("nnet", MaxNWts = 10000) %>% 
  set_mode("classification")

ranger_mod <-
  rand_forest(trees=1000) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

xgboost_mod <-
  boost_tree(trees=50) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

Model fit function

With 3 models and 5 categorical encodings, I’ll need to fit 15 models. To streamline this process, I’ll define two functions:

fit_model(): Given training and test datasets, a workflow (a recipe for the categorical encoding plus a model), a model type, and an encoding type, this function evaluates the model with 5-fold cross-validation on the training set, then evaluates it on the test set, and returns a dataframe containing the results.
fit_encodings(): Given a model and model type, this function builds a workflow for each of the 5 categorical encodings, fits each one with fit_model(), and returns a dataframe with the results.

fit_model <- function(train, test, workflow, model_type, encoding_type){
  
  set.seed(42)
  folds <- vfold_cv(train, v = 5)
  
  resampled_fit <- 
    workflow %>% 
    fit_resamples(folds,
                  metrics = metric_set(f_meas))
  
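  # Get cross-validated F1 on the training set
  # (f_meas defaults to macro averaging for a multiclass outcome)
  (resampled_fit %>%
    collect_metrics())$mean -> train_perf
  
  # Get out-of-sample F1
  # (micro-averaged here, so it is not directly comparable to train_perf)
  fit <- 
    workflow %>%
    fit(train)
  
  test$pred <- predict(fit, test)$.pred_class
  (f_meas(test, outcome, pred, estimator='micro'))$.estimate -> test_perf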
  
  # Combine in-sample and out-of-sample into a dataframe
  df_perf <- data.frame(model_type = model_type,
                        encoding_type = encoding_type,
                        train_perf = train_perf,
                        test_perf = test_perf)
  return(df_perf)
}


# Given a model, run it across the 5 encodings and return a dataframe that summarizes the results
fit_encodings <- function(model, model_type){
  
  set.seed(42)
  tensorflow::set_random_seed(42)
  
  # One-hot encoded model
  wflow_1hot <- 
    workflow() %>% 
    add_model(model) %>%
    add_recipe(recipe_1hot_with_novel)
  
  fit_model(train %>% select(-id), 
            test %>% select(-id), 
            wflow_1hot, 
            model_type,
            'onehot') -> onehot_model_results
  
  # Label encoded model
  wflow_label <- 
    workflow() %>% 
    add_model(model) %>%
    add_recipe(recipe_label)
  
  fit_model(train %>% select(-id), 
            test %>% select(-id), 
            wflow_label, 
            model_type,
            'label') -> label_model_results
  
  # Frequency encoded model
  wflow_freq <- 
    workflow() %>% 
    add_model(model) %>%
    add_recipe(recipe_freq)
  
  fit_model(train_freq %>% select(-id), 
            test_freq %>% select(-id), 
            wflow_freq, 
            model_type,
            'frequency') -> freq_model_results
  
  # Target encoded model
  wflow_target <- 
    workflow() %>% 
    add_model(model) %>%
    add_recipe(recipe_target)
  
  fit_model(train_te %>% as.data.frame() %>% select(-id), 
            test_te %>% as.data.frame() %>% select(-id), 
            wflow_target, 
            model_type,
            'target') -> target_model_results
  
  
  # Embedding encoded model
  wflow_embedding <- 
    workflow() %>% 
    add_model(model) %>%
    add_recipe(recipe_embedding)
  
  fit_model(train %>% as.data.frame() %>% select(-id), 
            test %>% as.data.frame() %>% select(-id), 
            wflow_embedding, 
            model_type,
            'embedding') -> embedding_model_results
  
  # Compile results into a dataframe
  onehot_model_results %>%
    bind_rows(label_model_results) %>%
    bind_rows(freq_model_results) %>%
    bind_rows(target_model_results) %>%
    bind_rows(embedding_model_results) -> results
  
  results
}
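
One detail worth spelling out: fit_model() scores the test set with micro-averaged F1, while (as far as I know) yardstick’s default for a multiclass outcome is macro averaging. The two can differ noticeably. Below is a rough base-R sketch of both; the helper f1_scores() and the toy labels are made up for illustration and are not yardstick’s implementation.

```r
# Hypothetical helper: macro and micro F1 from vectors of true and predicted labels
f1_scores <- function(truth, pred) {
  classes <- sort(unique(c(truth, pred)))
  tp <- sapply(classes, function(k) sum(pred == k & truth == k))
  fp <- sapply(classes, function(k) sum(pred == k & truth != k))
  fn <- sapply(classes, function(k) sum(pred != k & truth == k))

  per_class <- 2 * tp / (2 * tp + fp + fn)   # F1 for each class
  c(macro = mean(per_class),                 # unweighted average of per-class F1
    micro = 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn)))  # F1 on pooled counts
}

truth <- c("lived", "lived", "lived", "died", "died", "euthanized")
pred  <- c("lived", "lived", "died",  "died", "lived", "euthanized")

scores <- f1_scores(truth, pred)
scores

# For single-label classification, micro F1 equals plain accuracy
accuracy <- mean(truth == pred)
accuracy
```

In this toy example the macro score is higher than the micro score because the small “euthanized” class is predicted perfectly and counts as much as the larger classes.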

I’ll run each of the models using the fit_encodings() and fit_model() functions that I just defined.

fit_encodings(multinom_mod, 'multinomial logistic') -> multinom_results
fit_encodings(ranger_mod, 'random forest') -> rf_results
fit_encodings(xgboost_mod, 'xgboost') -> xgb_results

Model results

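Looking at the results, I can see that the two best models on the test set (random forest and xgboost) used embedding encoding. Embedding did not help the multinomial logistic model, though, where label and frequency encoding scored higher.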

multinom_results %>%
  bind_rows(rf_results) %>%
  bind_rows(xgb_results) -> model_results

model_results %>%
  arrange(desc(test_perf))

             model_type encoding_type train_perf test_perf
1         random forest     embedding  0.6929861 0.7411003
2               xgboost     embedding  0.6605987 0.7346278
3         random forest        onehot  0.6987098 0.7184466
4               xgboost        onehot  0.6767644 0.7055016
5               xgboost     frequency  0.6420888 0.7055016
6               xgboost         label  0.6731541 0.7022654
7         random forest     frequency  0.6880972 0.6990291
8         random forest        target  0.7679964 0.6964981
9         random forest         label  0.6880056 0.6925566
10 multinomial logistic         label  0.6304815 0.6828479
11 multinomial logistic     frequency  0.6343780 0.6796117
12 multinomial logistic     embedding  0.6223733 0.6699029
13 multinomial logistic        target  0.7761116 0.6614786
14              xgboost        target  0.7694992 0.6498054
15 multinomial logistic        onehot  0.5479512 0.6084142

model_results %>%
  mutate(colors = ifelse(encoding_type == 'embedding', '1', '0')) %>%
  ggplot() +
    geom_col(aes(x = model_type, 
                 group = encoding_type, 
                 fill = encoding_type, 
                 y = test_perf,
                 color = colors), position='dodge') +
    scale_y_continuous(limits=c(0.60, 0.75), oob = rescale_none, breaks = seq(0.60, 0.75, 0.01)) +
    labs(title = 'F1 score by model type and categorical encoding method', 
         subtitle = 'The best models used embedding encoding',
         fill = 'Encoding', y = 'F1 score', x = 'Model') +
    scale_color_manual(values=c("white", "black")) + 
    guides(colour = "none") +
    theme_minimal()

#ggsave <- function(..., bg = 'white') ggplot2::ggsave(..., bg = bg)
#ggsave('social-image.png', width=1600, height=900, units='px')