Surviving the Titanic with R caret || Machine Learning

- machine learning r caret kaggle

png

Introduction

I’ve been meaning to “upskill” into the world of machine learning for some time. I’m no stranger to statistics. I was trained as a behavioral scientist; I know my way around a linear regression! But I tend to use statistics to test formal hypotheses, and that seems (at least to an outsider like me) pretty different from machine learning. I want to learn new skills that will help me be a better data scientist, and machine learning seemed like a good place to start.

So I decided to try Kaggle’s Titanic competition.

In this competition, the goal is to predict the survival of Titanic passengers whose fates are unknown, using what is known about passengers who survived or perished. You get “training” and “test” data. Your goal is to train a statistical model on the training data so that it generates accurate predictions for the outcome of interest when applied to the test data.

For this competition I created a model trained on 9 features of Titanic passengers. These features involved transformations (e.g., binning) and “engineering” of existing variables (e.g., string extraction and manipulation).

Features

The following features were used in the model.

| Feature | Description |
| --- | --- |
| Titles | Honorifics (e.g., “Mr.” or “Master”), extracted from Name strings |
| Sex | Sex (male or female) |
| Embarked | Port of embarkation |
| Family size, binned | Sum of Parch and SibSp, binned |
| Fare, binned | Binned Fare values |
| Passenger class | Pclass without transformation |
| Age, binned | Binned Age values |
| Ticket prefixes | Grouped ticket prefixes (e.g., “SOTONO”), which presumably represent passengers’ destinations |
| Race / ethnicity | Bayesian prediction of race/ethnicity using passenger surname |

In addition, although these features were not used to train the model per se, they were used later to adjust the predictions:

| Feature | Description |
| --- | --- |
| # family members | Number of family members a passenger has |
| # dead family | Number of family members who died |
| All died (boolean) | Whether all of a passenger’s family members died |

Lessons learned

I learned a lot by tackling this challenge, more than I can possibly list, but here are some of the major lessons.

1. The illusion of (predictive) power

There were a few times when I encountered overfitting. I thought I’d found a significant new feature in the data. When I added it to the model, it increased the in-sample prediction accuracy by several percentage points. But when I applied the model to the test data, the out-of-sample prediction accuracy actually DECREASED. What happened? As it turns out, it’s easy to find features that soak up variance in a training set, but those features might not be useful beyond the data they were trained on. Put simply: they don’t generalize.

Their power is an illusion.

gif

One example that took me by surprise: continuous variables like Age, Fare, and family size performed really well on the training set, but they did not perform so well on the test data. When I binned these variables, they performed slightly worse on the training data, but MUCH better on the test data.
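As an illustrative sketch (not the exact model code, which applies case_when to the real columns later in this post), here is the kind of binning that helped:

```r
library(dplyr)

# Example values standing in for the Titanic Fare column
fares <- data.frame(Fare = c(7.25, 71.28, 8.05, 512.33, 13.00))

# Collapse the continuous variable into coarse buckets; the coarser
# bins trade a little in-sample fit for better generalization
fares <- fares %>%
  mutate(Fare.grps = case_when(
    Fare <= 100 ~ "0-100",
    Fare > 100  ~ ">100"))

fares$Fare.grps  # the 512.33 fare lands in ">100"; the rest in "0-100"
```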

2. What’s in a name, anyway?

Feature engineering was one of my favorite parts of the process. It was there that I could draw on the knowledge I’ve gained as a student of human behavior and psychology. The most trivial-seeming piece of data is in fact a proxy for something incredibly meaningful. You just need to know where to look, and how to manipulate and organize the data.

As it turns out, one of the most interesting variables was Name. From that single variable I was able to extract a Bayesian prediction of the passenger’s race/ethnicity, family relationships (which told me whether a passenger had relatives who also died aboard the ship), and social class (from titles like “Dr”, “Lady”, and “the Countess”).

gif

3. A picture is worth a thousand words

As I was engineering features, I found that data visualization was vital. While I was considering different ways of slicing up the data, visualizations like histograms and bar plots were key to determining if a particular demarcation might be fruitful – if it was giving me any new signals about survival.

During this process, I found that my ability to quickly spin up different kinds of visualizations was also important. Exploratory data analysis is a bit of a stream-of-consciousness process, moving from one shallow thread of thought to another until something clicks or pops out at you.

gif

4. Unconventional is sometimes better

I learned that it’s important to experiment with approaches that make logical sense but seem unconventional from a tooling or procedural standpoint.

One of the most interesting variables was the number of family members who died. If a passenger had family members who died, they were more likely to die themselves. Although it was clear from the data visualizations that this feature was significant, I couldn’t improve the model’s predictive accuracy by simply plugging it into the model like all of the other variables. (In fact, doing so was detrimental – in one model variant, out-of-sample accuracy was reduced to ~40% – ouch!)

gif

But I persisted. Rather than train a model on this feature like I did for all the other features, I added a post-processing step in which the feature was used to flip the predictions for some of the passengers who were predicted to survive. Namely, if a passenger was predicted to survive, and if they had more than one family member, and if all of their family members died, then it seemed reasonable to predict that they also died.

This turned out to be a useful application of the feature, boosting the model’s out-of-sample accuracy by ~1%.
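A minimal sketch of that post-processing rule, using hypothetical predictions (the real version, operating on the model’s output, appears in the code at the end of this post):

```r
library(dplyr)

# Hypothetical predictions joined with per-family statistics
# (column names mirror those used later in this post)
preds <- data.frame(
  Survived = c(1, 1, 0),
  n.fam    = c(3, 1, 2),  # family members sharing the surname
  all.died = c(1, 1, 0))  # 1 if every family member died

# Flip a "survived" prediction to "died" when the passenger's
# entire family (of more than one member) perished
preds <- preds %>%
  mutate(Survived = case_when(
    all.died == 1 & n.fam > 1 & Survived == 1 ~ 0,
    TRUE ~ Survived))

preds$Survived  # only the first passenger is flipped to 0
```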

Model performance

So how did I do?

Using the random forest method, I achieved an in-sample prediction accuracy of 83% (kappa = 0.63) and an out-of-sample prediction accuracy of 82%. At the time of this writing, that puts my model in the top 6% of all Kaggle submissions.

Not bad! I think I’m getting the hang of this. 😁🤝🎉

png



Below is the code for my model-building procedure, along with some of the visualizations I examined along the way to check assumptions and generate insights. Enjoy!

Loading packages

library("tidyverse")
library("caret")
library("ranger") # Faster RF modeling
library("wru") # Bayesian prediction of ethnicity

# Setup parallel processing
suppressWarnings(suppressMessages(library("parallel")))
suppressWarnings(suppressMessages(library("doParallel")))
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)

# Custom function
`%not in%` <- function (x, table) is.na(match(x, table, nomatch=NA_integer_))

# Random seed
set.seed(43287)

# Figure height
options(repr.plot.width=6, repr.plot.height=4)
# Load data
test.csv <- read.csv('test.csv', stringsAsFactors = FALSE)
train.csv <- read.csv('train.csv', stringsAsFactors = FALSE)
train.csv$Survived <- as.factor(train.csv$Survived)
train <- train.csv
test <- test.csv

# Impute missing ages in Test data
test.csv %>% select(-Ticket, -Name, -Cabin, -Embarked) -> test.ages
pre.proc <- preProcess(test.ages, method = "bagImpute")
test.ages <- predict(pre.proc, test.ages)
test$Age <- test.ages$Age

# Impute missing ages in Training data
train.csv %>% select(-Ticket, -Name, -Cabin, -Embarked) -> train.ages
pre.proc <- preProcess(train.ages, method = "bagImpute")
train.ages <- predict(pre.proc, train.ages)
train$Age <- train.ages$Age
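As a quick aside, here is how the %not in% helper defined above behaves (an illustrative check, not part of the pipeline):

```r
# Same definition as above: TRUE when x is absent from table
`%not in%` <- function(x, table) is.na(match(x, table, nomatch = NA_integer_))

c("Mr", "Rev") %not in% c("Mr", "Ms", "Mrs", "Miss", "Master")
# returns FALSE TRUE ("Rev" is absent from the list)
```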

Features

Honorifics (titles)

train %>%
  rowwise %>%
  mutate(title = gsub('(.*, )|(\\..*)', '', Name)) -> train
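To see what that regex does, here it is applied to a sample name in the same format as the Name column:

```r
# The first alternative strips everything up to (and including) ", ";
# the second strips everything from the first "." onward, leaving
# only the honorific between the two
name <- "Braund, Mr. Owen Harris"
gsub('(.*, )|(\\..*)', '', name)
# returns "Mr"
```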

Give unusual titles (e.g., “the Countess” or “Capt”) a special status. These titles are associated with survival among women. I think this suggests we’re on the right track.

# Titles
train %>%
  ggplot(., aes(title, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by titles")

# Special titles
train %>%
  filter(title %not in% c("Mr", "Ms", "Mrs", "Miss", "Master")) %>%
  ggplot(., aes(title, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by special titles by sex") + facet_wrap(~ Sex)

png

png

As I understand it, “Master” is a title given to young boys, while young girls are often simply called “Miss”. One person with a “Dr” title is a woman, and so has a higher likelihood of survival than her male counterparts; we’ll reclassify her accordingly.

train %>%
  mutate(child = ifelse(Age < 15, 1, 0),
         title.class = case_when(
            title %in% c("Master") ~ "Boy",
            title %in% c("Miss") & child == 1 ~ "Girl",
            title %in% c("Mrs") ~ "Woman",
            title %in% c("Miss", "Mme", "Mlle", "Ms") ~ "Lady",
            title %in% c("Dr") & Sex == "female" ~ "Woman",
            title %in% c("Mr") ~ "Man",
            title %in% c("Capt", "Don", "Col", "Major", "Dr", "Rev", 
                         "Dona", "Jonkheer", "the Countess", "Lady", "Sir") ~ "Special",
            TRUE ~ "Other")) -> train

train %>% ungroup %>% count(title.class)

train %>%
  ggplot(., aes(x=title.class, fill=Survived)) + 
  geom_bar() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by title and sex") + facet_wrap(~ Sex)

train %>%
  ggplot(., aes(x=title.class, fill=Survived)) + 
  geom_bar() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by title and Pclass") + facet_wrap(~ Pclass)
title.class     n
Boy            40
Girl           44
Lady          142
Man           517
Special        22
Woman         126

png

png

Family size

Combine SibSp and Parch into a single “family size” variable.

train %>%
  mutate(family.size = SibSp + Parch,
         family.size.simple = case_when(
            family.size == 0 ~ "0",
            between(family.size, 1, 3) ~ "1-3",
            family.size > 3 ~ ">3")) -> train
train %>%
  ggplot(., aes(family.size)) +
  geom_histogram(bins=30)

train %>% ungroup %>% count(family.size.simple)

train %>%
  ggplot(., aes(family.size, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by family size and sex") + facet_wrap(~ Sex)

train %>%
  ggplot(., aes(family.size.simple, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by family size and sex") + facet_wrap(~ Sex)

train %>%
  ggplot(., aes(family.size.simple, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by family size and Pclass") + facet_wrap(~ Pclass)
family.size.simple     n
>3                    62
0                    537
1-3                  292

png

png

png

png

Fare

Group fares into reasonably small buckets.

train %>%
  ggplot(., aes(Fare)) +
  geom_histogram(bins=30)

train %>%
  rowwise %>%
  mutate(Fare.grps = case_when(
            between(Fare, 0, 100) ~ "0-100",
            Fare > 100 ~ ">100")) -> train

train %>%
  ggplot(., aes(Fare.grps, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by fare groups and class") + facet_wrap(~ Pclass)

train %>% ungroup %>% count(Fare.grps)

png

Fare.grps     n
>100         53
0-100       838

png

Passenger class

train %>%
  ggplot(., aes(Pclass, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by passenger class")

png

Age

Group ages into reasonably small buckets.

train %>%
  ggplot(., aes(Age)) +
  geom_histogram(bins=30)

train %>%
  mutate(Age.grps = case_when(
            between(Age, 0, 12) ~ "young",
            between(Age, 12, 50) ~ "middle",
            Age > 50 ~ "old")) -> train

train %>% ungroup %>% count(Age.grps)

train %>%
  ggplot(., aes(Age.grps, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Age groups survival by Pclass") + facet_wrap(~ Pclass)

train %>%
  ggplot(., aes(Age.grps, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Age groups survival by Sex") + facet_wrap(~ Sex)

train %>%
  ggplot(., aes(Age.grps, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Age groups survival by Sex and Pclass") + facet_wrap(~ Sex*Pclass)
Age.grps     n
middle     744
old         65
young       82

png

png

png

png

Ticket prefix

Extract the prefix from the ticket, since this seems to represent the passenger’s final destination. Combine codes that are similar or appear to contain typos.

train %>%
    rowwise %>%
    mutate(ticket.prefix = ifelse(grepl(" ", Ticket), sub(" .*", "", Ticket), ""),
           ticket.prefix = toupper(gsub("(\\.)|(\\/)", "", ticket.prefix)),
           ticket.prefix = case_when(
            ticket.prefix == "STONO" ~ "SOTONO",
            ticket.prefix == "STONO2" ~ "SOTONO",
            ticket.prefix == "SOTONOQ" ~ "SOTONO",
            ticket.prefix == "SOTONO2" ~ "SOTONO",
            ticket.prefix == "SCAH" ~ "SCA",
            ticket.prefix == "SCA4" ~ "SCA",
            ticket.prefix == "PPP" ~ "PP",
            ticket.prefix == "CASOTON" ~ "CA",
            ticket.prefix == "AS" ~ "A",
            ticket.prefix == "A4" ~ "A",
            ticket.prefix == "A5" ~ "A",
            ticket.prefix == "SOP" ~ "SOPP",
            ticket.prefix == "SWPP" ~ "",
            ticket.prefix == "FA" ~ "",
            TRUE ~ ticket.prefix)) -> train

train %>%
  ggplot(., aes(ticket.prefix, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by ticket prefix") +
  coord_flip()

png

Race from surname

Predict the passenger’s race/ethnicity from their surname. Since there wasn’t a lot of diversity aboard the Titanic, we’ll collapse the predictions into “White” and “Non-White” categories.

train %>%
    rowwise %>%
    mutate(surname = gsub('( .*)|(*,*)', '', Name)) -> train

predict_race(train, surname.only = T) -> train

train %>%
  rowwise %>%
  mutate(race = case_when(
            pred.whi == max(pred.whi, pred.bla, pred.his, pred.asi, pred.oth) ~ "White",
            TRUE ~ "Non-White"
          )) -> train

train %>%
  ggplot(., aes(race, fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by race and sex") + facet_wrap(~ Sex)
[1] "Proceeding with surname-only predictions..."


Warning message in merge_surnames(voter.file):
"Probabilities were imputed for 183 surnames that could not be matched to Census list."

png

Dead relatives

You are more likely to die if other members of your family died. It looks like some passengers share a surname but embarked at different ports. It seems reasonable to assume that passengers who embarked at different ports were not actually family, so we’ll tag surnames with the port of embarkation before calculating the number of dead family members.

train %>%
  select(-Embarked) %>%
  rowwise %>%
  cbind(Embarked = train.csv$Embarked) %>%
  mutate(surname.E = paste0(Embarked,".",surname)) -> train

train %>%
  ungroup %>%
  group_by(surname.E) %>%
  summarise(n.fam = n(),
            dead.family = sum(1 - as.numeric(as.character(Survived))),
            all.died = ifelse(dead.family/n.fam == 1, 1, 0)) -> dead.family.df

train %>%
  left_join(., dead.family.df, by = 'surname.E') -> train
train %>%
  ggplot(., aes(factor(all.died), fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by dead relatives (boolean)") + facet_wrap(~ Sex)

train %>%
  ggplot(., aes(factor(all.died), fill=Survived)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Survival by dead relatives (boolean)") + facet_wrap(~ Sex*Pclass)

png png

Models

Random forest

trctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10,
  allowParallel = TRUE)

tgctrl <- expand.grid(
    .mtry = 7,
    .splitrule = "gini",
    .min.node.size = 10)

fit.rf <- train(Survived ~ 
                    Sex + 
                    title.class + 
                    family.size.simple + 
                    Fare.grps + 
                    Pclass + 
                    Age.grps + 
                    Embarked + 
                    ticket.prefix + 
                    race,
             data = train, 
             trControl = trctrl,
             metric = "Accuracy",
             importance = "impurity",
             tuneGrid = tgctrl,
             num.trees = 2000,
             method = "ranger")
fit.rf
Random Forest 

891 samples
  9 predictor
  2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 802, 802, 801, 802, 802, 802, ... 
Resampling results:

  Accuracy   Kappa    
  0.8295113  0.6252777

Tuning parameter 'mtry' was held constant at a value of 7
Tuning parameter 'splitrule' was held constant at a value of gini
Tuning parameter 'min.node.size' was held constant at a value of 10

Predict!

Apply transforms to test data

test %>% 
  mutate(Fare = ifelse(PassengerId == 1044, 
                       median((test %>% filter(!is.na(Fare), 
                                               Pclass == 3, 
                                               PassengerId != 1044))$Fare), Fare)) -> test

test %>%
  rowwise %>%
  mutate(title = gsub('(.*, )|(\\..*)', '', Name),
         child = ifelse(Age < 15, 1, 0),
         title.class = case_when(
            title %in% c("Master") ~ "Boy",
            title %in% c("Miss") & child == 1 ~ "Girl",
            title %in% c("Mrs") ~ "Woman",
            title %in% c("Miss", "Mme", "Mlle", "Ms") ~ "Lady",
            title %in% c("Dr") & Sex == "female" ~ "Woman",
            title %in% c("Mr") ~ "Man",
            title %in% c("Capt", "Don", "Col", "Major", "Dr", "Rev", 
                         "Dona", "Jonkheer", "the Countess", "Lady", "Sir") ~ "Special",
            TRUE ~ "Other"),
         family.size = SibSp + Parch,
         family.size.simple = case_when(
            family.size == 0 ~ "0",
            between(family.size, 1, 3) ~ "1-3",
            family.size > 3 ~ ">3"
          ),
         Fare.grps = case_when(
            between(Fare, 0, 100) ~ "0-100",
            Fare > 100 ~ ">100"
          ),
         Age.grps = case_when(
            between(Age, 0, 12) ~ "young",
            between(Age, 12, 50) ~ "middle",
            Age > 50 ~ "old"
          ),
         ticket.prefix = ifelse(grepl(" ", Ticket), sub(" .*", "", Ticket), ""),
         ticket.prefix = toupper(gsub("(\\.)|(\\/)", "", ticket.prefix)),
           ticket.prefix = case_when(
            ticket.prefix == "STONO" ~ "SOTONO",
            ticket.prefix == "STONOQ" ~ "SOTONO",
            ticket.prefix == "STONO2" ~ "SOTONO",
            ticket.prefix == "SOTONOQ" ~ "SOTONO",
            ticket.prefix == "SOTONO2" ~ "SOTONO",
            ticket.prefix == "SCAH" ~ "SCA",
            ticket.prefix == "SCA4" ~ "SCA",
            ticket.prefix == "SCA3" ~ "SCA",
            ticket.prefix == "PPP" ~ "PP",
            ticket.prefix == "CASOTON" ~ "CA",
            ticket.prefix == "AS" ~ "A",
            ticket.prefix == "A4" ~ "A",
            ticket.prefix == "A5" ~ "A",
            ticket.prefix == "SOP" ~ "SOPP",
            ticket.prefix == "SWPP" ~ "",
            ticket.prefix == "FA" ~ "",
            ticket.prefix == "AQ3" ~ "",
            ticket.prefix == "AQ4" ~ "",
            ticket.prefix == "LP" ~ "",
            TRUE ~ ticket.prefix),
         surname = gsub('( .*)|(*,*)', '', Name),
         surname.E = paste0(Embarked, ".", surname)) -> test.final

# Ethnicity
predict_race(test.final, surname.only = T) -> test.final

test.final %>%
  rowwise %>%
  mutate(race = case_when(
            pred.whi == max(pred.whi, pred.bla, pred.his, pred.asi, pred.oth) ~ "White",
            TRUE ~ "Non-White")) -> test.final

# Dead family members
test.final %>%
  left_join(., dead.family.df, by = 'surname.E') -> test.final

Predict using random forest

Survived <- predict(fit.rf, test.final)
Survived <- as.numeric(as.character(Survived))

test.final %>%
    cbind(., Survived) -> test.pred

test.pred %>%
    mutate(Survived = case_when(
                          is.na(n.fam) ~ Survived,
                          all.died == 1 & n.fam > 1 & Survived == 1 ~ 0,
                          TRUE ~ Survived
                        )
           ) -> test.pred

pred.rf <- data.frame(
  cbind(
    PassengerId = as.integer(as.character(test.final$PassengerId)),
    Survived = as.integer(as.character(test.pred$Survived))
    )
  )        

write_csv(pred.rf, 'rf.output.csv')
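One housekeeping note on the parallel setup: the cluster registered at the top of the script is never shut down. caret’s parallel-processing documentation recommends explicitly stopping the cluster and re-registering the sequential backend once training is done, along these lines (a self-contained sketch):

```r
library(parallel)
library(doParallel)

cluster <- makeCluster(2)
registerDoParallel(cluster)

# ... train() calls would run here, using the parallel backend ...

# Shut down the workers and re-register the sequential backend
stopCluster(cluster)
registerDoSEQ()
```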