# Predicting survival on the Titanic (with Python!)

I wanted to try out some of what I’ve learned about Python for data science, so I thought: why not take on the Kaggle Titanic challenge? I previously tackled this challenge using R. Here I build logistic regression and random forest models, after doing some wrangling and feature engineering.

import pandas as pd
import numpy as np
import zipfile

z = zipfile.ZipFile('titanic.zip')
train = pd.read_csv(z.open('train.csv'))  # the Kaggle archive contains train.csv and test.csv
test = pd.read_csv(z.open('test.csv'))

train.describe()


PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
train.head()


PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The Survived classes are unbalanced, so I should use stratification for the split later.

train['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
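Stratification keeps that 549:342 ratio intact on both sides of the split. A quick sketch on synthetic labels with the same imbalance (not the actual data frame):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same 549:342 imbalance as Survived
y = np.array([0] * 549 + [1] * 342)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature, just to have an X

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Both partitions keep roughly the overall positive rate (~0.384)
print(round(y.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```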


Some of the columns have many missing values.

train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


# Logistic regression classifier

When modeling, it’s best to keep things simple: fewer features, less pre-processing, and a simpler model. For the first classifier, I’ll use logistic regression.

### Model 1

#### Subset the data

Make a dataset containing the columns that need the least pre-processing.

df = train.drop(['PassengerId', 'Survived', 'Cabin', 'Age', 'Name', 'Ticket'], axis=1)


Make dummy variables, dropping the first to avoid redundancy.

df = pd.get_dummies(df, drop_first=True)

df.head()


Pclass SibSp Parch Fare Sex_male Embarked_Q Embarked_S
0 3 1 0 7.2500 1 0 1
1 1 1 0 71.2833 0 0 0
2 3 0 0 7.9250 0 0 1
3 1 1 0 53.1000 0 0 1
4 3 0 0 8.0500 1 0 1

#### Train-test split

Make an 80/20 train-test split with stratification.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)

print(X_train.shape, X_test.shape,
      y_train.shape, y_test.shape)

(712, 7) (179, 7) (712,) (179,)


#### Logistic regression pipeline

Logistic regression pipeline with the StandardScaler as a pre-processing step.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[('scaler', StandardScaler()),
('logistic', LogisticRegression())])


Fit and score

fit = pipe.fit(X_train, y_train)
fit.score(X_test, y_test)

0.8044692737430168


Score with cross-validation

from sklearn.model_selection import cross_val_score

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.776984126984127


#### Coefficient performance

Not all predictors contributed equally: the coefficients for Embarked_Q and Parch are close to zero, making them relatively poor predictors.

list(zip(df.columns, fit.named_steps['logistic'].coef_[0]))

[('Pclass', -0.7255791345702128),
('SibSp', -0.1629967238390473),
('Parch', -0.06060899725052046),
('Fare', 0.11742734439032),
('Sex_male', -1.2741670004200136),
('Embarked_Q', 0.03622955414064328),
('Embarked_S', -0.20179422191456092)]
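Since the features were standardized, each coefficient is the change in log-odds per standard deviation of that feature; exponentiating turns them into odds ratios, which can be easier to read (values copied from the output above):

```python
import math

# Coefficients from the fitted model above (log-odds per SD of each
# scaled feature); exp() converts them to odds ratios.
coefs = {'Pclass': -0.7256, 'Sex_male': -1.2742, 'Fare': 0.1174}
odds_ratios = {k: round(math.exp(v), 3) for k, v in coefs.items()}
print(odds_ratios)
```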


### Model 2 - Adding age

Next let’s try adding Age. Because Age contains missing values, the feature will need to be transformed or imputed.

#### Age with median imputation

from sklearn.impute import SimpleImputer
pipe = Pipeline(steps=[('imputer', SimpleImputer(strategy = 'median')),
('scaler', StandardScaler()),
('logistic', LogisticRegression())])

df = train.drop(['PassengerId', 'Survived', 'Cabin', 'Name', 'Ticket'], axis=1)
df = pd.get_dummies(df, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.7660317460317461


The model performed about the same, maybe a little worse in CV scoring.

#### Age with KNN imputation

Maybe median imputation was too simple. Let’s try a more complex imputer – K Nearest Neighbors imputation.

from sklearn.impute import KNNImputer
pipe = Pipeline(steps=[('imputer', KNNImputer()),
('scaler', StandardScaler()),
('logistic', LogisticRegression())])

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.7660317460317461


The CV results are the same.

#### Age as a missing-ness indicator

Maybe what’s useful about Age is not its value but its missing-ness. I’ll encode this and see whether it adds predictive value.

df = train.drop(['PassengerId', 'Survived', 'Cabin', 'Name', 'Ticket'], axis=1)

df['Age'] = 1 * np.isnan(df['Age'])
df = pd.get_dummies(df, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)

df['Age'].value_counts()

0    714
1    177
Name: Age, dtype: int64


Remove imputation from the pipeline.

pipe = Pipeline(steps=[('scaler', StandardScaler()),
('logistic', LogisticRegression())])

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.7715873015873015


Better than imputation, but worse than the first model without Age.

#### Age as adult vs. child

Maybe flagging passengers aged 21 or under is useful. It could matter because of the way “women and children” (and particularly “marriage-able” women) were treated in this time period.

df = train.drop(['PassengerId', 'Survived', 'Cabin', 'Name', 'Ticket'], axis=1)

df['Age'] = 1 * (df['Age'] <= 21)
df = pd.get_dummies(df, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)

df['Age'].value_counts()

0    687
1    204
Name: Age, dtype: int64

pipe = Pipeline(steps=[('scaler', StandardScaler()),
('logistic', LogisticRegression())])

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.7826984126984127


Now we’re seeing some improvement, suggesting Age is worth keeping in the model.

### Model 3 - Family

The current model contains Parch (number of parents/children aboard) and SibSp (number of siblings/spouses aboard). But it’s not clear whether these are useful as separate predictors or whether they can be combined into a single “number of family members aboard” variable.

#### Combined family

df = train.drop(['PassengerId', 'Survived', 'Cabin', 'Name', 'Ticket'], axis=1)

df['Age'] = 1 * (df['Age'] <= 21)
df['Family'] = df['SibSp'] + df['Parch']
df = df.drop(['SibSp', 'Parch'], axis=1)
df = pd.get_dummies(df, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.7882539682539683


This improved the model slightly.

pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
errors = abs(pred - y_test)
1 - (sum(errors) / 179)

0.7877094972067039


Out-of-sample error was similar.
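The manual error arithmetic above is just accuracy in disguise: for 0/1 labels, |pred − true| is 1 exactly on the misclassified rows. sklearn’s accuracy_score (or the pipeline’s own .score) computes the same quantity, shown here on toy labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# |pred - true| is 1 exactly on the misclassified rows, so this is accuracy
manual = 1 - (np.sum(np.abs(y_pred - y_true)) / len(y_true))
print(manual, accuracy_score(y_true, y_pred))  # 0.8 0.8
```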

#### Family vs. no family

Interestingly, the counts show that a majority of passengers had no family on board. I wonder whether family is more useful as a boolean: has family vs. no family.

df['Family'].value_counts()

0     537
1     161
2     102
3      29
5      22
4      15
6      12
10      7
7       6
Name: Family, dtype: int64

df = train.drop(['PassengerId', 'Survived', 'Cabin', 'Name', 'Ticket'], axis=1)

df['Age'] = 1 * (df['Age'] <= 21)
df['Family'] = 1 * ((df['SibSp'] + df['Parch']) == 0)
df = df.drop(['SibSp', 'Parch'], axis=1)
df = pd.get_dummies(df, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.7492063492063492


This made the model much worse, so I guess not.

### Model 4 - Adding name features

Next we’ll try extracting information from passenger names.

#### Special persons

train.Name.head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object


Let’s extract the last names and honorifics.

split1 = train["Name"].str.split(', ', n = 1, expand = True)


0 1
0 Braund Mr. Owen Harris
1 Cumings Mrs. John Bradley (Florence Briggs Thayer)
2 Heikkinen Miss. Laina
3 Futrelle Mrs. Jacques Heath (Lily May Peel)
4 Allen Mr. William Henry
split2 = split1[1].str.split('.', n = 1, expand = True)


0 1
0 Mr Owen Harris
1 Mrs John Bradley (Florence Briggs Thayer)
2 Miss Laina
3 Mrs Jacques Heath (Lily May Peel)
4 Mr William Henry
train['LastName'] = split1[0]
train['Honorific'] = split2[0]
train['OtherNames'] = split2[1]

train.Honorific.value_counts()

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Col               2
Major             2
Mlle              2
Capt              1
Mme               1
Jonkheer          1
Don               1
the Countess      1
Sir               1
Ms                1
Name: Honorific, dtype: int64


Some of these are “special classes” of people. We’ll classify anyone other than Mr, Miss, Mrs, Master, and Ms as a “special person”.

df = train.drop(['PassengerId', 'Survived', 'Cabin', 'Name', 'Ticket'], axis=1)

df['SpecialPerson'] = 1 * df['Honorific'].isin(['Dr', 'Rev', 'Mlle', 'Major', 'Col', 'Don', 'Jonkheer', 'Lady', 'the Countess', 'Sir', 'Mme', 'Capt'])
df['Age'] = 1 * (df['Age'] <= 21)
df['Family'] = 1 * ((df['SibSp'] + df['Parch']) == 0)

df = df.drop(['SibSp', 'Parch', 'LastName', 'OtherNames', 'Honorific'], axis=1)
df = pd.get_dummies(df, drop_first=True)

df['SpecialPerson'].value_counts()

0    865
1     26
Name: SpecialPerson, dtype: int64

X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.7379365079365079


Much worse. It could be because the base-rate is so low.

#### Single woman

df = train.drop(['PassengerId', 'Survived', 'Cabin', 'Name', 'Ticket'], axis=1)

df['SingleWoman'] = 1 * df['Honorific'].isin(['Miss', 'Mlle'])
df['Age'] = 1 * (df['Age'] <= 21)
df['Family'] = 1 * ((df['SibSp'] + df['Parch']) == 0)

df = df.drop(['SibSp', 'Parch', 'LastName', 'OtherNames', 'Honorific'], axis=1)
df = pd.get_dummies(df, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.7547619047619049


This also made the model worse.

### Model 5 - Cabin

#### Cabin letter

As I understand it, Cabin begins with a letter that corresponds to how far below deck the passenger’s room was: deck A was closest to the top, while deck G was the lowest. This variable also has many missing values, which might themselves be informative. We’ll assign U to the missing values.

train.Cabin.value_counts().head()

G6             4
C23 C25 C27    4
B96 B98        4
D              3
E101           3
Name: Cabin, dtype: int64

train.Cabin.str[:1].fillna('U').value_counts()

U    687
C     59
B     47
D     33
E     32
A     15
F     13
G      4
T      1
Name: Cabin, dtype: int64

df = train.drop(['PassengerId', 'Survived', 'Name', 'Ticket', 'LastName', 'OtherNames', 'Honorific'], axis=1)

df['Age'] = 1 * (df['Age'] <= 21)
df['Family'] = 1 * ((df['SibSp'] + df['Parch']) == 0)
df['CabinLetter'] = df['Cabin'].str[:1].fillna('U')

df = df.drop(['SibSp', 'Parch', 'Cabin'], axis=1)
df = pd.get_dummies(df, drop_first=True)

df.head()


Pclass Age Fare Family Sex_male Embarked_Q Embarked_S CabinLetter_B CabinLetter_C CabinLetter_D CabinLetter_E CabinLetter_F CabinLetter_G CabinLetter_T CabinLetter_U
0 3 0 7.2500 0 1 0 1 0 0 0 0 0 0 0 1
1 1 0 71.2833 0 0 0 0 0 1 0 0 0 0 0 0
2 3 0 7.9250 1 0 0 1 0 0 0 0 0 0 0 1
3 1 0 53.1000 0 0 0 1 0 1 0 0 0 0 0 0
4 3 0 8.0500 1 1 0 1 0 0 0 0 0 0 0 1
X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.7320634920634921


#### Cabin as a missing-ness indicator

I’ll try a simpler version that only encodes whether Cabin is missing.

df = train.drop(['PassengerId', 'Survived', 'Name', 'Ticket', 'LastName', 'OtherNames', 'Honorific'], axis=1)

df['Age'] = 1 * (df['Age'] <= 21)
df['Family'] = 1 * ((df['SibSp'] + df['Parch']) == 0)
df['CabinMissing'] = 1 * (df['Cabin'].str[:1].fillna('U') == 'U')

df = df.drop(['SibSp', 'Parch', 'Cabin'], axis=1)
df = pd.get_dummies(df, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)

cv_results = cross_val_score(pipe,
X_test,
y_test,
cv=5,
scoring='accuracy')

cv_results.mean()

0.7434920634920635


The model is still worse.

## Final model

Now to train a predictor using all of the training data, before applying it to the test set.

train_features = train.drop(['PassengerId', 'Survived', 'Cabin', 'Name', 'Ticket', 'Honorific', 'LastName', 'OtherNames'], axis=1)

train_features['Age'] = 1 * (train_features['Age'] <= 21)
train_features['Family'] = train_features['SibSp'] + train_features['Parch']
train_features = train_features.drop(['SibSp', 'Parch'], axis=1)
train_features = pd.get_dummies(train_features, drop_first=True)

pipe = Pipeline(steps=[('scaler', StandardScaler()),
('logistic', LogisticRegression())])

pipe.fit(train_features, train.Survived)

Pipeline(memory=None,
steps=[('scaler',
StandardScaler(copy=True, with_mean=True, with_std=True)),
('logistic',
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None,
penalty='l2', random_state=None,
solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False))],
verbose=False)

train_features


Pclass Age Fare Family Sex_male Embarked_Q Embarked_S
0 3 0 7.2500 1 1 0 1
1 1 0 71.2833 1 0 0 0
2 3 0 7.9250 0 0 0 1
3 1 0 53.1000 1 0 0 1
4 3 0 8.0500 0 1 0 1
... ... ... ... ... ... ... ...
886 2 0 13.0000 0 1 0 1
887 1 1 30.0000 0 0 0 1
888 3 0 23.4500 3 0 0 1
889 1 0 30.0000 0 1 0 0
890 3 0 7.7500 0 1 1 0

891 rows × 7 columns

test_features = test.drop(['PassengerId', 'Cabin', 'Name', 'Ticket'], axis=1)

test_features['Age'] = 1 * (test_features['Age'] <= 21)
test_features['Family'] = test_features['SibSp'] + test_features['Parch']
test_features = test_features.drop(['SibSp', 'Parch'], axis=1)
test_features = pd.get_dummies(test_features, drop_first=True)


Check for NaN values.

test_features[test_features.isna().any(axis=1)]


Pclass Age Fare Family Sex_male Embarked_Q Embarked_S
152 3 0 NaN 0 1 0 1

Fill in NaN value with median.

test_features.Fare = test_features.Fare.fillna(np.nanmedian(test_features.Fare))
test_features[test_features.isna().any(axis=1)]


Pclass Age Fare Family Sex_male Embarked_Q Embarked_S
preds = pipe.predict(test_features)

d = {'PassengerId': test.PassengerId, 'Survived': preds}
submission = pd.DataFrame(data=d)
submission.to_csv("submission_lr.csv", index=False)


# Random forest classifier

I have a feeling that some of the features I tested with the logistic regression classifier did not add predictive power because interactions between terms were not accounted for. A tree-based classifier can capture some of these interactions, which may improve accuracy. Another advantage is that trees are less sensitive to how categorical features are encoded: a simple integer (label) encoding can stand in for one-hot / dummy encoding, since trees split on thresholds rather than fitting a linear weight per level. This means I can simplify the model with less feature-engineering guess-work.
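A minimal sketch of the interaction point, on synthetic XOR-style data (not the Titanic set): when the label is a pure interaction of two features, no linear boundary separates the classes, but a shallow tree can split on both features in turn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Pure interaction: the label is XOR of two binary features, so neither
# feature helps on its own and no linear boundary separates the classes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 2))
y = X[:, 0] ^ X[:, 1]

lr_acc = LogisticRegression().fit(X, y).score(X, y)
tree_acc = DecisionTreeClassifier(max_depth=2).fit(X, y).score(X, y)
print(lr_acc, tree_acc)  # the depth-2 tree fits perfectly; the linear model can't
```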

#### Features

I’ll start by collecting together all of the features that intuition tells me might be of some use.

split1 = train["Name"].str.split(', ', n = 1, expand = True)
split2 = split1[1].str.split('.', n = 1, expand = True)
train['Honorific'] = split2[0]

df = train.drop(['PassengerId', 'Survived', 'Ticket', 'Name'], axis=1)

df['Child'] = 1 * (df['Age'] <= 21)
df['CabinLetter'] = df['Cabin'].str[:1].fillna('U')

df = df.drop(['SibSp', 'Parch', 'Cabin', 'Age'], axis=1)


#### Label encoding

I’ll need to encode the string features to integers using LabelEncoder.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

le.fit(df.CabinLetter)
df['CabinLetterLE'] = le.transform(df.CabinLetter)

le.fit(df.Sex)
df['SexLE'] = le.transform(df.Sex)

le.fit(df.Honorific)
df['HonorificLE'] = le.transform(df.Honorific)

df.Embarked = df.Embarked.fillna('S') ## Mode imputation
le.fit(df.Embarked)
df['EmbarkedLE'] = le.transform(df.Embarked)

df = df.drop(['CabinLetter', 'Sex', 'Honorific', 'Embarked'], axis=1)

df.head()


Pclass Fare Child CabinLetterLE SexLE HonorificLE EmbarkedLE
0 3 7.2500 0 8 1 11 2
1 1 71.2833 0 2 0 12 0
2 3 7.9250 0 8 0 8 2
3 1 53.1000 0 2 0 12 2
4 3 8.0500 0 8 1 11 2
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df,
train.Survived,
test_size=0.2,
random_state=42,
stratify=train.Survived)


#### Model tuning

I’ll prepare a RandomizedSearchCV to find the best set of model parameters.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

random_grid = {'n_estimators': [int(x) for x in np.linspace(10, 50, num = 5)],
'max_features': ['auto','sqrt'],
'max_depth': [int(x) for x in np.linspace(10, 50, num = 10)],
'min_samples_split': [int(x) for x in np.linspace(2, 11, num = 9)],
'min_samples_leaf': [int(x) for x in np.linspace(2, 11, num = 9)],
'bootstrap': [True, False]}
random_grid

{'n_estimators': [10, 20, 30, 40, 50],
'max_features': ['auto', 'sqrt'],
'max_depth': [10, 14, 18, 23, 27, 32, 36, 41, 45, 50],
'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 11],
'min_samples_leaf': [2, 3, 4, 5, 6, 7, 8, 9, 11],
'bootstrap': [True, False]}

rf_random = RandomizedSearchCV(estimator = RandomForestClassifier(),
param_distributions = random_grid,
n_iter = 1000,
cv = 3,
verbose=2,
random_state=42,
scoring='accuracy',
n_jobs = -1)

rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 1000 candidates, totalling 3000 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    8.6s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   13.9s
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:   28.1s
[Parallel(n_jobs=-1)]: Done 1764 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 2826 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 3000 out of 3000 | elapsed:  1.7min finished

RandomizedSearchCV(cv=3, error_score=nan,
estimator=RandomForestClassifier(bootstrap=True,
ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100,
n_jobs...
iid='deprecated', n_iter=1000, n_jobs=-1,
param_distributions={'bootstrap': [True, False],
'max_depth': [10, 14, 18, 23, 27, 32,
36, 41, 45, 50],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [2, 3, 4, 5, 6, 7,
8, 9, 11],
'min_samples_split': [2, 3, 4, 5, 6, 7,
8, 9, 11],
'n_estimators': [10, 20, 30, 40, 50]},
pre_dispatch='2*n_jobs', random_state=42, refit=True,
return_train_score=False, scoring='accuracy', verbose=2)

rf_random.best_score_

0.8286707087898452


The best score here is better than it was with the logistic regression classifier.

list(zip(df.columns, rf_random.best_estimator_.feature_importances_))

[('Pclass', 0.12286637673196617),
('Fare', 0.2619939060630186),
('Child', 0.020292294810431686),
('CabinLetterLE', 0.09029715750382582),
('SexLE', 0.3481062329792616),
('HonorificLE', 0.1126016235990436),
('EmbarkedLE', 0.0438424083124526)]


All of the features were at least a little important.
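Sorting the importances (values copied from the output above) makes the ranking easier to scan; by construction they sum to 1:

```python
# Feature importances copied from the fitted forest above; sorting
# shows SexLE and Fare dominate.
imp = {'Pclass': 0.1229, 'Fare': 0.2620, 'Child': 0.0203,
       'CabinLetterLE': 0.0903, 'SexLE': 0.3481,
       'HonorificLE': 0.1126, 'EmbarkedLE': 0.0438}

ranked = sorted(imp.items(), key=lambda kv: kv[1], reverse=True)
for name, value in ranked:
    print(f'{name:15s} {value:.4f}')
```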

pred = rf_random.predict(X_test)
errors = abs(pred - y_test)
1 - (sum(errors) / 179)

0.782122905027933


The out-of-sample prediction is about the same as logistic regression model 3.

## Final model

Now to train a predictor using all of the training data, before applying it to the test set.

train_features = train.drop(['PassengerId', 'Survived', 'Ticket', 'Name'], axis=1)
train_labels = train.Survived

train_features['Child'] = 1 * (train_features['Age'] <= 21)
train_features['CabinLetter'] = train_features['Cabin'].str[:1].fillna('U')

train_features = train_features.drop(['SibSp', 'Parch', 'Cabin', 'Age'], axis=1)

## Label encoder
le.fit(train_features.CabinLetter)
train_features['CabinLetterLE'] = le.transform(train_features.CabinLetter)

le.fit(train_features.Sex)
train_features['SexLE'] = le.transform(train_features.Sex)

le.fit(train_features.Honorific)
train_features['HonorificLE'] = le.transform(train_features.Honorific)

train_features.Embarked = train_features.Embarked.fillna('S') ## Mode imputation
le.fit(train_features.Embarked)
train_features['EmbarkedLE'] = le.transform(train_features.Embarked)

train_features = train_features.drop(['CabinLetter', 'Sex', 'Honorific', 'Embarked'], axis=1)

rf_random = RandomizedSearchCV(estimator = RandomForestClassifier(),
param_distributions = random_grid,
n_iter = 1000,
cv = 3,
verbose=2,
random_state=42,
scoring='accuracy',
n_jobs = -1)

rf_random.fit(train_features, train_labels)

test_features = test.copy()

## Generate Honorific feature
split1 = test["Name"].str.split(', ', n = 1, expand = True)
split2 = split1[1].str.split('.', n = 1, expand = True)
test_features['Honorific'] = split2[0]

test_features['Child'] = 1 * (test_features['Age'] <= 21)
test_features['CabinLetter'] = test_features['Cabin'].str[:1].fillna('U')

test_features = test_features.drop(['SibSp', 'Parch', 'Cabin', 'Age', 'PassengerId', 'Ticket', 'Name'], axis=1)

## Label encoder
le.fit(test_features.CabinLetter)
test_features['CabinLetterLE'] = le.transform(test_features.CabinLetter)

le.fit(test_features.Sex)
test_features['SexLE'] = le.transform(test_features.Sex)

le.fit(test_features.Honorific)
test_features['HonorificLE'] = le.transform(test_features.Honorific)

test_features.Embarked = test_features.Embarked.fillna('S') ## Mode imputation
le.fit(test_features.Embarked)
test_features['EmbarkedLE'] = le.transform(test_features.Embarked)

test_features = test_features.drop(['CabinLetter', 'Sex', 'Honorific', 'Embarked'], axis=1)
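One caveat with the cell above: re-fitting LabelEncoder on the test set can assign a category a different integer than it had during training whenever a training category is absent from the test data (here, for example, the test set has no cabin letter T). A safer pattern, sketched on toy values, is to fit the encoder on the training column and reuse it:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_col = pd.Series(['A', 'B', 'T', 'U'])   # training categories
test_col = pd.Series(['A', 'U'])              # test set is missing 'T'

# Fit once on train, transform both: 'U' keeps its training code (3).
enc = LabelEncoder().fit(train_col)
print(list(enc.transform(test_col)))  # [0, 3]

# Re-fitting on test alone silently renumbers: 'U' becomes 1.
print(list(LabelEncoder().fit_transform(test_col)))  # [0, 1]
```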


Check for NaN values.

test_features[test_features.isna().any(axis=1)]


Pclass Fare Child CabinLetterLE SexLE HonorificLE EmbarkedLE
152 3 NaN 0 7 1 5 2

Fill in NaN value with median.

test_features.Fare = test_features.Fare.fillna(np.nanmedian(test_features.Fare))
test_features[test_features.isna().any(axis=1)]


Pclass Fare Child CabinLetterLE SexLE HonorificLE EmbarkedLE
preds = rf_random.predict(test_features)

d = {'PassengerId': test.PassengerId, 'Survived': preds}
submission = pd.DataFrame(data=d)

submission.to_csv("submission_rf.csv", index=False)