# Predicting customer churn

Let’s analyze the Telco Customer Churn dataset from Kaggle. The data originally comes from the IBM Sample Data Sets.

If you don’t already know, churn is when customers end their relationship with a company (e.g., by cancelling their subscription to a service). Companies want to retain customers, so understanding and preventing churn is naturally an important goal.

I decided to use this dataset to practice some of my newfound Python skills. Plus, I’ve been working with churn at work recently, so I thought it might be interesting to work on a different dataset – one with an analysis that I can share publicly!

I set out to find a parsimonious model that would predict churn with a high level of accuracy. I used a random forest model and performed a feature selection procedure to reduce the model’s complexity.

import pandas as pd
import numpy as np

df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')


First, some recoding of the variables:

## String to numeric; blank TotalCharges entries (stored as single spaces) become 0
df.TotalCharges = pd.to_numeric(df.TotalCharges.replace(' ', 0), errors='coerce')
df.MonthlyCharges = pd.to_numeric(df.MonthlyCharges, errors='coerce')
df.tenure = pd.to_numeric(df.tenure, errors='coerce')

import warnings
warnings.filterwarnings('ignore') # Suppress warnings


## Train-test split

I’ll split the data with 80/20 train/test while stratifying on the Churn label.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df,
                                                    df.Churn,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df.Churn)


How many churn events were there?

df.Churn.value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64


## Describe the predictors

What predictor variables do I have to work with?

X_train.columns.tolist()

['customerID',
'gender',
'SeniorCitizen',
'Partner',
'Dependents',
'tenure',
'PhoneService',
'MultipleLines',
'InternetService',
'OnlineSecurity',
'OnlineBackup',
'DeviceProtection',
'TechSupport',
'StreamingTV',
'StreamingMovies',
'Contract',
'PaperlessBilling',
'PaymentMethod',
'MonthlyCharges',
'TotalCharges',
'Churn']

X_train.head()


customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
3738 4950-BDEUX Male 0 No No 35 No No phone service DSL No ... Yes No Yes Yes Month-to-month No Electronic check 49.20 1701.65 No
3151 7993-NQLJE Male 0 Yes Yes 15 Yes No Fiber optic Yes ... No No No No Month-to-month No Mailed check 75.10 1151.55 No
4860 7321-ZNSLA Male 0 Yes Yes 13 No No phone service DSL Yes ... No Yes No No Two year No Mailed check 40.55 590.35 No
3867 4922-CVPDX Female 0 Yes No 26 Yes No DSL No ... Yes No Yes Yes Two year Yes Credit card (automatic) 73.50 1905.70 No
3810 2903-YYTBW Male 0 Yes Yes 1 Yes No DSL No ... No No No No Month-to-month No Electronic check 44.55 44.55 No

5 rows × 21 columns

What are the values for each of the columns (other than customerID)?

for c in X_train.drop(['customerID'], axis=1).columns:
    print("---- %s ---" % c)
    print(X_train[c].value_counts())
    print("")

---- gender ---
Male      2833
Female    2801
Name: gender, dtype: int64

---- SeniorCitizen ---
0    4714
1     920
Name: SeniorCitizen, dtype: int64

---- Partner ---
No     2905
Yes    2729
Name: Partner, dtype: int64

---- Dependents ---
No     3955
Yes    1679
Name: Dependents, dtype: int64

---- tenure ---
1     487
72    292
2     187
3     150
71    142
...
21     45
44     43
28     42
36     40
0       8
Name: tenure, Length: 73, dtype: int64

---- PhoneService ---
Yes    5075
No      559
Name: PhoneService, dtype: int64

---- MultipleLines ---
No                  2685
Yes                 2390
No phone service     559
Name: MultipleLines, dtype: int64

---- InternetService ---
Fiber optic    2483
DSL            1937
No             1214
Name: InternetService, dtype: int64

---- OnlineSecurity ---
No                     2797
Yes                    1623
No internet service    1214
Name: OnlineSecurity, dtype: int64

---- OnlineBackup ---
No                     2442
Yes                    1978
No internet service    1214
Name: OnlineBackup, dtype: int64

---- DeviceProtection ---
No                     2472
Yes                    1948
No internet service    1214
Name: DeviceProtection, dtype: int64

---- TechSupport ---
No                     2771
Yes                    1649
No internet service    1214
Name: TechSupport, dtype: int64

---- StreamingTV ---
No                     2226
Yes                    2194
No internet service    1214
Name: StreamingTV, dtype: int64

---- StreamingMovies ---
No                     2217
Yes                    2203
No internet service    1214
Name: StreamingMovies, dtype: int64

---- Contract ---
Month-to-month    3102
Two year          1359
One year          1173
Name: Contract, dtype: int64

---- PaperlessBilling ---
Yes    3331
No     2303
Name: PaperlessBilling, dtype: int64

---- PaymentMethod ---
Electronic check             1891
Mailed check                 1286
Bank transfer (automatic)    1244
Credit card (automatic)      1213
Name: PaymentMethod, dtype: int64

---- MonthlyCharges ---
20.05     50
19.65     37
20.00     37
20.15     36
19.95     36
..
114.05     1
40.25      1
66.20      1
110.70     1
21.45      1
Name: MonthlyCharges, Length: 1489, dtype: int64

---- TotalCharges ---
20.20      9
0.00       8
19.65      7
45.30      7
20.05      6
..
625.05     1
901.25     1
666.00     1
4036.00    1
189.45     1
Name: TotalCharges, Length: 5276, dtype: int64

---- Churn ---
No     4139
Yes    1495
Name: Churn, dtype: int64


Most of these variables are categorical, but a few (TotalCharges, MonthlyCharges, tenure) are continuous.

## Exploratory Data Analysis

### Univariate distributions

import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats


#### TotalCharges

sns.distplot(X_train.TotalCharges)



TotalCharges is probably correlated with tenure and might be redundant if I were to include both in a model. Though I might be able to get an “average monthly” by dividing the two variables, and it might not be the same as MonthlyCharges (which I assume is current-monthly). I noticed some missing values that I had to replace earlier with 0, but there weren’t many. Maybe these are the newest customers (who have no charges yet). Or maybe they’re errors that are meaningful somehow.
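As a quick sketch of that “average monthly” idea (toy numbers below, not the real rows – only the column names match the dataset), we can divide TotalCharges by tenure, guarding against the tenure == 0 customers:

```python
import numpy as np
import pandas as pd

# Hypothetical illustration: derive an "average monthly" charge and
# compare it with the reported MonthlyCharges.
df = pd.DataFrame({
    'tenure':         [35, 15, 1, 0],
    'MonthlyCharges': [49.20, 75.10, 44.55, 20.00],
    'TotalCharges':   [1701.65, 1151.55, 44.55, 0.0],
})

# Guard against division by zero for brand-new customers (tenure == 0).
df['AvgMonthly'] = df['TotalCharges'] / df['tenure'].replace(0, np.nan)

# The two can differ when a customer's service mix (and bill) changed over time.
print(df[['MonthlyCharges', 'AvgMonthly']])
```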

X_train[X_train.tenure == 0]['TotalCharges']

6670    0.0
4380    0.0
3826    0.0
488     0.0
1082    0.0
1340    0.0
6754    0.0
3331    0.0
Name: TotalCharges, dtype: float64


It looks like TotalCharges is indeed 0 for customers with 0 tenure. These customers should be excluded from the analysis, because by definition they haven’t been around long enough to churn.
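A minimal sketch of that exclusion, on a toy frame that only shares column names with the real data:

```python
import pandas as pd

# Toy frame standing in for the Telco data.
df = pd.DataFrame({
    'tenure':       [0, 1, 24, 0],
    'TotalCharges': [0.0, 44.55, 1200.0, 0.0],
    'Churn':        ['No', 'No', 'Yes', 'No'],
})

# Customers with tenure == 0 haven't had a chance to churn, so exclude
# them before modeling; reset_index keeps the row labels contiguous.
df_model = df[df.tenure > 0].reset_index(drop=True)
print(len(df_model))
```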

#### MonthlyCharges

sns.distplot(X_train.MonthlyCharges)



Interesting bi-modal distribution. One group of customers is paying relatively little, while another group is paying much more. I’m sure this reflects the number and type of services customers are subscribed to. Unlike TotalCharges, there were no missing values for MonthlyCharges.

#### tenure

sns.distplot(X_train.tenure)



Another bi-modal distribution. One peak is customers who have been tenured for a very long time; the other peak is customers who joined very recently. I’m going to assume tenure is in the unit of “months”, because this is how services are billed.

### Bivariate

#### Churn by TotalCharges

bins = 30
plt.hist(X_train[X_train.Churn == 'Yes'].TotalCharges,
         bins, alpha=0.5, density=True, label='Churned')
plt.hist(X_train[X_train.Churn == 'No'].TotalCharges,
         bins, alpha=0.5, density=True, label="Didn't Churn")
plt.legend(loc='upper right')
plt.show()


Customers who didn’t churn had higher TotalCharges. No surprise there, since they’ve been customers longer. :)

#### Churn by MonthlyCharges

bins = 30
plt.hist(X_train[X_train.Churn == 'Yes'].MonthlyCharges,
         bins, alpha=0.5, density=True, label='Churned')
plt.hist(X_train[X_train.Churn == 'No'].MonthlyCharges,
         bins, alpha=0.5, density=True, label="Didn't Churn")
plt.legend(loc='upper right')
plt.show()


This is interesting. Customers who churned were concentrated towards the higher end of the MonthlyCharges distribution, while customers who didn’t churn were more likely to be paying the least per month.

#### Churn by tenure

bins = 30
plt.hist(X_train[X_train.Churn == 'Yes'].tenure,
         bins, alpha=0.5, density=True, label='Churned')
plt.hist(X_train[X_train.Churn == 'No'].tenure,
         bins, alpha=0.5, density=True, label="Didn't Churn")
plt.legend(loc='upper right')
plt.show()


The first few months of tenure – between months 1-10 – seem to be critical, as this is when most of the churn is happening. There also seems to be a kind of diverging pattern in the middle. Maybe this is some kind of a survivor effect – members who make it past a certain tenure length are more likely to stick it out for the long-term.

#### Churn by categorical variables

catvars = X_train.columns.tolist()
catvars = [e for e in catvars if e not in ('TotalCharges', 'MonthlyCharges',
                                           'tenure', 'customerID', 'Churn')]

y = 'Churn'
for x in catvars:
    plot = X_train.groupby(x)[y]\
        .value_counts(normalize=True).mul(100)\
        .rename('percent').reset_index()\
        .pipe((sns.catplot, 'data'), x=x, y='percent', hue=y, kind='bar')
    plot.fig.suptitle("Churn by " + x)


The most interesting variables seem to be:

• SeniorCitizen
• Partner
• Dependents
• InternetService
• OnlineSecurity
• OnlineBackup
• DeviceProtection
• TechSupport
• Contract
• PaperlessBilling
• PaymentMethod

There seems to be a latent variable underlying many of these that reflects either tech-savviness, or service benefit maximization – or both. For example, customers with InternetService are less likely to churn if they have some of the other services, like TechSupport, OnlineSecurity, or OnlineBackup. A tech-savvy person would not be paying for these services but would “roll their own”.

I wonder how these “supplemental service” variables are related. Maybe they are redundant.

#### TotalCharges by tenure

np.corrcoef(X_train.TotalCharges, X_train.tenure)

array([[1.        , 0.82969844],
       [0.82969844, 1.        ]])

plt.scatter(X_train.TotalCharges, X_train.tenure)
plt.show()


They are highly correlated, but there is an interesting funneling pattern. I wouldn’t expect them to be perfectly correlated, because TotalCharges won’t grow linearly as a function of tenure if services change over time. Maybe this pattern reflects that noise.

#### Supplemental Internet services

I’ll exclude customers without InternetService because they wouldn’t be eligible.

OnlineBackup vs. OnlineSecurity

x, y = 'OnlineBackup', 'OnlineSecurity'
plot = X_train[X_train.InternetService != 'No'].groupby(x)[y]\
    .value_counts(normalize=True).mul(100)\
    .rename('percent').reset_index()\
    .pipe((sns.catplot, 'data'), x=x, y='percent', hue=y, kind='bar')


TechSupport vs. OnlineSecurity

x, y = 'TechSupport', 'OnlineSecurity'
plot = X_train[X_train.InternetService != 'No'].groupby(x)[y]\
    .value_counts(normalize=True).mul(100)\
    .rename('percent').reset_index()\
    .pipe((sns.catplot, 'data'), x=x, y='percent', hue=y, kind='bar')


StreamingTV vs. StreamingMovies

x, y = 'StreamingMovies', 'StreamingTV'
plot = X_train[X_train.InternetService != 'No'].groupby(x)[y]\
    .value_counts(normalize=True).mul(100)\
    .rename('percent').reset_index()\
    .pipe((sns.catplot, 'data'), x=x, y='percent', hue=y, kind='bar')


DeviceProtection vs. OnlineSecurity

x, y = 'DeviceProtection', 'OnlineSecurity'
plot = X_train[X_train.InternetService != 'No'].groupby(x)[y]\
    .value_counts(normalize=True).mul(100)\
    .rename('percent').reset_index()\
    .pipe((sns.catplot, 'data'), x=x, y='percent', hue=y, kind='bar')


There’s a fair bit of redundancy (especially with streaming TV vs. movies), but they’re not completely redundant. I think there’s enough variance to keep them as independent predictors.
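To put a number on that redundancy, one option (not used in this notebook) is Cramér's V on the pairwise contingency table. A sketch with scipy, using made-up yes/no series rather than the real columns:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V association between two categorical series
    (0 = independent, 1 = perfectly associated)."""
    table = pd.crosstab(a, b)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

# Toy example: two strongly (but not perfectly) overlapping services.
tv = pd.Series(['Yes'] * 40 + ['No'] * 60)
movies = pd.Series(['Yes'] * 35 + ['No'] * 65)
print(round(cramers_v(tv, movies), 2))
```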

## Modeling

Before fitting any model, I will need to transform the categorical variables and remove customerID.

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

le = LabelEncoder()

catvars = X_train.columns.tolist()
catvars = [e for e in catvars if e not in ('TotalCharges', 'MonthlyCharges',
                                           'tenure', 'customerID')]

for x in catvars:
    # Fit on the training split, then reuse the same mapping on the test
    # split so the integer codes are consistent across both.
    le.fit(X_train[x])
    X_train[x] = le.transform(X_train[x])
    X_test[x] = le.transform(X_test[x])

y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

X_train = X_train.drop(['Churn', 'customerID'], axis=1)
X_test = X_test.drop(['Churn', 'customerID'], axis=1)

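One caveat worth flagging: LabelEncoder maps nominal categories (like PaymentMethod) to arbitrary integers, which imposes a spurious ordering. Tree models mostly tolerate this, but an alternative worth knowing is one-hot encoding with pd.get_dummies. A sketch on toy data (only the category labels come from the dataset):

```python
import pandas as pd

# Toy frame with one nominal column (levels as in PaymentMethod above).
df = pd.DataFrame({'PaymentMethod': ['Electronic check', 'Mailed check',
                                     'Credit card (automatic)', 'Electronic check']})

# One 0/1 indicator column per category; no implicit ordering is imposed.
encoded = pd.get_dummies(df, columns=['PaymentMethod'])
print(encoded.columns.tolist())
```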

### Model 1 - Kitchen sink

I’ll start by building a “kitchen sink” random forest model, using RandomizedSearchCV to perform a randomized search over a set of model parameters.

random_grid = {'n_estimators': [int(x) for x in np.linspace(10, 50, num=5)],
               'max_features': ['auto', 'sqrt'],
               'max_depth': [int(x) for x in np.linspace(10, 50, num=10)],
               'min_samples_split': [int(x) for x in np.linspace(2, 11, num=9)],
               'min_samples_leaf': [int(x) for x in np.linspace(2, 11, num=9)],
               'bootstrap': [True, False]}

rf_random = RandomizedSearchCV(estimator=RandomForestClassifier(),
                               param_distributions=random_grid,
                               n_iter=500,
                               cv=3,
                               verbose=2,
                               random_state=42,
                               scoring='accuracy',
                               n_jobs=-1)

rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 500 candidates, totalling 1500 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    7.6s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   17.5s
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed:   32.1s
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed:   50.7s
[Parallel(n_jobs=-1)]: Done 1005 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 1450 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 1500 out of 1500 | elapsed:  1.8min finished

RandomizedSearchCV(cv=3, error_score=nan,
estimator=RandomForestClassifier(bootstrap=True,
ccp_alpha=0.0,
class_weight=None,
criterion='gini',
max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100,
n_jobs...
iid='deprecated', n_iter=500, n_jobs=-1,
param_distributions={'bootstrap': [True, False],
'max_depth': [10, 14, 18, 23, 27, 32,
36, 41, 45, 50],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [2, 3, 4, 5, 6, 7,
8, 9, 11],
'min_samples_split': [2, 3, 4, 5, 6, 7,
8, 9, 11],
'n_estimators': [10, 20, 30, 40, 50]},
pre_dispatch='2*n_jobs', random_state=42, refit=True,
return_train_score=False, scoring='accuracy', verbose=2)


#### Model score

rf_random.best_score_

0.805999290024849


#### Out-of-sample prediction

pred = rf_random.predict(X_test)
errors = abs(pred - y_test)
1 - (sum(errors) / len(X_test.index))

0.7977288857345635


This is pretty good, but the model is not very parsimonious. As expected, some of the features did not add much.
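One more caveat on “accuracy”: only about 27% of customers churn, so accuracy alone can hide poor recall on the churners. A sketch of a fuller evaluation with standard sklearn metrics (toy labels standing in for y_test and the model’s predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Toy labels standing in for y_test and rf_random.predict(X_test).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# accuracy_score matches the manual 1 - mean(|pred - y|) computation above.
print(accuracy_score(y_true, y_pred))
# Per-class precision/recall shows how well the churners are actually caught.
print(classification_report(y_true, y_pred, target_names=['No', 'Yes']))
```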

### Model 2 - Reduce feature level redundancy

There’s some redundancy in the levels of some features, because InternetService is “baked into” the features that depend on it: variables like OnlineSecurity have a “No internet service” level. But if InternetService is already in the model, that level is unnecessary and could skew how the feature levels are weighted. For this model, I’ll recode “No internet service” values to “No”.

df_recode = df.copy()  # copy so the recode doesn't mutate the original frame
int_vars = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
            'TechSupport', 'StreamingTV', 'StreamingMovies']

for x in int_vars:
    df_recode.loc[df_recode[x] == 'No internet service', x] = 'No'

X_train, X_test, y_train, y_test = train_test_split(df_recode,
                                                    df_recode.Churn,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df_recode.Churn)

catvars = X_train.columns.tolist()
catvars = [e for e in catvars if e not in ('TotalCharges', 'MonthlyCharges',
                                           'tenure', 'customerID')]

for x in catvars:
    le.fit(X_train[x])
    X_train[x] = le.transform(X_train[x])
    X_test[x] = le.transform(X_test[x])

y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

X_train = X_train.drop(['Churn', 'customerID'], axis=1)
X_test = X_test.drop(['Churn', 'customerID'], axis=1)

rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 500 candidates, totalling 1500 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  58 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 300 tasks      | elapsed:   19.5s
[Parallel(n_jobs=-1)]: Done 561 tasks      | elapsed:   41.1s
[Parallel(n_jobs=-1)]: Done 844 tasks      | elapsed:   59.1s
[Parallel(n_jobs=-1)]: Done 1209 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 1500 out of 1500 | elapsed:  1.7min finished



#### Out-of-sample prediction

pred = rf_random.predict(X_test)
errors = abs(pred - y_test)
1 - (sum(errors) / len(X_test.index))

0.7963094393186657


The accuracy is essentially unchanged, which is good: the recoding simplified the features without costing any predictive power.

#### Feature importance

features_imp = list(zip(X_train.columns, rf_random.best_estimator_.feature_importances_))
features_imp.sort(key=lambda x: x[1], reverse=True)
features_imp

[('tenure', 0.17480257510722047),
('MonthlyCharges', 0.1685986986317731),
('TotalCharges', 0.1621832404990236),
('Contract', 0.1460694596356221),
('InternetService', 0.0764545302723155),
('PaymentMethod', 0.041984125999528034),
('OnlineSecurity', 0.038104291064396036),
('PaperlessBilling', 0.036661780564273265),
('TechSupport', 0.022116314064718583),
('OnlineBackup', 0.016840414057895),
('MultipleLines', 0.016371002760704433),
('DeviceProtection', 0.016030071580043416),
('StreamingTV', 0.01584585756896422),
('gender', 0.014675035134160228),
('Dependents', 0.01440662583031926),
('SeniorCitizen', 0.011962023431441905),
('Partner', 0.011185987835762707),
('StreamingMovies', 0.010686697048558088),
('PhoneService', 0.005021268913280056)]


### Model 3 - Feature selection by importance

There are still a LOT of features in the model, and some are not very important. One approach to model reduction is to set a threshold for feature importance, and only keep the features above that threshold. For example, the default method in SelectFromModel keeps features with importances greater than the mean of the importances. I’ll try that here.
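That mean-importance rule is what SelectFromModel’s default threshold implements, so it can also be applied directly. A sketch on a synthetic dataset (not the Telco data, and a plain RandomForestClassifier rather than the tuned search above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=42)

# threshold='mean' keeps features whose importance exceeds the mean
# importance -- the same rule as the manual filter used here.
selector = SelectFromModel(RandomForestClassifier(random_state=42),
                           threshold='mean').fit(X, y)
print(selector.transform(X).shape[1], 'of', X.shape[1], 'features kept')
```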

fpd = pd.DataFrame(features_imp)
fpd


0 1
0 tenure 0.174803
1 MonthlyCharges 0.168599
2 TotalCharges 0.162183
3 Contract 0.146069
4 InternetService 0.076455
5 PaymentMethod 0.041984
6 OnlineSecurity 0.038104
7 PaperlessBilling 0.036662
8 TechSupport 0.022116
9 OnlineBackup 0.016840
10 MultipleLines 0.016371
11 DeviceProtection 0.016030
12 StreamingTV 0.015846
13 gender 0.014675
14 Dependents 0.014407
15 SeniorCitizen 0.011962
16 Partner 0.011186
17 StreamingMovies 0.010687
18 PhoneService 0.005021
fpd[fpd[1] > np.mean(fpd[1])]


0 1
0 tenure 0.174803
1 MonthlyCharges 0.168599
2 TotalCharges 0.162183
3 Contract 0.146069
4 InternetService 0.076455

Using the mean as a threshold, the model would be reduced from 19 to 5 features.

features = X_train[fpd[fpd[1] > np.mean(fpd[1])][0].tolist()]
rf_random.fit(features, y_train)

Fitting 3 folds for each of 500 candidates, totalling 1500 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  58 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 300 tasks      | elapsed:   19.4s
[Parallel(n_jobs=-1)]: Done 706 tasks      | elapsed:   47.3s
[Parallel(n_jobs=-1)]: Done 1272 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 1493 out of 1500 | elapsed:  1.7min remaining:    0.5s
[Parallel(n_jobs=-1)]: Done 1500 out of 1500 | elapsed:  1.7min finished



#### Out-of-sample prediction

features = X_test[fpd[fpd[1] > np.mean(fpd[1])][0].tolist()]
pred = rf_random.predict(features)
errors = abs(pred - y_test)
1 - (sum(errors) / len(X_test.index))

0.7906316536550745


Pretty good! We’ve managed to reduce the feature set substantially without losing much accuracy. What we’ve lost in accuracy we’ve gained in interpretability and, likely, generalizability: with just 5 features we can predict who is going to churn with 79.1% accuracy. (For context, about 73% of customers didn’t churn, so a majority-class baseline would already score around 73%.) These 5 features could also be targets for interventions.

One last check… to make sure we didn’t predict churn for anyone with a tenure of 0 months.

tmp = features.reset_index()
no_tenures = tmp[tmp.tenure == 0].index.tolist()
for i in no_tenures:
    print(pred[i])

0
0
0


Nope, all good!