Using a logistic regression model to predict heart disease

I decided to explore and model the Heart Disease UCI dataset from Kaggle; the original data come from the UCI Machine Learning Repository. The dataset contains 303 individuals and 14 attributes (the original source data contains additional features). The features include various heart-disease-related measurements, like chest pain and resting ECG, as well as age and sex, and represent a mix of binary, categorical, ordinal, and numeric data. The outcome variable indicates the presence or absence of heart disease.

I applied a fairly standard machine learning process: first understanding what the data represent (looking at the raw data and reading the data dictionary and other documentation), then splitting the data into training and test sets, performing exploratory data analysis to understand how the features relate to the outcome, encoding the features, fitting a logistic regression model, and evaluating the model.

Ultimately, the model was 75% accurate in predicting heart disease, correctly classifying 46 out of the 61 test cases. Of the 15 errors it made, 5 were false positives and 10 were false negatives.

Imports

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import kaggle

# Download data from Kaggle
# !kaggle datasets download ronitf/heart-disease-uci -f heart.csv -p heart_uci
df = pd.read_csv('heart.csv')
df.head()
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1
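
As a quick sanity check (a minimal sketch), we can confirm the expected dimensions and look for missing values before going further:

# Expect 303 rows and 14 columns; count any missing values
print(df.shape)
print(df.isna().sum().sum())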

When I started to explore the data, I noticed that many of the parameters I would expect, from my lay knowledge of heart disease, to be positively correlated with the outcome actually pointed in the opposite direction. After reading through some comments in the Kaggle discussion forum, I discovered that others had come to the same conclusion: the target variable was reversed.

So here I flip it back to how it should be (1 = heart disease; 0 = no heart disease).

df.target = df.target.replace({0:1, 1:0})
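
As a quick sanity check on the flip (a minimal sketch), the correlations between a few numeric features and the target should now point in the clinically expected directions:

# With the corrected coding, ST depression [oldpeak] and vessel count [ca]
# should correlate positively with disease, and max heart rate [thalach] negatively
df[['age', 'oldpeak', 'ca', 'thalach', 'target']].corr()['target']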

I’ll check the target classes to see how balanced they are.

df.target.value_counts()
0    165
1    138
Name: target, dtype: int64

Not particularly imbalanced, but we’ll stratify on the target in the train-test split later anyway.
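
The same counts as proportions (a minimal sketch) make the balance easier to read:

# Roughly a 54/46 split between no-disease (0) and disease (1)
df.target.value_counts(normalize=True)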

Feature descriptions

First I’ll put the features into groups based on the type of data they contain. To make these groupings, I used the data dictionary along with some helpful comments in Kaggle notebooks and discussions. Some of the documentation turned out to be incorrect, as I found in my own examination of the data and confirmed in the Kaggle discussions. For example, the target was coded incorrectly (originally, 0 meant “had heart disease”), and many of the categorical variable labels were missing.

The groups are binary, categorical, ordinal, numeric, and the target.

I’ll create lists so the features are easier to access later by type.

bins = ['sex', 'fbs', 'exang']
cats = ['cp', 'restecg', 'slope', 'thal']
ords = ['ca']
nums = ['age', 'oldpeak', 'trestbps', 'chol', 'thalach']
target = ['target']

I’ll recode the categorical variables so the data exploration is easier.

df.cp = df.cp.replace({0:'Asympt.', 1:'Atypical', 2:'Non', 3:'Typical'})
df.restecg = df.restecg.replace({0:'LV hyper', 1:'Normal', 2:'ST-T wave'})
df.slope = df.slope.replace({0:'down', 1:'up', 2:'flat'})
df.thal = df.thal.replace({0:'NA', 1:'Fixed', 2:'Normal', 3:'Revers.'})

Train-test split

Before exploring the data, I’ll split them into train and test sets.

X_train, X_test, y_train, y_test = train_test_split(df,
                                                    df.target,
                                                    test_size = 0.2,
                                                    random_state = 42,
                                                    stratify = df.target)
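
Because we stratified on the target, the class proportions should be nearly identical in the train and test sets; a quick check (a minimal sketch):

# Stratification should preserve the ~54/46 class split in both sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))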

Exploratory data analysis

Numeric and ordinal descriptives

X_train[X_train.target == 0].drop(cats + bins + target,
                                  axis=1).describe().loc[['mean', 'std']]
            age    trestbps        chol     thalach   oldpeak       ca
mean  52.598485  129.000000  247.340909  158.280303  0.585606  0.30303
std    9.366058   16.414944   56.661824   18.929082  0.759282  0.73036
X_train[X_train.target == 1].drop(cats + bins + target,
                                  axis=1).describe().loc[['mean', 'std']]
            age    trestbps        chol     thalach   oldpeak        ca
mean  56.554545  134.236364  252.190909  139.718182  1.616364  1.090909
std    8.085082   18.436067   49.041547   21.377400  1.314669  1.018593

Here we see that individuals with heart disease are, on average, older (56.6 vs. 52.6 years), with higher resting blood pressure (134 vs. 129), higher cholesterol (252 vs. 247), a lower maximum heart rate (140 vs. 158), greater ST depression [oldpeak] (1.6 vs. 0.6), and more major vessels colored by fluoroscopy [ca] (1.1 vs. 0.3).
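
The same comparison can be produced in a single call by grouping on the target (a minimal sketch):

# Mean and SD of each numeric and ordinal feature, split by outcome
X_train.groupby('target')[nums + ords].agg(['mean', 'std'])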

Categorical and boolean counts

Categorical

fig = plt.figure(figsize=(8, 6))
fig.subplots_adjust(hspace=0.4, wspace=0.4, bottom=0.01, top=0.95)

for i, var in enumerate(cats, start=1):
    ax = fig.add_subplot(2, 2, i)
    sns.countplot(data=X_train, x=var, hue='target', ax=ax)
    # Rotate tick labels after plotting; before, there are no labels to rotate
    ax.tick_params(axis='x', labelrotation=45)

plt.show()
Relationship between categorical variables and outcome.

Here we see how the distribution of each categorical feature differs between individuals with and without heart disease.

Combining sparse classes

Some of the categorical feature classes are sparse, and in some cases it also makes conceptual sense to collapse them. For example, the resting ECG test [restecg] has values that essentially correspond to “normal” and “abnormal”. The “ST-T wave” class is relatively rare, but it behaves similarly to “LV hyper”, so this is a case where it makes sense to collapse the two abnormal classes into one.
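
Before collapsing, a crosstab (a minimal sketch) can confirm both the sparsity and the similar relationship to the outcome:

# Counts of each restecg class by outcome; 'ST-T wave' is rare
pd.crosstab(X_train.restecg, X_train.target)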

df.restecg = df.restecg.replace({'Normal':0, 'LV hyper':1, 'ST-T wave':1})
df.thal = df.thal.replace({'NA':0, 'Normal':0, 'Fixed': 1, 'Revers.': 1})
X_train, X_test, y_train, y_test = train_test_split(df,
                                                    df.target,
                                                    test_size = 0.2,
                                                    random_state = 42,
                                                    stratify = df.target)

We’ll also move the features whose classes we just collapsed into the binary feature list, and remove them from the categorical feature list.

bins = ['sex', 'fbs', 'exang', 'thal', 'restecg']
cats = ['cp', 'slope']

Binary

fig = plt.figure(figsize=(8, 6))
fig.subplots_adjust(hspace=0.4, wspace=0.4, bottom=0.01, top=0.95)

for i, var in enumerate(bins, start=1):
    ax = fig.add_subplot(2, 3, i)
    sns.countplot(data=X_train, x=var, hue='target', ax=ax)

plt.show()
Relationship between binary variables and outcome.

Here we see how the distribution of each binary feature differs between individuals with and without heart disease.

Feature encoding

Next, we’ll need to encode the variables. For the numeric variables [nums] we’ll use a StandardScaler. This centers and scales each variable so that it has a mean of 0 and an SD of 1; each value then represents a standardized difference from the mean. For the categorical variables [cats], we’ll use one-hot encoding. For a feature with N distinct values, this creates N new binary columns, each encoding the presence or absence of one value (OneHotEncoder can drop one redundant column per feature with drop='first', but the default keeps all N).

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
clt = make_column_transformer(
    (StandardScaler(), nums),
    (OneHotEncoder(), cats)
)

clt.fit(X_train)
X_train_transformed = clt.transform(X_train)
X_test_transformed = clt.transform(X_test)
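
Note that make_column_transformer drops unlisted columns by default (remainder='drop'), so only the numeric and categorical features reach the model. Inspecting the transformed output (a minimal sketch, assuming scikit-learn ≥ 1.0 for get_feature_names_out) confirms what the model actually sees:

# Only nums and cats survive the default remainder='drop';
# the binary features and 'ca' are not passed to the model
print(X_train_transformed.shape)
print(clt.get_feature_names_out())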

Logistic regression

Side note: When I tried fitting a logistic regression model to the data before standardizing the numeric variables in the previous step, it would not converge. Scaling also matters beyond convergence: scikit-learn’s LogisticRegression applies an L2 penalty by default, and regularization acts unevenly on coefficients when features are on very different scales.
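
If scaling weren’t an option, raising the solver’s iteration cap is the usual workaround (a minimal sketch; the value is illustrative), though it wouldn’t address the uneven effect of the penalty:

from sklearn.linear_model import LogisticRegression

# More lbfgs iterations can reach convergence on unscaled features,
# but the L2 penalty would still act unevenly across feature scales
lr_unscaled = LogisticRegression(max_iter=5000)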

Fit the model

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train_transformed, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Score the model

lr.score(X_test_transformed, y_test)
0.7540983606557377

75% is not terrible, but not great either. A common rule of thumb treats 80% and above as acceptable, though of course it depends on the application, and here the specific kinds of errors may matter more than raw accuracy. We fell short of that rule of thumb, but not by a huge margin.
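
Because the two kinds of errors matter differently here, per-class precision and recall (a minimal sketch) are more informative than accuracy alone:

from sklearn.metrics import classification_report

# Recall on class 1 shows how many true disease cases we actually catch
print(classification_report(y_test, lr.predict(X_test_transformed)))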

Confusion matrix

from sklearn.metrics import confusion_matrix
y_pred = lr.predict(X_test_transformed)
cm = confusion_matrix(y_test, y_pred)  # avoid shadowing the imported function
cm
array([[28,  5],
       [10, 18]])

Here we can see where the mistakes were made. We made 46 (28 + 18) correct predictions and 15 (10 + 5) incorrect predictions: 5 false positives and 10 false negatives. Depending on how we plan to use this model, one type of error might matter more than the other. For example, we may want to minimize false negatives, because we don’t want to risk missing a diagnosis and losing the chance to treat it proactively.
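
If we did want fewer false negatives, one simple lever is the decision threshold: classify as diseased anyone whose predicted probability exceeds a cutoff below the default 0.5. A minimal sketch (the 0.3 threshold is illustrative, not tuned):

# Lowering the threshold catches more true positives at the cost of
# more false alarms; 0.3 is an illustrative, untuned value
proba = lr.predict_proba(X_test_transformed)[:, 1]
y_pred_low = (proba >= 0.3).astype(int)
confusion_matrix(y_test, y_pred_low)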