MNIST digit recognition using a convolutional neural net (CNN)

I’ve been working my way through the TensorFlow in Practice Specialization on Coursera, learning how to use neural networks to solve problems like image recognition. I decided to take a break from the course and try applying what I’ve learned so far to one of the Kaggle competitions. MNIST is a database of 70,000 images of handwritten digits; the Kaggle version splits it into 42,000 labeled training images and 28,000 test images, and the goal is to train a model that recognizes the digit in each image.

Solving this problem is something of a rite of passage for a data scientist, so I figured I’d take a crack at it, applying what I’ve learned about Convolutional Neural Networks (CNNs). TL;DR: I was able to achieve 99% out-of-sample prediction accuracy.

The code below implements the neural net using Python and Keras/TensorFlow.

Import / read the data in

# Download data from Kaggle
#!kaggle competitions download digit-recognizer -f test.csv
#!kaggle competitions download digit-recognizer -f train.csv

# Unzip the files
#import zipfile
#with zipfile.ZipFile('train.csv.zip', 'r') as zip_ref:
#    zip_ref.extractall()
#with zipfile.ZipFile('test.csv.zip', 'r') as zip_ref:
#    zip_ref.extractall()
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Preprocessing

First we’ll split the training data into training and validation sets (80/20), stratifying on the label so that each digit is represented in the same proportion in both sets.

x_train, x_validation, y_train, y_validation = train_test_split(train, 
    train['label'], 
    test_size=0.2, 
    random_state=42,
    stratify=train['label'])
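
As a quick sanity check (not part of the original run), we can confirm that stratification kept the digit proportions essentially identical across the two splits:

# Relative frequency of each digit in the training and validation splits
print(y_train.value_counts(normalize=True).sort_index())
print(y_validation.value_counts(normalize=True).sort_index())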

Convert image data to 28x28 arrays

Next we’ll convert the dataframe to a format Keras/TensorFlow can use. The data currently sits in a pandas dataframe, which is common for Kaggle datasets; the training set, for example, has 42,000 rows of 785 columns, where the first column holds the label and the remaining 784 columns hold the pixel values.

We want the labels in one array and the pixel data in a separate array of 28x28 images, so that the training image data ends up with shape (42000, 28, 28). We’ll apply the same transformation to the test data. (A vectorized alternative to the row-by-row loop below is sketched after the shape check.)

print(train.shape)
train.head()
(42000, 785)

label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns

# We'll create a "split" of training, for cross-validation
train_images_split = []
train_labels_split = []
validation_images = []
validation_labels = []

# We'll create a full training set
train_images_full = []
train_labels_full = []

# And the test set
test_images = []

for index, row in x_train.iterrows():
    train_images_split.append(row.values[1 : ].reshape((28, 28)))
    train_labels_split.append(row['label'])
    train_images_full.append(row.values[1 : ].reshape((28, 28)))
    train_labels_full.append(row['label'])
    
for index, row in x_validation.iterrows():
    validation_images.append(row.values[1 : ].reshape((28, 28)))
    validation_labels.append(row['label'])
    train_images_full.append(row.values[1 : ].reshape((28, 28)))
    train_labels_full.append(row['label'])

for index, row in test.iterrows():
    test_images.append(row.values.reshape((28, 28)))
    
# Convert to numpy arrays, scaling pixel values from [0, 255] to [0, 1]
train_labels_split = np.array(train_labels_split)
train_images_split = np.array(train_images_split) / 255.
validation_labels = np.array(validation_labels)
validation_images = np.array(validation_images) / 255.
train_labels_full = np.array(train_labels_full)
train_images_full = np.array(train_images_full) / 255.
test_images = np.array(test_images) / 255.

print(train_labels_full.shape)
print(train_images_full.shape)
print(train_labels_split.shape)
print(train_images_split.shape)
print(validation_labels.shape)
print(validation_images.shape)
print(test_images.shape)
(42000,)
(42000, 28, 28)
(33600,)
(33600, 28, 28)
(8400,)
(8400, 28, 28)
(28000, 28, 28)
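
As an aside, the same transformation can be done without looping over rows. Here’s a minimal vectorized sketch (the _v names are just for illustration; it assumes the same column layout as above):

# Drop the label column, reshape each 784-pixel row to 28x28, and scale to [0, 1]
train_images_split_v = x_train.drop(columns='label').to_numpy().reshape(-1, 28, 28) / 255.
train_labels_split_v = x_train['label'].to_numpy()
validation_images_v = x_validation.drop(columns='label').to_numpy().reshape(-1, 28, 28) / 255.
validation_labels_v = x_validation['label'].to_numpy()
test_images_v = test.to_numpy().reshape(-1, 28, 28) / 255.

This produces the same arrays as the loops above, just without the overhead of iterrows.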

Let’s plot some digits to make sure we did that correctly.

fig, axs = plt.subplots(2, 2)
axs[0, 0].imshow(train_images_full[42], cmap='gray')
axs[0, 1].imshow(train_images_full[7], cmap='gray')
axs[1, 0].imshow(train_images_full[2020], cmap='gray')
axs[1, 1].imshow(train_images_full[0], cmap='gray')
[Figure: four sample digits from the training set]

Convolutional Neural Network (CNN) with Keras/TF

Now we’ll train a simple convolutional neural network using Keras/TensorFlow. We’ll set up a callback to stop training early once the training accuracy reaches 99.5%, and then look at the validation accuracy at that point.

# Callback so we can stop training once we've reached a desired level of training accuracy.
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if logs.get('accuracy') >= 0.995:
            print("\n99.5% accuracy reached, stopping!")
            self.model.stop_training = True

# Add a channel dimension: (N, 28, 28) -> (N, 28, 28, 1), which Conv2D expects
train_images = np.expand_dims(train_images_split, axis=3)
validation_images = np.expand_dims(validation_images, axis=3)
# Model definition
model = tf.keras.models.Sequential([
    # Convolutional layer 1
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Convolutional layer 2
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Final layers
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
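
# Feature-map shapes through this stack (a quick sanity check; Conv2D uses 'valid'
# padding by default, so each 3x3 convolution trims one pixel from each edge):
#   input:              (28, 28, 1)
#   Conv2D(64, 3x3):    (26, 26, 64)
#   MaxPooling2D(2, 2): (13, 13, 64)
#   Conv2D(64, 3x3):    (11, 11, 64)
#   MaxPooling2D(2, 2): (5, 5, 64)
#   Flatten:            1600 features
#   Dense(512) -> Dense(10): one softmax probability per digit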

# Model compiler
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Generators
train_gen = ImageDataGenerator().flow(train_images, train_labels_split, batch_size=32)
validation_gen = ImageDataGenerator().flow(validation_images, validation_labels, batch_size=32)

# Model fit
history = model.fit(train_gen,
                    validation_data = validation_gen,
                    steps_per_epoch = len(train_images) / 32,
                    validation_steps = len(validation_images) / 32,    
                    epochs=10, 
                    callbacks=[myCallback()])
Train for 1050.0 steps, validate for 262.5 steps
Epoch 1/10
1050/1050 [==============================] - 19s 18ms/step - loss: 0.1458 - accuracy: 0.9563 - val_loss: 0.0502 - val_accuracy: 0.9851
Epoch 2/10
1050/1050 [==============================] - 26s 24ms/step - loss: 0.0422 - accuracy: 0.9873 - val_loss: 0.0560 - val_accuracy: 0.9829
Epoch 3/10
1050/1050 [==============================] - 28s 27ms/step - loss: 0.0314 - accuracy: 0.9894 - val_loss: 0.0478 - val_accuracy: 0.9869
Epoch 4/10
1050/1050 [==============================] - 27s 26ms/step - loss: 0.0197 - accuracy: 0.9926 - val_loss: 0.0482 - val_accuracy: 0.9862
Epoch 5/10
1050/1050 [==============================] - 21s 20ms/step - loss: 0.0173 - accuracy: 0.9942 - val_loss: 0.0440 - val_accuracy: 0.9876
Epoch 6/10
1048/1050 [============================>.] - ETA: 0s - loss: 0.0108 - accuracy: 0.9967
99.5% accuracy reached, stopping!
1050/1050 [==============================] - 24s 23ms/step - loss: 0.0108 - accuracy: 0.9967 - val_loss: 0.0494 - val_accuracy: 0.9880

We reached 99.5% training accuracy, and the validation accuracy at that point was close behind at about 98.8%.

We can visualize the performance of the model (training vs. validation) over the epochs.

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'r', label='Training accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.show()
[Figure: training and validation accuracy by epoch]

As the plot shows, the validation accuracy essentially plateaus after the first epoch, hovering between roughly 98.3% and 98.8%, while the training accuracy keeps climbing. That widening gap is a sign of mild overfitting (i.e., the model keeps fitting the training data better without a comparable gain on the validation set) before our callback finally stops training in the sixth epoch.
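
One option for a future run (not used here) would be to stop on validation performance rather than training accuracy, for example with Keras’s built-in EarlyStopping callback. A minimal sketch:

# Hypothetical alternative: stop when validation accuracy stops improving,
# and roll back to the best weights seen so far
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy',
                                              patience=2,
                                              restore_best_weights=True)
# history = model.fit(train_gen, validation_data=validation_gen,
#                     epochs=10, callbacks=[early_stop])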

Re-train with all data

Now that we know the approach works, we’ll continue training the model on all of the available training data (the 80% split plus the 20% we held out for validation). Note that because we reuse the same model object, this run picks up from the weights we just learned rather than starting from scratch; a from-scratch alternative is sketched after the output below.

# Redefine training images and labels, using full sets
train_images = np.expand_dims(train_images_full, axis=3)
train_labels = train_labels_full

# Re-define generator
train_gen = ImageDataGenerator().flow(train_images, train_labels, batch_size=32)

# Model fit
history = model.fit(train_gen,
                    steps_per_epoch = len(train_images) / 32,  
                    epochs=10, 
                    callbacks=[myCallback()])
Train for 1312.5 steps
Epoch 1/10
1313/1312 [==============================] - 28s 21ms/step - loss: 0.0183 - accuracy: 0.9946
Epoch 2/10
1310/1312 [============================>.] - ETA: 0s - loss: 0.0101 - accuracy: 0.9969
99.5% accuracy reached, stopping!
1313/1312 [==============================] - 24s 18ms/step - loss: 0.0101 - accuracy: 0.9969
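
As noted above, this run continued from the earlier weights, which is why it clears 99.5% training accuracy within two epochs. If you wanted a genuinely fresh retrain instead, one option (not done here) is to clone and recompile the architecture, which re-initializes the weights:

# Hypothetical from-scratch retrain: clone_model copies the architecture but
# creates newly initialized weights, so nothing carries over from the earlier run
fresh_model = tf.keras.models.clone_model(model)
fresh_model.compile(optimizer='adam',
                    loss='sparse_categorical_crossentropy',
                    metrics=['accuracy'])
# fresh_model.fit(train_gen, epochs=10, callbacks=[myCallback()])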

Submission

Now that the model is trained, we can generate predictions on the test set and prepare our submission file.

# Add the channel dimension to the test images as well
test_images = np.expand_dims(test_images, axis=3)

# Predicted digit per image: argmax over the softmax outputs (replaces the
# older Sequential.predict_classes, which later TensorFlow releases removed)
preds = np.argmax(model.predict(test_images), axis=1)

# Kaggle expects a 1-based ImageId column and a Label column
my_submission = pd.DataFrame({'ImageId': range(1, len(preds) + 1), 'Label': preds})
my_submission.to_csv("my_submission.csv", index=False)
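
If you use the Kaggle CLI (as in the commented-out download commands at the top), the file can be submitted from the command line as well; something along these lines, assuming the CLI is configured with an API token:

# Submit the predictions to the competition
#!kaggle competitions submit digit-recognizer -f my_submission.csv -m "Simple CNN"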