When Kaggle announced its 2019 Data Science Bowl and I saw the headline: “Uncover the factors to help measure how young children learn”, I knew I had to participate.
I’ve been working through this competition in stages which I’ll list below.
Part 1 - I begin by examining the data at a basic level and tackling the challenges that come with its size: because it is “big data”, exploring it on my local machine (which doesn’t have much compute power) means first wrangling it into a special format (a Spark dataframe).
Part 2 - I discover that the dataframe contains unstructured text in the form of JSON, and devise a way to extract this data into a form that is easier to explore.
Part 3 - I begin to explore the data, focusing on the “Game” session type, with the goal of understanding what exactly constitutes a game. In doing this, I discover an edge case in the JSON extraction function developed in the previous part, and spend time fixing it.
Part 4 - I continue to explore the Game data. I try to answer questions about levels, level progression (are higher levels more difficult, and does difficulty climb indefinitely?), and basic game events (e.g., what determines when a game session is “finished”?).
Part 5 - I start to look at the “Assessment” type of game session. I write an algorithm to score Assessments and check my scores against the “ground truth” provided in the competition documents.
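The scoring step in Part 5 can be sketched in code. The competition maps each scored assessment to an “accuracy group” of 0–3 (3 for solving on the first attempt, 2 on the second, 1 on the third or later, 0 for never solving it). My actual notebooks were in R; this is a purely illustrative Python/pandas version, and the toy column names (`game_session`, `correct`) are my own simplification of the real event data:

```python
import pandas as pd

def accuracy_group(correct_attempts: int, incorrect_attempts: int) -> int:
    """Map an assessment's attempt counts to the competition's 0-3 accuracy group."""
    if correct_attempts == 0:
        return 0  # never solved
    if incorrect_attempts == 0:
        return 3  # solved on the first attempt
    if incorrect_attempts == 1:
        return 2  # solved on the second attempt
    return 1      # solved after three or more attempts

# Toy attempt log: one row per assessment attempt (column names are hypothetical)
attempts = pd.DataFrame({
    "game_session": ["a", "a", "b", "c", "c", "c"],
    "correct":      [False, True, True, False, False, False],
})

# Count correct/incorrect attempts per session, then score each session
scores = (
    attempts.groupby("game_session")["correct"]
    .agg(correct=lambda s: int(s.sum()), incorrect=lambda s: int((~s).sum()))
    .apply(lambda row: accuracy_group(row["correct"], row["incorrect"]), axis=1)
)
print(scores.to_dict())  # {'a': 2, 'b': 3, 'c': 0}
```

Checking scores like these against the ground-truth labels is a quick sanity test that the scoring logic matches the competition’s definition.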
At this point I realize that I will need to start working on Kaggle itself because the competition requires notebook-based submissions. So I start to write some Kaggle notebooks. These were some of the notebooks I created:
Simple R median baseline model - I start with a simple model that predicts Assessment performance in the test set using median assessment scores from the training set.
Basic random forest model - I create a baseline random forest model using only the Assessment title as a predictor and obtain similar performance to the median model.
Exploring accuracy on Games - I start to explore the possibility that Game performance might help predict performance on the Assessments and compare the fit of a baseline Random Forest model against one that includes game performance.
Random forest with performance history - I take what I’ve learned so far and apply it to predicting test performance with a Random Forest that includes game performance. This model is a slight improvement over the baseline.
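The median baseline underlying the first notebook is simple enough to sketch. The real notebook is in R; this is an illustrative Python/pandas version with made-up titles and scores, not the competition’s actual schema:

```python
import pandas as pd

# Toy training data: past accuracy groups per assessment title (values hypothetical)
train = pd.DataFrame({
    "title": ["Bird Measurer", "Bird Measurer",
              "Cart Balancer", "Cart Balancer", "Cart Balancer"],
    "accuracy_group": [0, 2, 3, 3, 1],
})

# Median accuracy group per assessment title, rounded back to a valid 0-3 label
medians = train.groupby("title")["accuracy_group"].median().round().astype(int)

# "Predict" for the test set simply by looking up each row's title
test = pd.DataFrame({"title": ["Cart Balancer", "Bird Measurer"]})
test["pred"] = test["title"].map(medians)
print(test["pred"].tolist())  # [3, 1]
```

A lookup table like this makes a useful floor: any model with real features (such as the game-performance history above) should beat it, or the features aren’t pulling their weight.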