2019 Data Science Bowl - Part 1


In this series, I’ll share my progress in the Kaggle 2019 Data Science Bowl competition. I knew I had to give this competition a shot when I saw the headline: “Uncover the factors to help measure how young children learn”.

In Part 1, I’ll go over reading in the data and dealing with some of the challenges of “big data” and wrangling it into a format that can be more easily handled on a less powerful machine like mine.

Part 1 - Reading in the data

I’ve already gone to the data page for the 2019 Data Science Bowl competition and downloaded the data as a zip file. So now I’ll load the data into memory.

I saw on the data page that the total size of the data files is 438MB (compressed). This tells me that I will probably run into memory problems when loading or wrangling the data, because tabular data is usually highly compressible (so it’s probably MUCH bigger than 438MB).
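
As a quick sanity check, R can list the uncompressed sizes inside the archive without extracting it. A small sketch (assuming the zip sits in data/ under the name Kaggle gives it):

# 'Length' is each file's uncompressed size in bytes
contents <- unzip('data/data-science-bowl-2019.zip', list = TRUE)
sum(contents$Length) / 1024^3  # total uncompressed size, in GB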

Before starting, I’ll load the tidyverse packages, because I know I’ll be using them.

load_quietly <- function(package) { # Quietly load a package given its bare (unquoted) name
  pkg <- deparse(substitute(package)) # convert the unquoted name to a string
  suppressWarnings(suppressMessages(library(pkg, character.only = TRUE)))
}
load_quietly(tidyverse)
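
(The deparse(substitute(...)) dance is what lets me call load_quietly(tidyverse) without quoting the name; library() then needs character.only = TRUE to accept the resulting string.)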

I unzipped the files to data/. Next I’ll list the files and file sizes so I can see what I’m dealing with.

files <- list.files('data')
files_df <- data.frame(files = files, size = NA)
for (i in seq_along(files)){ # record a human-readable size for each file
  file.size(paste0('data/', files[i])) %>%
    utils:::format.object_size(., "auto") -> size
  files_df[i,]$size <- size
}
files_df
##                        files     size
## 1 data-science-bowl-2019.zip 437.6 Mb
## 2      sample_submission.csv  10.8 Kb
## 3                  specs.csv 399.3 Kb
## 4                   test.csv 379.9 Mb
## 5                  train.csv   3.6 Gb
## 6           train_labels.csv   1.1 Mb

Yowza! The training data file is 3.6GB. My machine only has 8GB of RAM total and… (checks task manager as we speak) only 1.7GB of that is available. So that means reading the data into memory for analysis is maybe not the best option.
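
(To check this from inside R rather than the task manager: on Windows, memory.size() and memory.limit() report the session’s current usage and cap in MB, though both are Windows-only and were removed in R 4.2.)

memory.size()   # MB currently used by this R session (Windows-only)
memory.limit()  # MB available to the session (Windows-only; defunct in R >= 4.2)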

Instead of reading the data into memory all at once, I can use sparklyr, an R interface for Apache Spark, which is a framework built for “big data” situations like this.

Fortunately I’ve already installed the dependencies for this because I was using it in a Coursera course project.
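
For anyone starting from scratch, the one-time setup is roughly this (a sketch; Spark version choices are up to you):

install.packages('sparklyr')   # install the R package from CRAN
sparklyr::spark_install()      # download and install a local Spark distribution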

load_quietly(sparklyr)
sc <- spark_connect(master = "local")
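
Since memory is tight on this machine, it might also be worth capping the local driver’s memory explicitly when connecting. A minimal sketch, where the '2g' value is my assumption rather than something tuned:

config <- spark_config()
config$`sparklyr.shell.driver-memory` <- '2g'  # cap the driver JVM's memory (assumed value)
sc <- spark_connect(master = 'local', config = config)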

Once I have the connection (sc) I can load the csv files into Spark.

train <- spark_read_csv(sc, 'data/train.csv')

The nice thing about sparklyr is that all of the tidyverse syntax and dplyr verbs that I’m familiar with will work with these dataframes:

train %>% head()
## # Source: spark<?> [?? x 11]
##   event_id game_session timestamp           event_data installation_id
##   <chr>    <chr>        <dttm>              <chr>      <chr>          
## 1 27253bdc 45bb1e1b6b5~ 2019-09-06 17:53:46 "\"{\"\"e~ " \"\"event_co~
## 2 27253bdc 17eeb7f2236~ 2019-09-06 17:54:17 "\"{\"\"e~ " \"\"event_co~
## 3 77261ab5 0848ef14a8d~ 2019-09-06 17:54:56 "\"{\"\"v~ "\"\"event_cou~
## 4 b2dba42b 0848ef14a8d~ 2019-09-06 17:54:56 "\"{\"\"d~ " fill up your~
## 5 1bb5fbdb 0848ef14a8d~ 2019-09-06 17:55:03 "\"{\"\"d~ " fill up your~
## 6 1325467d 0848ef14a8d~ 2019-09-06 17:55:06 "\"{\"\"c~ "\"\"y\"\":605"
## # ... with 6 more variables: event_count <chr>, event_code <chr>,
## #   game_time <chr>, title <chr>, type <chr>, world <chr>

Okay, looking at some of these rows and comparing them to the data viewer on Kaggle for train.csv, I can already tell that something went wrong. It looks like the event_data column contains JSON, and when the file was read into Spark the embedded commas caused that column to break across the other columns. This is probably because the default delimiter is ,.

Let’s read a few lines of the file in its raw form to see if maybe it’s using a different delimiter.

readLines("data/train.csv", 2)
## [1] "event_id,game_session,timestamp,event_data,installation_id,event_count,event_code,game_time,title,type,world"                                                 
## [2] "27253bdc,45bb1e1b6b50c07b,2019-09-06T17:53:46.937Z,\"{\"\"event_code\"\": 2000, \"\"event_count\"\": 1}\",0001e90f,1,2000,0,Welcome to Lost Lagoon!,Clip,NONE"

Yikes. Okay, so it looks like a comma is indeed being used as the delimiter for the csv. In that case, maybe I can read all columns except event_data using the columns argument, and then read event_data by itself afterwards? Ah! From the docs: columns is for renaming columns, not for specifying which ones to read.
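
(An aside: the doubled quotes in the raw line above suggest the file escapes embedded quotes by doubling them, while Spark defaults to backslash escaping. spark_read_csv() does expose quote and escape arguments, so a sketch like the following might also have parsed it correctly, though I haven’t tested it:)

train <- spark_read_csv(sc, 'data/train.csv', escape = '"')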

So it looks like I’ll have to modify the file before reading it into sparklyr. I’ll use the data.table::fread function to load it into memory, and then save it back to my filesystem with a different (unambiguous) delimiter.

load_quietly(data.table)
train <- fread('data/train.csv')
train %>% head()
##    event_id     game_session                timestamp
## 1: 27253bdc 45bb1e1b6b50c07b 2019-09-06T17:53:46.937Z
## 2: 27253bdc 17eeb7f223665f53 2019-09-06T17:54:17.519Z
## 3: 77261ab5 0848ef14a8dc6892 2019-09-06T17:54:56.302Z
## 4: b2dba42b 0848ef14a8dc6892 2019-09-06T17:54:56.387Z
## 5: 1bb5fbdb 0848ef14a8dc6892 2019-09-06T17:55:03.253Z
## 6: 1325467d 0848ef14a8dc6892 2019-09-06T17:55:06.279Z
##                                                                                                                                                                                                                                                                                                                      event_data
## 1:                                                                                                                                                                                                                                                                                   {""event_code"": 2000, ""event_count"": 1}
## 2:                                                                                                                                                                                                                                                                                   {""event_code"": 2000, ""event_count"": 1}
## 3:                                                                                                                                                                                                                                                  {""version"":""1.0"",""event_count"":1,""game_time"":0,""event_code"":2000}
## 4: {""description"":""Let's build a sandcastle! First, fill up your mold with sand! You can use the shovel here. The mold gives the sand its shape!"",""identifier"":""Dot_LetsSandcastle,Dot_FillMold,Dot_MoldShape"",""media_type"":""audio"",""total_duration"":6758,""event_count"":2,""game_time"":53,""event_code"":3010}
## 5:     {""description"":""Let's build a sandcastle! First, fill up your mold with sand! You can use the shovel here. The mold gives the sand its shape!"",""identifier"":""Dot_LetsSandcastle,Dot_FillMold,Dot_MoldShape"",""media_type"":""audio"",""duration"":6919,""event_count"":3,""game_time"":6972,""event_code"":3110}
## 6:                                                                                                                                                                                   {""coordinates"":{""x"":583,""y"":605,""stage_width"":1015,""stage_height"":762},""event_count"":4,""game_time"":9991,""event_code"":4070}
##    installation_id event_count event_code game_time
## 1:        0001e90f           1       2000         0
## 2:        0001e90f           1       2000         0
## 3:        0001e90f           1       2000         0
## 4:        0001e90f           2       3010        53
## 5:        0001e90f           3       3110      6972
## 6:        0001e90f           4       4070      9991
##                            title     type     world
## 1:       Welcome to Lost Lagoon!     Clip      NONE
## 2:          Magma Peak - Level 1     Clip MAGMAPEAK
## 3: Sandcastle Builder (Activity) Activity MAGMAPEAK
## 4: Sandcastle Builder (Activity) Activity MAGMAPEAK
## 5: Sandcastle Builder (Activity) Activity MAGMAPEAK
## 6: Sandcastle Builder (Activity) Activity MAGMAPEAK

Okay, so fread managed to avoid the delimiter problem. That’s good news. Now to save it back to disk with a different delimiter; I’ll use |. (Note: I could have tried sparklyr::copy_to() to copy the dataframe directly into Spark, but in my experience that has major performance issues.)

fwrite(train, file = 'data/train_fix.txt', sep = '|')
train_s <- spark_read_csv(sc, 'data/train_fix.txt', delimiter = '|')
train_s %>% head()
## # Source: spark<?> [?? x 11]
##   event_id game_session timestamp           event_data installation_id
##   <chr>    <chr>        <dttm>              <chr>      <chr>          
## 1 27253bdc 45bb1e1b6b5~ 2019-09-06 17:53:46 "\"{\"\"\~ 0001e90f       
## 2 27253bdc 17eeb7f2236~ 2019-09-06 17:54:17 "\"{\"\"\~ 0001e90f       
## 3 77261ab5 0848ef14a8d~ 2019-09-06 17:54:56 "\"{\"\"\~ 0001e90f       
## 4 b2dba42b 0848ef14a8d~ 2019-09-06 17:54:56 "\"{\"\"\~ 0001e90f       
## 5 1bb5fbdb 0848ef14a8d~ 2019-09-06 17:55:03 "\"{\"\"\~ 0001e90f       
## 6 1325467d 0848ef14a8d~ 2019-09-06 17:55:06 "\"{\"\"\~ 0001e90f       
## # ... with 6 more variables: event_count <int>, event_code <int>,
## #   game_time <int>, title <chr>, type <chr>, world <chr>

Alright! It looks like the columns are all there now. event_data still has some problems: specifically, it’s not properly formatted JSON (note the doubled quotes), but it’s at least in a consistent format that can be wrangled into something useful now.
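
When the time comes to wrangle it, the doubled quotes follow a consistent enough pattern that a cleaning pass might look something like this (a sketch using Spark SQL’s regexp_replace through a dplyr mutate, which sparklyr translates for me; I haven’t run this against the full data):

train_s <- train_s %>%
  mutate(event_data = regexp_replace(event_data, '""', '"'))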

Now to save this to a local file in the parquet file format, for easier reading back into Spark.

spark_write_parquet(train_s, 'data/train.parquet')

So in the future I can just call spark_read_parquet() to read in the file.

train_s <- spark_read_parquet(sc, 'data/train.parquet')
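
(Parquet is a columnar format that stores the schema alongside the data, so the column types inferred above survive the round trip, and reads are much faster than re-parsing a delimited text file.)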

➡ Go to Part 2