2019 Data Science Bowl - Part 3 - Exploring ‘Game’ data


OK, time to explore some of the data! In Part 2 I worked through some of the challenges with parsing the JSON data. In this part I'll start to explore the data associated with the “Game” event type (and, apparently, also get side-tracked into fixing the extract_json function :)

load_quietly <- function(package) { # Quietly load package
  suppressWarnings(suppressMessages(library(deparse(substitute(package)), character.only=TRUE))) 
}
load_quietly(tidyverse)
load_quietly(data.table)
load_quietly(jsonlite)
load_quietly(sparklyr)
load_quietly(DT)

source("R/extract_json.R")

Start the local Spark session and load the training data.

sc <- spark_connect(master = "local")
train_s <- spark_read_parquet(sc, 'data/train.parquet', memory = FALSE)

Event types

First, there are 4 different types of events: Clip, Activity, Game, and Assessment.
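
To double-check, the distinct values can be pulled straight from the Spark table:

train_s %>% 
  distinct(type)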

Let’s look a bit at the Game type.

Games

train_s %>% 
  filter(type == 'Game') %>%
  head
## Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.
## Please use `eval_tidy()` with a data mask instead.
## This warning is displayed once per session.
## Warning: `overscope_clean()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
## # Source: spark<?> [?? x 11]
##   event_id game_session timestamp           event_data installation_id
##   <chr>    <chr>        <dttm>              <chr>      <chr>          
## 1 6d90d394 f3bd0e36349~ 2019-08-04 15:18:35 "\"{\"\"\~ eca4e454       
## 2 7040c096 f3bd0e36349~ 2019-08-04 15:18:37 "\"{\"\"\~ eca4e454       
## 3 5a848010 f3bd0e36349~ 2019-08-04 15:18:39 "\"{\"\"\~ eca4e454       
## 4 c1cac9a2 f3bd0e36349~ 2019-08-04 15:18:40 "\"{\"\"\~ eca4e454       
## 5 dcaede90 f3bd0e36349~ 2019-08-04 15:18:42 "\"{\"\"\~ eca4e454       
## 6 26fd2d99 f3bd0e36349~ 2019-08-04 15:18:42 "\"{\"\"\~ eca4e454       
## # ... with 6 more variables: event_count <int>, event_code <int>,
## #   game_time <int>, title <chr>, type <chr>, world <chr>

Game titles

There are 11 game titles. In the analysis it will be important to analyze these separately. I wonder if they are tracking different skills? It will be interesting to see, when examining the Assessments, whether there's a 1:1 mapping from Game to Assessment.

train_s %>% 
  filter(type == 'Game') %>%
  distinct(title)
## # Source: spark<?> [?? x 1]
##    title           
##    <chr>           
##  1 Air Show        
##  2 Scrub-A-Dub     
##  3 Dino Dive       
##  4 Crystals Rule   
##  5 All Star Sorting
##  6 Pan Balance     
##  7 Bubble Bath     
##  8 Chow Time       
##  9 Happy Camel     
## 10 Dino Drink      
## # ... with more rows
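
Only 10 of the 11 titles print above, so here's a quick count to confirm the total (n_distinct should translate to COUNT(DISTINCT) in Spark SQL):

train_s %>% 
  filter(type == 'Game') %>%
  summarise(n_titles = n_distinct(title))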

Looking at a game session

In the analysis it will be important to analyze game sessions as a unit. Let’s look at a single game session.

train_s %>%
  filter(type == 'Game') %>%
  head() %>% 
  select(game_session)
## # Source: spark<?> [?? x 1]
##   game_session    
##   <chr>           
## 1 f3bd0e3634993d85
## 2 f3bd0e3634993d85
## 3 f3bd0e3634993d85
## 4 f3bd0e3634993d85
## 5 f3bd0e3634993d85
## 6 f3bd0e3634993d85
train_s %>%
  filter(game_session == 'f3bd0e3634993d85') %>%
  collect(.) %>%
  head(25) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 3, scrollX = '400px'))

So this is a look at a single game session. The session is unique to a single installation_id, which is our user identifier. It contains 635 events, and those events occurred in a sequence defined by event_count. Time is kept in milliseconds in game_time, and a timestamp is also available in timestamp. Some of the events occurred simultaneously with others (for example, look at rows 5-7).
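
To summarize those observations in one query (a sketch; I'm assuming max(game_time) approximates the session's total play time in milliseconds):

train_s %>%
  filter(game_session == 'f3bd0e3634993d85') %>%
  summarise(n_events = n(),
            total_ms = max(game_time, na.rm = TRUE))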

It’s interesting that both millisecond time and actual timestamp time are provided. I can imagine scenarios where the timestamp data could be useful: for example, looking at the time between game sessions, or the time of day during which the game sessions took place (daytime, evening, lunch time). I’ll put a pin in these for later.
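
As a rough sketch of what those timestamp features might look like (assuming Spark SQL's hour() passes through sparklyr's translation, as unrecognized functions normally do):

train_s %>%
  group_by(installation_id, game_session) %>%
  summarise(session_start = min(timestamp, na.rm = TRUE)) %>%
  mutate(start_hour = hour(session_start))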

Game session event_data

In Part 2 I figured out how to extract the JSON data and wrote a function (extract_json()) for re-use. This means I can look at the features of game events. Let’s try extracting the JSON data in this game session.

train_s %>%
  filter(game_session == 'f3bd0e3634993d85') -> tmp

extract_json(tmp) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 3, scrollX = '400px'))

Will it break if I try running it over the entire game session? Let’s see.

tryCatch({
    extract_json(tmp, nrows = sdf_nrow(tmp))
}, error = function(x) print(x[[1]]))
## [1] "arguments imply differing number of rows: 1, 3, 2"

The answer is yes. Ouch. But why?

I didn’t expect the function to scale all that well, but surely it can handle fewer than 1,000 rows of data? I wonder if it’s tripping up at a certain row because it’s encountering something in the JSON that I wasn’t anticipating, and whether that can be fixed. Let’s run it in smaller batches to identify where it breaks down.

for(i in seq(from = 1, to = sdf_nrow(tmp), by = 25)){
  cat("\f")
  cat(i)
  
  look_ahead <- i + 25
  
  tmp %>%
    sdf_with_sequential_id(id = "index") %>%
    filter(index >= i, index <= look_ahead) -> tmp2
  
  tryCatch({
    extract_json(tmp2, nrows = 25)
  }, error = function(x) x[[1]]) -> x
  
  # On success, x is a data frame, so only compare when it's an error string
  if(is.character(x) && x == "arguments imply differing number of rows: 1, 3, 2"){
    print(x)
    break
  }
  
}

51[1] "arguments imply differing number of rows: 1, 3, 2"

The code above breaks on the batch starting at row 51 (the third pass through the loop), which covers rows 51 to 76. What is it about these rows? Next I’ll go row by row to see exactly which row is causing the problem.

tmp %>%
  sdf_with_sequential_id(id = "index") %>%
  filter(index >= 51, index <= 76) -> tmp39

for(i in 1:25){
  cat("\f")
  cat(i)
  
  tmp39 %>%
    select(-index) %>%
    sdf_with_sequential_id(id = "index") %>%
    filter(index == i) -> tmp3
  
  tryCatch({
    extract_json(tmp3, nrows = 1)
  }, error = function(x) x[[1]]) -> x
  
  # Same guard as above: only compare when x is an error string
  if(is.character(x) && x == "arguments imply differing number of rows: 1, 3, 2"){
    print(x)
    break
  }
  
}
13[1] "arguments imply differing number of rows: 1, 3, 2"

This time it breaks on the 13th iteration. Lucky 13! Let’s see what’s going on there.

tmp39 %>%
  select(-index) %>%
  sdf_with_sequential_id(id = "index") %>%
  filter(index == 13) -> lucky13
lucky13
## # Source: spark<?> [?? x 12]
##   event_id game_session timestamp           event_data installation_id
##   <chr>    <chr>        <dttm>              <chr>      <chr>          
## 1 26fd2d99 f3bd0e36349~ 2019-08-04 15:19:43 "\"{\"\"\~ eca4e454       
## # ... with 7 more variables: event_count <int>, event_code <int>,
## #   game_time <int>, title <chr>, type <chr>, world <chr>, index <dbl>

Is the JSON malformed?

lucky13 %>% 
  collect(.) %>%
  select(event_data) %>%
  as.character() %>%
  gsub('\"\"\"\"', '"', ., fixed=T) %>% 
  gsub('\"{', '{', ., fixed=T) %>% 
  gsub('}\"', '}', ., fixed=T) -> json_unescaped

json_unescaped
## [1] "{\"round_target\":{\"size\":4,\"type\":\"Tub\",\"animal\":\"cow\"},\"options\":[1,4,3],\"animals\":[\"cow\",\"hog\"],\"level\":4,\"round\":8,\"event_count\":64,\"game_time\":68227,\"event_code\":2020}"
fromJSON(json_unescaped)
## $round_target
## $round_target$size
## [1] 4
## 
## $round_target$type
## [1] "Tub"
## 
## $round_target$animal
## [1] "cow"
## 
## 
## $options
## [1] 1 4 3
## 
## $animals
## [1] "cow" "hog"
## 
## $level
## [1] 4
## 
## $round
## [1] 8
## 
## $event_count
## [1] 64
## 
## $game_time
## [1] 68227
## 
## $event_code
## [1] 2020

It’s not malformed, but some of the elements (e.g., options and animals) are arrays, which I wasn’t expecting. I bet that’s what’s tripping it up, because the JSON isn’t truly “flattened”.
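
In fact, the error can be reproduced in miniature, outside of Spark entirely: data.frame() refuses to combine vectors whose lengths can't be recycled, and the lengths here (1, 3, and 2) match the error message exactly.

# Throws "arguments imply differing number of rows: 1, 3, 2"
data.frame(level = 4, options = c(1, 4, 3), animals = c("cow", "hog"))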

I think I can just replace the array brackets [ and ] with quotes, turning each array into a comma-separated string.

[2 hours pass…]

The hard part was replacing the " characters when they appeared in square brackets, but not when they appeared elsewhere. I couldn’t get a RegEx to work on my own, so I asked on StackOverflow.

json_unescaped %>%
  gsub('(?:\\G(?!\\A)|\\[)[^]"]*\\K"', "", ., perl = T) %>% # Credit: https://stackoverflow.com/a/58633003/2051312
  gsub('[', '"', ., fixed = T) %>%
  gsub(']', '"', ., fixed = T) %>%
  fromJSON()
## $round_target
## $round_target$size
## [1] 4
## 
## $round_target$type
## [1] "Tub"
## 
## $round_target$animal
## [1] "cow"
## 
## 
## $options
## [1] "1,4,3"
## 
## $animals
## [1] "cow,hog"
## 
## $level
## [1] 4
## 
## $round
## [1] 8
## 
## $event_count
## [1] 64
## 
## $game_time
## [1] 68227
## 
## $event_code
## [1] 2020

Notice how animals and options are now comma-separated strings. This is what we want. Now it should work in the context of the JSON extraction function. I’ll add it now and then see if it works.

# extract_json_2 
# spark_df: the spark dataframe
# nrows: the number of rows to return
#
# Now with nested json flattening!
extract_json_2 <- function(spark_df, nrows = 5){
  
  spark_df %>%
    head(nrows) %>%
    collect(.) -> tmp_df
  
  spark_df %>%
    head(nrows) %>%
    spark_apply(function(df) {
      library(dplyr)
      
      remove_double_escape <- function(e){
          gsub('\"\"\"\"', '"', e, fixed=T) %>% 
          gsub('\"{', '{', ., fixed=T) %>% 
          gsub('}\"', '}', ., fixed=T) %>%
          gsub('(?:\\G(?!\\A)|\\[)[^]"]*\\K"', "", ., perl = T) %>% # Credit: https://stackoverflow.com/a/58633003/2051312
          gsub('[', '"', ., fixed = T) %>%
          gsub(']', '"', ., fixed = T)
      }
      
      df %>% mutate(tmp = remove_double_escape(event_data)) 
      
    }) %>%
    collect(.) %>%
    rowwise() %>%
    do(data.frame(fromJSON(.$tmp, flatten = T))) %>%
    bind_cols(tmp_df, .) %>%
    select(-event_data)
}

Now that I’ve fixed the function, can I run it over the entire game session?

extract_json_2(tmp, nrows = sdf_nrow(tmp)) -> tmp2
tmp2 %>% 
  head(20) %>%
  datatable(., rownames = FALSE, options = list(pageLength = 3, scrollX = '400px'))

Great! Now, what can I see about the game session now that I have it all available? Well, the first thing I notice is that some of the events are flagged as correct or as misses. This tells me that the variables I’ll need to score performance are inside the JSON. The folks who prepared this dataset didn’t want to make it too easy, did they?
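
When I get to feature engineering, scoring a session could look something like this (a hypothetical sketch: I'm assuming the extracted frame exposes a logical correct column, with NA for events that aren't attempts, which still needs to be verified):

tmp2 %>%
  filter(!is.na(correct)) %>%   # keep only scored attempts
  summarise(attempts = n(),
            accuracy = mean(correct))   # proportion correct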

Well that's enough for now. More exploration to come!

➡ Go to Part 4